X's in the predicted amino acid sequences

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

katharina
Site Admin
Posts: 530
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald

Post by katharina »

Originally posted by Sunghee in the old forum on 04.05.2012 - 12:11

In the output of Augustus I have X's in the amino acid sequences.
E.g. SADXXXXXXXXXXXGELD....
Could you tell me what these are?

Re: X's in the predicted amino acid sequences

Post by katharina »

Originally posted by Mario in the old forum on 04.05.2012 - 12:16

They stem from n's in the genome assembly.
On a hard-masked genome or when there are assembly gaps, the genome contains unknown nucleotides (N or n).
These are by default considered to be somewhat less likely to be in coding regions, but they still can be predicted to be in coding regions.
This is influenced by the parameter

/Constant/probNinCoding

which defaults to 0.23 for most species.

By comparison, the model assigns a probability of 0.25 to seeing an n in a non-coding region. If you decrease /Constant/probNinCoding, you will see fewer and fewer stretches of XXX in the predictions. You may set it to 0, but then you will mispredict any gene whose coding region actually contains n's.
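
To get a feeling for the numbers, here is a rough back-of-the-envelope sketch in Python. It is not AUGUSTUS code: it assumes, as a simplification, that every n contributes independently, so a run of n's multiplies the coding-versus-non-coding likelihood ratio by (probNinCoding / 0.25) once per base. The function name n_run_ratio and the run length of 33 (eleven X's) are made up for the illustration.

```python
# Simplified illustration only; AUGUSTUS's real scoring involves the full gene model.
P_N_NONCODING = 0.25  # model probability of an n in a non-coding region (from the post)


def n_run_ratio(num_n: int, prob_n_in_coding: float = 0.23) -> float:
    """Likelihood ratio (coding vs. non-coding) contributed by a run of num_n unknown bases.

    Assumes each n is scored independently, which is a simplification.
    """
    return (prob_n_in_coding / P_N_NONCODING) ** num_n


if __name__ == "__main__":
    # Compare the default with a lowered value and with 0 for a run of 33 n's.
    for p in (0.23, 0.1, 0.0):
        print(f"probNinCoding={p}: ratio for 33 n's = {n_run_ratio(33, p):.3g}")
```

Under this simplification, the default of 0.23 penalizes each n in a coding region only slightly, so short runs of n's inside a gene are still tolerated, while a value of 0 makes the ratio exactly zero for any coding region containing even a single n.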