X's in the predicted amino acid sequences

Post by **katharina** » Wed Nov 18, 2015 6:36 pm

Originally posted by Sunghee in the old forum on 04.05.2012 - 12:11

In the output of Augustus I have X's in the amino acid sequences.
E.g. SADXXXXXXXXXXXGELD....
Could you tell me what they are?

Post by **katharina** » Wed Nov 18, 2015 6:37 pm

Originally posted by Mario in the old forum on 04.05.2012 - 12:16

They stem from n's in the genome assembly.
On a hard-masked genome or when there are assembly gaps, the genome contains unknown nucleotides (N or n).
These are by default considered to be somewhat less likely to be in coding regions, but they still can be predicted to be in coding regions.
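As a rough illustration of how n's turn into X's, here is a minimal Python sketch, not AUGUSTUS code. It translates a coding sequence with the standard genetic code and emits X for any codon that is not a plain A/C/G/T triplet. The function name `translate` and the example sequence are made up for this illustration, and treating partially unknown codons (e.g. ANG) as X is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate a coding sequence; codons with unknown bases become 'X'.

    Hypothetical helper for illustration only (assumption: any codon
    that is not a pure A/C/G/T triplet is reported as X).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    protein = []
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


# Made-up example: three known codons, a gap of 33 n's, four known codons.
example = "TCTGCTGAT" + "n" * 33 + "GGTGAACTTGAT"
print(translate(example))
```

Each run of three n's in the predicted coding sequence gives one X, so a long assembly gap inside a predicted exon shows up as a long stretch of X's. How often such n's end up inside predicted coding regions depends on the model settings.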
This is influenced by the parameter

/Constant/probNinCoding

which defaults to 0.23 for most species.

This compares to a model probability of 0.25 for seeing an n in a non-coding region. If you decrease probNinCoding, you will see fewer and fewer stretches of XXX in the predictions. You may set it to 0, but then you will mispredict any gene whose coding region actually contains n's.
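To get a feeling for the numbers, here is a back-of-the-envelope Python sketch. It is a simplification and not how AUGUSTUS computes its scores: it treats every n independently and ignores all other terms of the gene model. The function name `log_odds_coding_vs_noncoding` is hypothetical; only the values 0.23 and 0.25 come from the explanation above.

```python
import math

P_N_CODING = 0.23     # default of /Constant/probNinCoding quoted above
P_N_NONCODING = 0.25  # model probability of an n in non-coding sequence


def log_odds_coding_vs_noncoding(num_n: int, p_coding: float = P_N_CODING) -> float:
    """Natural-log odds (coding vs. non-coding) contributed by num_n n's.

    Simplifying assumption: each n contributes independently and nothing
    else in the model is taken into account.
    """
    if num_n == 0:
        return 0.0
    if p_coding <= 0.0:
        # With probability 0, a coding region containing n's is impossible.
        return float("-inf")
    return num_n * math.log(p_coding / P_N_NONCODING)


# Compare the default with a lower value and with 0 for a gap of 33 n's.
for p in (0.23, 0.1, 0.0):
    print(p, log_odds_coding_vs_noncoding(33, p))
```

With the default, each n only mildly favours non-coding, so a gap can still be bridged by a coding exon if the rest of the evidence is strong. Lower values make that increasingly unlikely, and 0 rules it out completely.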
