Accuracy of AUGUSTUS


AUGUSTUS is used in many genome annotation projects. Below are some accuracy values in comparison to other programs. As accuracy measures we use sensitivity (Sn) and specificity (Sp). For a feature (coding base, exon, transcript, gene), the sensitivity is defined as the number of correctly predicted features divided by the number of annotated features. The specificity is the number of correctly predicted features divided by the number of predicted features. A predicted exon is considered correct if both of its splice sites are at the annotated positions of an exon. A predicted transcript is considered correct if all exons are correctly predicted and there are no additional exons that are not in the annotation. A predicted gene is considered correct if any of its transcripts is correct, i.e. if at least one isoform of the gene is exactly as annotated in the reference annotation.
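The definitions above can be illustrated with a minimal Python sketch. It is not the evaluation code used in any of the assessments below. The `Exon` tuple, the helper names and the toy coordinates are all hypothetical, and the base level is left out. Exons are compared by exact boundaries, transcripts by their exact exon set, and a predicted gene counts as correct if at least one of its transcripts matches an annotated transcript.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Exon(NamedTuple):
    seq: str     # sequence name (hypothetical field layout)
    strand: str  # "+" or "-"
    start: int   # first boundary of the exon
    end: int     # second boundary of the exon


# A transcript is identified by its exact set of exons.
Transcript = FrozenSet[Exon]
Gene = Set[Transcript]


def sn_sp(predicted: set, annotated: set) -> Tuple[float, float]:
    """Sensitivity and specificity for two sets of comparable features."""
    correct = len(predicted & annotated)
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


def transcripts_of(genes: List[Gene]) -> Set[Transcript]:
    """All distinct transcripts in a list of genes."""
    return set().union(*genes)


def exons_of(genes: List[Gene]) -> Set[Exon]:
    """All distinct exons in a list of genes."""
    return {exon for gene in genes for tx in gene for exon in tx}


def gene_sn_sp(predicted: List[Gene], annotated: List[Gene]) -> Tuple[float, float]:
    """A predicted gene is correct if any of its transcripts is annotated."""
    annotated_tx = transcripts_of(annotated)
    correct = sum(1 for gene in predicted if gene & annotated_tx)
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


# Toy data (made up for illustration only).
A = Exon("chr1", "+", 100, 200)
B = Exon("chr1", "+", 300, 400)
C = Exon("chr1", "+", 500, 600)
C_long = Exon("chr1", "+", 500, 650)   # wrong second boundary
D = Exon("chr1", "+", 800, 900)
E = Exon("chr1", "+", 1000, 1100)      # exon that is not annotated

annotated_genes: List[Gene] = [
    {frozenset({A, B, C}), frozenset({A, C})},   # gene with two isoforms
    {frozenset({D})},                            # single-exon gene
]
predicted_genes: List[Gene] = [
    {frozenset({A, B, C_long}), frozenset({A, C})},
    {frozenset({D, E})},
]

exon_sn, exon_sp = sn_sp(exons_of(predicted_genes), exons_of(annotated_genes))
tx_sn, tx_sp = sn_sp(transcripts_of(predicted_genes), transcripts_of(annotated_genes))
g_sn, g_sp = gene_sn_sp(predicted_genes, annotated_genes)

print(f"exon       Sn={exon_sn:.2f} Sp={exon_sp:.2f}")
print(f"transcript Sn={tx_sn:.2f} Sp={tx_sp:.2f}")
print(f"gene       Sn={g_sn:.2f} Sp={g_sp:.2f}")
```

On this toy data the exon level gives Sn 1.00 and Sp 0.67, the transcript level gives Sn 0.33 and Sp 0.33, and the gene level gives Sn 0.50 and Sp 0.50. The first predicted gene is correct because one of its two isoforms matches the annotation exactly, even though the other isoform has a wrong exon boundary.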

Accuracy results from the rGASP Assessment (round 2) using RNA-Seq


The complete accuracy statistics are available on a page from the Computational Genomics Lab, Barcelona.

[Accuracy plots, each captioned "CDS level": human coding exon level, human coding transcript level, fly coding exon level, fly gene level, worm coding exon level.]
The accuracy plots above are from Josep Abril, Computational Genomics Lab. Our AUGUSTUS predictions are labelled Mar.*. The worst-performing AUGUSTUS prediction sets (there are 3 sets each in human and worm, and 2 sets in fly) are ab initio predictions and do not use any RNA-Seq at all. Other participant codes are here.

Accuracy results from the nGASP Assessment


Accuracy results from recent nGASP assessment on C. elegans: transcript-based
program       base          exon          transcript    gene
              Sn     Sp     Sn     Sp     Sn     Sp     Sn     Sp
AUGUSTUS      99.0   90.5   92.5   80.2   68.3   47.1   80.1   51.8
Fgenesh++     97.6   89.7   90.4   80.9   65.5   53.4   78.3   54.2
MGENE         98.7   91.9   91.0   80.6   57.7   48.0   70.6   51.1
EUGENE        98.5   85.1   92.1   70.3   60.8   31.5   68.8   36.1
ExonHunter    93.7   92.0   81.2   76.9   37.2   39.7   45.6   40.5
Gramene       98.2   95.4   88.5   71.8   41.7   19.6   48.7   37.2
MAKER         92.9   88.5   80.7   66.3   41.3   19.6   50.7   47.6
Above accuracy values are taken from Coghlan et al. (2008): nGASP: the nematode genome annotation assessment project.

Accuracy results from recent nGASP assessment on C. elegans: ab initio
program       base          exon          transcript    gene
              Sn     Sp     Sn     Sp     Sn     Sp     Sn     Sp
AUGUSTUS      97.0   89.0   86.1   72.6   50.1   28.7   61.1   38.4
Fgenesh       98.2   87.1   86.4   73.6   47.1   34.6   57.8   35.4
GeneMark.hmm  98.3   83.1   83.2   65.6   37.7   24.0   46.3   24.5
MGENE         97.2   91.5   84.6   78.6   44.6   40.9   54.8   42.3
GeneID        93.9   88.2   77.0   68.6   36.2   22.8   44.4   25.1
Agene         93.8   83.4   68.9   61.1   9.8    13.1   12.0   14.1
CRAIG         95.6   90.9   80.2   78.2   35.7   36.3   43.8   37.8
EUGENE        94.0   89.5   80.3   73.0   49.1   28.8   60.2   30.2
ExonHunter    95.4   86.0   72.6   62.5   15.5   18.6   19.1   19.2
GlimmerHMM    97.6   87.6   84.4   71.4   47.3   29.3   58.0   30.6
SNAP          94.0   84.5   74.6   61.3   32.6   18.6   40.0   19.1
Above accuracy values are taken from Coghlan et al. (2008): nGASP: the nematode genome annotation assessment project.

Accuracy results from the EGASP Assessment


Accuracy results on human ENCODE regions (ab initio)
                          AUGUSTUS   GENSCAN    GENEID     GENEMARK   GENEZILLA
base level sensitivity    78.65%     84.17%     76.77%     76.09%     87.56%
base level specificity    75.29%     60.60%     76.48%     62.94%     50.93%
exon level sensitivity    52.39%     58.65%     53.84%     48.15%     62.08%
exon level specificity    62.93%     46.37%     61.08%     47.25%     50.25%
gene level sensitivity    24.32%     15.54%     10.47%     16.89%     19.59%
gene level specificity    17.22%     10.13%     8.78%      7.91%      8.84%
Above accuracy values are taken from Guigó et al. (2006): EGASP: the human ENCODE Genome Annotation Assessment Project.

Accuracy results on fruit fly data set adh222
long Drosophila sequence          Program
                                  AUGUSTUS   GENEID     GENIE
base level sensitivity (std1)     98%        96%        96%
base level specificity (std3)     93%        92%        92%
exon level sensitivity (std1)     86%        71%        70%
exon level specificity (std3)     66%        62%        57%
gene level sensitivity (std1)     71%        47%        40%
gene level specificity (std3)     39%        33%        29%

Accuracy results on Arabidopsis data set araset
multi-gene sequences      Program
                          AUGUSTUS
base level sensitivity    97%
base level specificity    72%
exon level sensitivity    89%
exon level specificity    70%
gene level sensitivity    62%
gene level specificity    39%

adh222 is a single sequence of Drosophila melanogaster, 2.9 Mb long.
There are two sets of annotations. The first, smaller set, called std1, was chosen so that the genes in it are likely to be correctly annotated, and the second, larger set, called std3, was chosen to be as complete as possible.
This dataset and its annotation were taken from here.

In the corrected version, std1 contains 38 genes with a total of 111 exons and std3 contains 222 genes with a total of 909 exons.
The genes lie on both strands.
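
The row labels of the adh222 table suggest that sensitivity is measured against std1 (the reliable set) and specificity against std3 (the more complete set). The following Python sketch shows why that pairing is reasonable. It is an illustration under assumptions, not the original evaluation procedure: exons are plain (start, end) tuples, the coordinates are invented, and the toy std3 is built as a superset of std1, which the text above does not state.

```python
def sensitivity(predicted: set, reference: set) -> float:
    """Fraction of reference features that were predicted exactly."""
    return len(predicted & reference) / len(reference) if reference else 0.0


def specificity(predicted: set, reference: set) -> float:
    """Fraction of predicted features that are found in the reference."""
    return len(predicted & reference) / len(predicted) if predicted else 0.0


# Toy exon coordinates, invented for illustration.
std1 = {(100, 200), (300, 400)}                 # small, reliable set
std3 = std1 | {(500, 600), (700, 800)}          # larger set (superset by assumption)
predicted = {(100, 200), (500, 600), (900, 950)}

sn_std1 = sensitivity(predicted, std1)   # missed reliable exons lower this value
sp_std3 = specificity(predicted, std3)   # only exons absent from std3 count as false
sp_std1 = specificity(predicted, std1)   # for comparison: penalizes (500, 600) as well

print(f"Sn (std1) = {sn_std1:.2f}")
print(f"Sp (std3) = {sp_std3:.2f}")
print(f"Sp (std1) = {sp_std1:.2f}")
```

In this toy case, specificity against std1 would be 0.33 instead of 0.67, because the exon (500, 600) is annotated in std3 but missing from the smaller std1. Measuring specificity against the more complete set avoids counting such predictions as false positives.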

araset is a set of 74 multi-gene sequences with 168 genes of Arabidopsis thaliana. The specificity is likely to be underestimated because there are sometimes genes at the boundaries of a sequence that are not annotated.

The datasets can be downloaded.