Augustus [datasets]

comparative Augustus

Drosophila clade:

Files for the manuscript "Simultaneous Gene Finding in Multiple Genomes", which was presented at the German Conference on Bioinformatics in September 2015.

flies12way.hal (whole-genome alignment of the 12 Drosophila species in hal format)
flies12way_tree.nwk (tree of the 12 Drosophila species in Newick format)
hints.gff.gz (intron and exon part hints derived from RNA-Seq data for d.mel, d.sim, d.pse and d.vir)
extrinsic.cfg (extrinsic config file)
cgp_parameters.cfg (parameter file, include this with command line option --optCfgFile=cgp_parameters.cfg)
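As a quick sanity check on hints.gff.gz after download, the hints can be tallied by type. The following is a minimal sketch, assuming the file is ordinary tab-separated GFF with nine columns and the hint type in the third column; the function names are hypothetical and not part of AUGUSTUS, and the exact type labels used in the file may differ from those in the usage note.

```python
import gzip
from collections import Counter


def count_hint_types(lines):
    """Count GFF records by the value of the feature column (column 3)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumption: a valid GFF record has at least 9 tab-separated columns.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


def count_hint_types_in_file(path):
    """Open a gzipped GFF file in text mode and count its records by type."""
    with gzip.open(path, "rt") as handle:
        return count_hint_types(handle)
```

Calling count_hint_types_in_file("hints.gff.gz") returns a Counter keyed by whatever type labels occur in the file, which should correspond to the intron and exon part hints described above.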

Augustus comparative gene prediction for Drosophila melanogaster
(using RNA-Seq evidence for d.mel, d.sim, d.pse and d.vir)
d_melanogaster.gff.gz (annotation in gff format)
d_melanogaster.fa.gz (coding sequences in fasta format)
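To inspect the coding sequences in d_melanogaster.fa.gz without unpacking the file, a small reader along the following lines may be enough. This is a sketch under the assumption that the file is standard multi-line FASTA; the helper names are hypothetical.

```python
import gzip


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta_file(path):
    """Return (number of sequences, total length) for a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        lengths = [len(seq) for _, seq in read_fasta(handle)]
    return len(lengths), sum(lengths)
```

summarize_fasta_file("d_melanogaster.fa.gz") then gives the number of predicted coding sequences and their combined length in bases.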

vertebrate clade:

The vertebrate predictions were generated on a 12-way alignment of human (hg38), rhesus (rheMac3), mouse (mm10), rat (rn6), rabbit (oryCun2), dog (canFam3), cow (bosTau8), armadillo (dasNov3), elephant (loxAfr3), tenrec (echTel2), opossum (monDom5) and chicken (galGal4) extracted from the UCSC MultiZ 100-way alignment. For human, rhesus, mouse and chicken, paired-end RNA-Seq reads from the Sequence Read Archive were mapped to the corresponding genomes and used as hints for gene finding. Furthermore, CDS and intron hints were generated from the RefSeq annotation for human (coding genes only) and incorporated for annotation transfer from human to the other 11 vertebrates.



The following sequence files were used to train AUGUSTUS or to test its accuracy. Some of the datasets are described in the paper “Gene Prediction with a Hidden Markov Model and a new Intron Submodel”, which was presented at the European Conference on Computational Biology in September 2003 and appeared in the proceedings.

Test sets:



178 single-gene short human sequences (gzipped genbank format)


semi artificial genomic sequences from Guigo et al.: (gzipped genbank format)
sag178.fa.gz (gzipped fasta format)
sag178.gff (annotation in gff format)



100 single gene sequences from FlyBase: (gzipped Genbank format)


A 2.9 Mb long sequence from the Drosophila adh region (copied from the GASP dataset page)
adh.fa.gz (gzipped fasta format)
adh.std1.gff_corrected (gff format)
adh.std1+3.gff (gff format)

Arabidopsis thaliana:

Araset. 74 sequences with 168 genes. (gzipped genbank format)

Training sets:


single gene sequences from genbank (1284 genes): (gzipped genbank format)

11739 human splice sites, originally from Guigó et al., but filtered for similarities to h178, sag178:
splicesites.gz (gzipped flat file)


320 single gene sequences from FlyBase, disjoint with fly100: (gzipped genbank format)

400 single gene sequences from FlyBase, disjoint with adh122: (gzipped genbank format)


249 single gene sequences obtained by deleting the sequences from the Araball set which overlap with the sequences from Araset: (gzipped Genbank format)

Coprinus cinereus (a fungus):

851 single gene sequences predicted by genewise and compiled by Jason Stajich. 261 genes are complete, 590 genes are incomplete at the 3' end. Genes redundant with those in the Genbank annotations were deleted: (gzipped Genbank format)

91 sequences containing 93 genes from Genbank. Genes in Genbank with nothing other than the coding sequence were omitted. Identical or extremely similar genes in Genbank were used only once. This set was first used as a test set for the training set above. The Coprinus version here used : (gzipped Genbank format)
