AUGUSTUS Prediction Tutorial
This website explains step by step how to use the AUGUSTUS prediction web server application to predict genes in a genomic sequence. A similar tutorial on how to train AUGUSTUS parameters is available here (click).
Functionalities of the AUGUSTUS prediction web server application are (with a single run):
1 - Job submission in general
1.1 - Finding the prediction submission form
1.2 - Filling in general job data
1.2.1 - E-mail address
1.2.2 - AUGUSTUS species parameters
1.2.2.1 - Uploading an archive file
1.2.2.2 - Project identifier
1.2.2.3 - What the AUGUSTUS species parameters are used for
1.2.3 - Genome file
1.2.3.1 - Genome file format
1.2.3.2 - Genome file upload options
1.2.3.3 - What the genome file is used for
1.3 - Optional fields
1.3.1 - cDNA file
1.3.1.1 - cDNA file format
1.3.1.2 - cDNA file upload options
1.3.1.3 - What cDNA files are used for
1.3.2 - Hints file
1.3.2.1 - Hints file format
1.3.2.2 - What hints files are used for
1.3.3 - UTR prediction
1.3.4 - Strand specific prediction
1.3.5 - Alternative transcripts
1.3.6 - Allowed gene structure
1.4 - Verification that you are a human
1.5 - The submit button
1.6 - Example data files
2 - What happens after submission
2.1 - Submission duplication
2.2 - Errors during prediction
3 - Prediction Results
The pipeline invoked by submitting a job to the AUGUSTUS prediction web server application is straightforward. If a cDNA file is supplied, hints are first generated from this cDNA file. If no cDNA file is supplied, AUGUSTUS is immediately called with the specified parameters.
The input fields of the AUGUSTUS prediction web server application form are: E-mail, AUGUSTUS species parameters, Genome file, cDNA file, Hints file and a number of options in the form of checkboxes.
In the following, you find detailed instructions for submitting an AUGUSTUS prediction job.
You find the AUGUSTUS prediction submission form by clicking on the following link in the left side navigation bar:
This section describes all fields that should be filled in for every job submission, i.e. fields that are obligatory (except for the e-mail address, which is optional but strongly recommended).
First, we recommend that you enter a valid e-mail address:
It is possible to run AUGUSTUS without giving an e-mail address, but here are some reasons why we recommend supplying one:
We use your e-mail address for the following purposes:
We do not use your e-mail address to send you any spam, e.g. about web service updates. We do not share your e-mail address with any third parties.
Job submission without giving an e-mail address is possible but discouraged for large input files.
The web server application offers you three options to specify which parameter set you want to use for predicting genes with AUGUSTUS. You can upload a *.tar.gz parameter archive from your local hard drive, specify the job ID of a previously finished AUGUSTUS web server application training run, or select a pre-trained parameter set through the drop-down menu.
A *.tar.gz archive with a folder containing the following files is required for predicting genes in a new genome with pre-trained parameters:
where species is replaced by the name of the species you trained AUGUSTUS for (e.g. carrot would result in carrot/carrot_parameters.cfg). The additional species before the slash means that all those files must reside in a directory that is called species before you tar and gzip it. If you simply tar and gzip the folder that contains the parameters of an AUGUSTUS training run, everything should work fine.
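The folder layout described above can be produced with a short script. The following is a minimal sketch in Python, not part of the web service; the function name pack_species_parameters is an assumption made for illustration. It only ensures that the species folder itself becomes the top-level entry of the archive and does not check which parameter files are inside.

```python
import tarfile
from pathlib import Path
from typing import List


def pack_species_parameters(species_dir: str, archive_path: str) -> List[str]:
    """Pack a species folder into a .tar.gz archive and return the archive's member names.

    Hypothetical helper: the folder name (e.g. "carrot") is kept as the
    top-level directory inside the archive, as the tutorial requires.
    """
    src = Path(species_dir)
    if not src.is_dir():
        raise NotADirectoryError(species_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores only the folder name, not the full local path
        tar.add(str(src), arcname=src.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Calling pack_species_parameters("carrot", "carrot.tar.gz") on a folder named carrot should list entries such as carrot/carrot_parameters.cfg, which lets you confirm the layout before uploading.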
If you trained AUGUSTUS on this web server, you may simply specify the project identifier of that training run instead of downloading and re-uploading a parameter archive. You find the project identifier, for example, in the job confirmation e-mail. It starts either with train or with pred and is followed by 8 digits.
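If you keep track of several jobs, a quick format check can catch copy-and-paste mistakes before submission. This is a minimal sketch that assumes the identifier is exactly the prefix train or pred followed by 8 digits, as described above; the function name is hypothetical and the server may apply additional rules.

```python
import re

# Assumption taken from the tutorial text: "train" or "pred" followed by 8 digits.
PROJECT_ID_PATTERN = re.compile(r"(train|pred)\d{8}")


def looks_like_project_id(text: str) -> bool:
    """Return True if text has the shape of a web server job identifier."""
    return PROJECT_ID_PATTERN.fullmatch(text.strip()) is not None
```

Note that identifiers of pre-trained parameter sets (for example honeybee1) follow a different naming scheme and are intentionally not matched by this check.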
In addition to using parameters that you trained yourself, you may also use pre-trained parameters for the following species:
|Species||Project identifier||Courtesy of|
|Apis dorsata||adorsata||Francisco Camara Ferreira|
|Apis mellifera||honeybee1||Katharina Hoff and Mario Stanke|
|Bombus terrestris||bombus_terrestris2||Katharina Hoff|
|Callorhinchus milii||elephant_shark||Tereza Manousaki and Shigehiro Kuraku|
|Camponotus floridanus||camponotus_floridanus||Shishir K Gupta|
|Heliconius melpomene||heliconius_melpomene1||Sebastian Adler and Katharina Hoff|
|Gallus gallus domesticus||chicken||Stefanie Koenig|
|Petromyzon marinus||lamprey||Falk Hildebrand and Shigehiro Kuraku|
|Plants and algae|
|Triticum aestivum||wheat||Stefanie König|
|Aspergillus fumigatus||aspergillus_fumigatus||Jason Stajich|
|Aspergillus nidulans||aspergillus_nidulans||Jason Stajich|
|Aspergillus oryzae||aspergillus_oryzae||Jason Stajich|
|Aspergillus terreus||aspergillus_terreus||Jason Stajich|
|Botrytis cinerea||botrytis_cinerea||Jason Stajich|
|Candida albicans||candida_albicans||Jason Stajich|
|Candida guilliermondii||candida_guilliermondii||Jason Stajich|
|Candida tropicalis||candida_tropicalis||Jason Stajich|
|Chaetomium globosum||chaetomium_globosum||Jason Stajich|
|Coccidioides immitis||coccidioides_immitis||Jason Stajich|
|Coprinus cinereus||coprinus||Jason Stajich|
|Cryptococcus neoformans||cryptococcus_neoformans_neoformans_B||Jason Stajich|
|Debaryomyces hansenii||debaryomyces_hansenii||Jason Stajich|
|Encephalitozoon cuniculi||encephalitozoon_cuniculi_GB||Jason Stajich|
|Eremothecium gossypii||eremothecium_gossypii||Jason Stajich|
|Fusarium graminearum||fusarium_graminearum||Jason Stajich|
|Histoplasma capsulatum||histoplasma_capsulatum||Jason Stajich|
|Kluyveromyces lactis||kluyveromyces_lactis||Jason Stajich|
|Laccaria bicolor||laccaria_bicolor||Jason Stajich|
|Lodderomyces elongisporus||lodderomyces_elongisporus||Jason Stajich|
|Magnaporthe grisea||magnaporthe_grisea||Jason Stajich|
|Neurospora crassa||neurospora_crassa||Jason Stajich|
|Phanerochaete chrysosporium||phanerochaete_chrysosporium||Jason Stajich|
|Pichia stipitis||pichia_stipitis||Jason Stajich|
|Rhizopus oryzae||rhizopus_oryzae||Jason Stajich|
|Saccharomyces cerevisiae||saccharomyces_cerevisiae_S288C||Jason Stajich|
|Schizosaccharomyces pombe||schizosaccharomyces_pombe||Jason Stajich|
|Ustilago maydis||ustilago_maydis||Jason Stajich|
|Verticillium longisporum||verticillium_longisporum1||Katharina Hoff and Mario Stanke|
|Yarrowia lipolytica||yarrowia_lipolytica||Jason Stajich, modified by Katharina Hoff|
|Archaea (experimental parameters)|
|Sulfolobus solfataricus||sulfolobus_solfataricus||Katharina Hoff|
|Bacteria (experimental parameters)|
|Escherichia coli||E_coli_K12||Katharina Hoff|
|Thermoanaerobacter tengcongensis||thermoanaerobacter_tengcongensis||Katharina Hoff|
Please let us know if you would like parameters that you trained for a certain species to be included in this public list! If they are included in this list, they will also be distributed with the upcoming AUGUSTUS release.
The genome file is an obligatory input for predicting genes with AUGUSTUS.
>Chr.1
CCTCCTCCTGTTTTTCCCTCAATACAACCTCATTGGATTATTCAATTCAC
CATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTC
TCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGAC
AACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTC
CAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGC
TCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
>Chr.2
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAG
CGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACT
ACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGT
TAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAG
AGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGG
AGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCT
TACACGGAAATCAACGGCGGTGTCATAAGCGAG
>Chr.3
.....
The maximal number of scaffolds allowed in a genome file is 250000. If your file contains more scaffolds, please remove all short scaffolds. For training AUGUSTUS, short scaffolds are worthless because no complete training genes can be generated from them. In terms of prediction, it is possible to predict genes in short scaffolds. However, those genes will in most cases be incomplete and probably unreliable.
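To stay below the scaffold limit, you can filter the genome file locally before uploading it. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the whole file is parsed in memory, and the length threshold min_length is a value you choose yourself (the tutorial does not prescribe one).

```python
from typing import Iterable, Iterator, List, Tuple

MAX_SCAFFOLDS = 250_000  # limit stated in the tutorial


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_short_scaffolds(lines: Iterable[str], min_length: int) -> List[Tuple[str, str]]:
    """Keep only scaffolds with at least min_length bases (threshold chosen by the user)."""
    return [(name, seq) for name, seq in read_fasta(lines) if len(seq) >= min_length]
```

After filtering, compare the number of remaining records with MAX_SCAFFOLDS and raise min_length if the file is still too large.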
Besides plain fasta format, our server accepts gzipped-fasta format for genome file upload. You find more information about gzip at the gzip homepage. Gzipped files have the file ending *.gz.
The AUGUSTUS prediction web server application offers two possibilities for transferring the genome file to the server: uploading a file, or specifying a web link to the file.
You cannot do both at the same time! You must either select a file on your hard drive or give a web link!
The genome file is used as a template for gene prediction: it is the sequence in which you want to predict genes.
This section describes a number of fields that are optional for predicting genes with AUGUSTUS.
This feature is available only for academic, personal and non-profit use as this is required by the BLAT license.
The cDNA file is a multiple fasta DNA file that contains e.g. ESTs or full-length cDNA sequences. Allowed sequence characters: A a T t G g C c H h X x R r Y y W w S s M m K k B b V v D d N n U u. Empty lines are not allowed and will be removed from the submitted file by the web server application. An example of the correct cDNA file format is given at 1.2.3.1 - Genome file format.
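You can check a cDNA file against the allowed characters before uploading it. This is a minimal sketch; the function name is hypothetical, and the character set is copied from the list above. Header lines (starting with >) are skipped, and empty lines are ignored here because the server removes them anyway.

```python
from typing import Iterable, List, Tuple

# Character set copied from the tutorial's list of allowed sequence characters.
ALLOWED_CDNA_CHARACTERS = frozenset("AaTtGgCcHhXxRrYyWwSsMmKkBbVvDdNnUu")


def find_invalid_characters(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, character) for every character that is not allowed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            continue  # header lines may contain arbitrary text
        for character in line:
            if character not in ALLOWED_CDNA_CHARACTERS:
                problems.append((number, character))
    return problems
```

An empty result means that every sequence line uses only the listed characters; it does not guarantee that the server accepts the file for other reasons.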
It is currently possible to submit assembled RNA-seq transcripts instead of or mixed with ESTs as a cDNA/EST file. However, you should be aware that RNA-seq files are often much bigger than EST or cDNA files, which increases the runtime of a prediction job. In order to keep the runtime of your prediction job as low as possible, you should remove all assembled RNA-seq transcripts from your file that do not map to the submitted genome sequence. (In principle, this holds true for EST and cDNA files, too, but there the problem is not as pronounced due to a smaller number of sequences.)
It is currently not allowed to upload RNA-seq raw sequences. (We filter for the average length of cDNA fasta entries and may reject the entire job in case the sequences are on average too short, i.e. shorter than 400 bp.)
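The average entry length can be estimated locally before submission. The sketch below is an approximation of the check described above, not the server's actual code; the function names are hypothetical, and the constant reflects the 400 bp threshold mentioned in the text.

```python
from typing import Iterable

MIN_MEAN_ENTRY_LENGTH = 400  # bp, threshold mentioned in the tutorial


def mean_entry_length(lines: Iterable[str]) -> float:
    """Return the mean sequence length over all entries of a FASTA file."""
    entries = 0
    residues = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            entries += 1
        elif line:
            residues += len(line)
    if entries == 0:
        raise ValueError("no FASTA entries found")
    return residues / entries


def is_probably_too_short(lines: Iterable[str]) -> bool:
    """True if the file would likely fail an average-length filter of 400 bp."""
    return mean_entry_length(lines) < MIN_MEAN_ENTRY_LENGTH
```

For a real file you could pass an open file handle, e.g. is_probably_too_short(open("my_ests.fa")), where my_ests.fa is a placeholder file name.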
Besides plain fasta format, our server accepts gzipped-fasta format for cDNA file upload. You find more information about gzip at the gzip homepage. Gzipped files have the file ending *.gz. The maximal supported file size is 1 GB.
There are two options for cDNA file upload: upload from your local hard drive, or upload from a public http or ftp server. Please see 1.2.3.2 - Genome file upload options for a more detailed description of upload options.
The cDNA file is used for generating extrinsic evidence for gene structures in the gene prediction process, also called hints.
It is possible to submit an externally created file that contains extrinsic evidence for gene structures in gff format.
In general, gff files must contain the following columns (the columns are separated by tabulators):
Correct format example:
HS04636	anchor	exonpart	500	506	0	-	.	source=M
HS04636	anchor	exon	966	1017	0	+	0	source=M
HS04636	anchor	start	966	968	0	+	0	source=M
HS04636	anchor	dss	2199	2199	0	+	.	source=M
HS04636	anchor	stop	7631	7633	0	+	0	source=M
HS04636	anchor	intronpart	7631	7633	0	+	0	source=M
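A hints file can be checked line by line before upload. The following minimal sketch assumes the standard nine tab-separated GFF columns (sequence name, source, feature, start, end, score, strand, frame, attributes); the class and function names are hypothetical, and the checks are only basic sanity tests, not the server's validation.

```python
from typing import NamedTuple


class Hint(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: str


def parse_hint_line(line: str) -> Hint:
    """Parse one tab-separated hints line and apply basic sanity checks."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    start, end = int(fields[3]), int(fields[4])
    if start > end:
        raise ValueError("start must not be larger than end")
    if fields[6] not in {"+", "-", "."}:
        raise ValueError(f"unexpected strand value: {fields[6]!r}")
    return Hint(fields[0], fields[1], fields[2], start, end,
                fields[5], fields[6], fields[7], fields[8])
```

A line that uses spaces instead of tabs raises a ValueError here, which is a common reason for a hints file to be rejected.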
The hints file is used as extrinsic evidence that supports gene structure prediction. You can generate hints yourself based on any alignment program and information resource (e.g. ESTs, RNA-seq data, peptides, proteins, ...) that appears suitable to you.
Predicting UTRs takes significantly more time, but in addition to reporting UTRs, the prediction is usually also a little more accurate on the coding regions when ESTs are given as extrinsic evidence.
UTR prediction is only possible if UTR parameter files exist for your species. Even if UTR parameter files exist for a species, you should make sure that they are species specific, i.e. have actually been optimized for your target species. It is a waste of time to predict UTRs with general (template) parameters.
If no UTR parameter files exist for your species but you enable UTR prediction in the form, the web server application will overrule the choice to predict UTRs by simply not predicting any UTRs.
By default, AUGUSTUS predicts genes on both strands, but you may alter this behavior by selecting another radio button in this field to predict genes on the forward (+) or reverse (-) strand only.
By default, AUGUSTUS does not predict any alternative transcripts.
If you select few, then the following AUGUSTUS parameters are set to result in the prediction of relatively few alternative transcripts:
--alternatives-from-sampling=true --minexonintronprob=0.2 --minmeanexonintronprob=0.5
If you select medium the AUGUSTUS parameters are set to
--alternatives-from-sampling=true --minexonintronprob=0.08 --minmeanexonintronprob=0.4
If you select many, AUGUSTUS parameters are set to
--alternatives-from-sampling=true --minexonintronprob=0.08 --minmeanexonintronprob=0.3
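If you later want to reproduce a web server run with a local AUGUSTUS installation, the three settings above can be kept in a small lookup table. This is a minimal sketch; the dictionary, the function name and the key "none" (standing for the default of no alternative transcripts) are assumptions made for illustration, while the parameter values are copied from the text above.

```python
from typing import Dict, List

# Parameter values copied from the tutorial; the key "none" is a hypothetical
# label for the default behaviour (no alternative transcripts, no extra flags).
ALTERNATIVE_TRANSCRIPT_FLAGS: Dict[str, List[str]] = {
    "none": [],
    "few": ["--alternatives-from-sampling=true",
            "--minexonintronprob=0.2",
            "--minmeanexonintronprob=0.5"],
    "medium": ["--alternatives-from-sampling=true",
               "--minexonintronprob=0.08",
               "--minmeanexonintronprob=0.4"],
    "many": ["--alternatives-from-sampling=true",
             "--minexonintronprob=0.08",
             "--minmeanexonintronprob=0.3"],
}


def flags_for(level: str) -> List[str]:
    """Return the AUGUSTUS parameters for one of the form's settings."""
    try:
        return list(ALTERNATIVE_TRANSCRIPT_FLAGS[level])
    except KeyError:
        raise ValueError(f"unknown setting: {level!r}") from None
```

Note that medium and many differ only in --minmeanexonintronprob.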
Predict any number of (possibly partial) genes: This option is set by default. AUGUSTUS may predict no gene at all, one or more genes. The genes at the boundaries of the input sequence may be partial. Partial here means that not all of the exons of a gene are contained in the input sequence, but it is assumed that the sequence starts or ends in a non-coding region.
Predict only complete genes: AUGUSTUS assumes that the input sequence does not start or end within a gene. Zero or more complete genes are predicted.
Predict only complete genes - at least one: Same as the previous option, but AUGUSTUS predicts at least one gene (if possible).
Predict exactly one complete gene: AUGUSTUS assumes that the sequence contains exactly one complete gene. Note: This feature does not work properly in combination with alternative transcripts.
Ignore conflicts with other strand: By default AUGUSTUS assumes that no genes - even on opposite strands - overlap. Indeed, this usually is the case but sometimes an intron contains a gene on the opposite strand. In this case, or when AUGUSTUS makes a false prediction on the one strand because it falsely thinks there is a conflicting gene on the other strand, AUGUSTUS should be run with this option set. It then predicts the genes on each strand separately and independently. This may lead to more false positive predictions, though.
Trying to avoid abuse of our web server application through bots, we implemented a captcha. The captcha is an image that contains a string. You have to type the string from the image into the field next to the image.
After filling out the appropriate fields in the submission form, you have to click on the button that says "Start Predicting" at the bottom of the page. It might take a while until you are redirected to the status page of your job. The reason is that we check various file formats prior to job acceptance, and that the transfer of files from your local hard drive to our server might take a while. Please be patient and wait until you are redirected to the status page! Do not click the button more than once (it won't do any harm but it also doesn't speed up anything).
In the following, we provide some correctly formatted, compatible example data files:
http://bioinf.uni-greifswald.de/trainaugustus/examples/honeybee1.tar.gz - This file is an example of an AUGUSTUS species parameter archive file. Please do not upload this archive to our server since the identical parameters are usable through the AUGUSTUS species parameter project identifier honeybee1 and a re-upload would simply duplicate this data set. We only provide this file as an example which may help you check your own parameter archive in case incompatibilities with your application occur. These parameters were optimized for predicting genes in Apis mellifera.
http://bioinf.uni-greifswald.de/trainaugustus/examples/LG16.fa - This file may be used as a Genome file. It contains linkage group 16 of Apis mellifera from GenBank (modified headers).
http://bioinf.uni-greifswald.de/trainaugustus/examples/honeybee-ests.fa - This file may be used as a cDNA file. It contains 3 ESTs of Apis mellifera from GenBank (modified headers).
http://bioinf.uni-greifswald.de/trainaugustus/examples/honeybee.hints - This file may be used as a Hints file. It contains hints that were generated from Apis mellifera RNA-Seq data for genome file LG16.fa.
You can insert some of these sample data sets by pressing the "Fill in Sample Data" button:
After you click the "Start Predicting" button, the web server application first validates whether the combination of your input fields is generally correct. If you submitted an unsupported input combination, you will be redirected to the prediction submission form and an error message will be displayed at the top of the page.
If all fields were filled in correctly, the application is actually initiated. You will receive an e-mail that confirms your job submission and that contains a link to the job status page (if you supplied an e-mail address). You will be redirected to the job status page.
In the beginning, the status page will display that your job has been submitted. This means that the web server application is currently uploading your files and validating file formats. After a while, the status will change to waiting for execution. This means that all file formats have been confirmed and an actual AUGUSTUS prediction job has been submitted to our grid engine, but the job is still pending in the queue. Depending on the length of the waiting queue, this status may persist for a while. Please contact us in case your job is pending for more than one month. Later, the job status will change to computing. This means the job is currently computing. When the page displays finished, all computations have been finished and a website with your job's results has been generated.
You will receive an e-mail when your job has finished (if you supplied an e-mail address).
Since predicting genes with AUGUSTUS may under certain circumstances be a very resource-consuming process, we try to avoid data duplication. If you or somebody else tries to submit exactly the same input file combination more than once, the duplicated job will be stopped and the submitter of the redundant job will be told where the status page of the previously submitted job is located.
You should automatically receive an e-mail in case an error occurs during the AUGUSTUS gene prediction process. The admin of this server is also notified by e-mail about errors. We will get in touch with you again after we have figured out what caused the error. If you did not supply an e-mail address, errors are likely to be ignored by the AUGUSTUS web server development team.
After job computations have finished, you will receive an e-mail (if you supplied an e-mail address). The job status web page may at this point in time look similar to this:
This page should contain the file augustus.tar.gz. Please right-click the link and select "Save As" (or similar) to save the file on your local hard drive.
augustus.tar.gz is a gene prediction archive and its content depends on the input file combination. You can unpack the archive by typing tar -xzvf *.tar.gz into your shell. (You find more information about the software tar at the GNU tar website.)
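If you prefer to unpack the archive from a script rather than from the shell, Python's standard tarfile module can do the same job as the tar command above. This is a minimal sketch; the function name and the destination folder are assumptions, and the names of the files inside the archive depend on your input file combination.

```python
import os
import tarfile
from typing import List


def unpack_prediction_archive(archive_path: str, dest_dir: str) -> List[str]:
    """Extract a .tar.gz archive into dest_dir and return the names of the contained files."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [member.name for member in tar.getmembers() if member.isfile()]
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a filter that refuses unsafe paths
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return names
```

For example, unpack_prediction_archive("augustus.tar.gz", "results") extracts everything into a folder named results (a placeholder name) and returns the list of extracted files.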
# This output was generated with AUGUSTUS (version 2.6).
# AUGUSTUS is a gene prediction tool for eukaryotes written by Mario Stanke (firstname.lastname@example.org)
# and Oliver Keller (email@example.com).
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# reading in the file /var/tmp/augustus/AUG-1855139717/hints.gff ...
# Setting 1group1gene for E.
# Sources of extrinsic information: M E
# Have extrinsic information about 1 sequences (in the specified range).
# Initialising the parameters ...
# human version. Use default transition matrix.
# Looks like /var/tmp/augustus/AUG-1855139717/input.fa is in fasta format.
# We have hints for 1 sequence and for 1 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 6483, name = HSACKI10) -----
#
# Delete group HintGroup , 5803-5803, mult= 1, priority= -1 1 features
# Forced unstranded hint group to the only possible strand for 3 groups.
# Deleted 1 groups because some hint was not satisfiable.
# Constraints/Hints:
HSACKI10	anchor	start	182	184	0	+	.	src=M
HSACKI10	anchor	stop	3058	3060	0	+	.	src=M
HSACKI10	anchor	dss	4211	4211	0	+	.	src=M
HSACKI10	b2h	ep	1701	2075	0	.	.	grp=154723761;pri=4;src=E
HSACKI10	b2h	ep	1716	2300	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	ep	1908	2300	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	ep	3592	3593	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	ep	3836	3940	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	ep	5326	5499	0	+	.	grp=27937842;pri=4;src=E
HSACKI10	b2h	ep	5805	6157	0	+	.	grp=27937842;pri=4;src=E
HSACKI10	b2h	exon	3142	3224	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	exon	3142	3224	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	exon	3592	3748	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	anchor	intronpart	5000	5100	0	+	.	src=M
HSACKI10	b2h	intron	2301	3141	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	intron	2301	3141	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	intron	3225	3591	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	intron	3225	3591	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	intron	3749	3835	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	intron	5500	5804	0	+	.	grp=27937842;pri=4;src=E
HSACKI10	anchor	CDS	6194	6316	0	-	0	src=M
HSACKI10	anchor	CDSpart	5900	6000	0	+	.	src=M
# Predicted genes for sequence number 1 on both strands
# start gene g1
HSACKI10	AUGUSTUS	gene	182	3060	0.63	+	.	g1
HSACKI10	AUGUSTUS	transcript	182	3060	0.63	+	.	g1.t1
HSACKI10	AUGUSTUS	start_codon	182	184	.	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	initial	182	225	1	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	internal	1691	2300	0.86	+	1	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	terminal	3049	3060	0.74	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	CDS	182	225	1	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	CDS	1691	2300	0.86	+	1	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	CDS	3049	3060	0.74	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	stop_codon	3058	3060	.	+	0	transcript_id "g1.t1"; gene_id "g1";
# coding sequence = [atgatgaaaccctgtctctaccaaaaagacaaaaaattagccagctcaagcaagcactactcttcctcccgcagtggag
# gaggaggaggaggaggaggatgtggaggaggaggaggagtgtcatccctaagaatttctagcagcaaaggctcccttggtggaggatttagctcaggg
# gggttcagtggtggctcttttagccgtgggagctctggtgggggatgctttgggggctcatcaggtggctatggaggattaggaggttttggtggagg
# tagctttcatggaagctatggaagtagcagctttggtgggagttatggaggcagctttggagggggcaatttcggaggtggcagctttggtgggggca
# gctttggtggaggcggctttggtggaggcggctttggaggaggctttggtggtggatttggaggagatggtggccttctctctggaaatgaaaaagta
# accatgcagaatctgaatgaccgcctggcttcctacttggacaaagttcgggctctggaagaatcaaactatgagctggaaggcaaaatcaaggagtg
# gtatgaaaagcatggcaactcacatcagggggagcctcgtgactacagcaaatactacaaaaccatcgatgaccttaaaaatcagagaacaacataa]
# protein sequence = [MMKPCLYQKDKKLASSSKHYSSSRSGGGGGGGGCGGGGGVSSLRISSSKGSLGGGFSSGGFSGGSFSRGSSGGGCFGG
# SSGGYGGLGGFGGGSFHGSYGSSSFGGSYGGSFGGGNFGGGSFGGGSFGGGGFGGGGFGGGFGGGFGGDGGLLSGNEKVTMQNLNDRLASYLDKVRAL
# EESNYELEGKIKEWYEKHGNSHQGEPRDYSKYYKTIDDLKNQRTT]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 20
# CDS exons: 1/3
# E: 1
# CDS introns: 0/2
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 5
# E: 3 (gi|154723761,gi|13907559,gi|154736078)
# M: 2
# end gene g1
###
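For a first look at a prediction file, it can help to list only the predicted genes. The sketch below is a minimal example based on the output shown above: it skips comment lines (starting with #), splits the remaining lines into nine whitespace-separated columns, and keeps lines whose source is AUGUSTUS and whose feature is gene. The function name is hypothetical, and the sketch assumes sequence names without spaces.

```python
from typing import Iterable, List, Tuple


def predicted_genes(lines: Iterable[str]) -> List[Tuple[str, str, int, int, str]]:
    """Return (gene id, sequence name, start, end, strand) for each predicted gene."""
    genes = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments carry run information, sequences and evidence
        fields = line.split(None, 8)  # the ninth column may itself contain spaces
        if len(fields) < 9:
            continue
        seqname, source, feature, start, end, _score, strand, _frame, attributes = fields
        if source == "AUGUSTUS" and feature == "gene":
            genes.append((attributes, seqname, int(start), int(end), strand))
    return genes
```

Applied to the example output above, this yields a single entry for gene g1 on HSACKI10 from position 182 to 3060 on the + strand; the hint lines are ignored because their source column is anchor or b2h.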
Click here to view a real AUGUSTUS prediction web service output!
It is important that you check the results of an AUGUSTUS gene prediction run. Do not trust predictions blindly! Prediction accuracy depends on the input sequence quality, on hints quality and on whether a given parameter set fits the species of the supplied genomic sequence.