WebAUGUSTUS Help
This website contains short instructions and some frequently asked questions concerning
- the training of AUGUSTUS and
- predicting genes in a new genomic sequence with pre-trained parameters.
For more detailed instructions, please read Training Tutorial and Prediction Tutorial.
Contents
Why do I not get any results?
Why is the server busy?
What is the species name?
Why should I give my e-mail address?
File upload versus web link
Instructions for fasta headers
Which files must or can I submit for training AUGUSTUS?
Which files are required for predicting genes in a new genome?
Genome file
cDNA file
Protein file
Training gene structure file
Hints file
Parameter archive
What is the project identifier?
What does my job status mean?
UTR prediction: yes or no?
Allowed gene structure
What about that data duplication?
Why is the prediction accuracy in the genome of my species not as good as I expected?
What about data privacy and security?
Gene prediction results
Training results
I am not from academia/not non-profit. What can I do?
No submission of personalized human sequence data!
Why do I see a running dog after pressing the submission button?
Why do I not get any results?
- Did an obvious error occur?
Please contact augustus-web@uni-greifswald.de if you are not sure what the error message is telling you.
One frequently occurring error in the AutoAug.err file is the following:
The file with UTR parameters for train****** does not seem to exist. This likely means that the UTR model has not been trained yet for train******.
This error message tells you that no UTR parameters were trained for your species. If no other error messages appear above the first UTR error message, the general results of your job are fine; you simply did not get UTR parameters and thus no predictions with UTRs.
Another frequently occurring error is the following:
Illegal division by zero at scripts/autoAugTrain.pl line 241.
failed to execute: No such file or directory
This error occurs when no training gene structures were generated or available. This may be caused by one of the following circumstances:
- You supplied a genome and protein file. In this case, Scipio was not able to generate any complete gene structures from the data set. In most cases, some incomplete gene structures were produced, but since they frequently cause crashes in the AUGUSTUS training routine, we do not use them within the web service.
- The files that you supplied had long and complex fasta headers. This causes problems with PASA and Scipio. Make sure that the fasta headers in all your files are unique and short, and that they contain no whitespace or special characters.
- Did you submit your job a long time ago and it seems to be "stuck" at the status of "computing"?
Please contact augustus-web@uni-greifswald.de to inquire whether your job is really still running.
- Did your job finish but there are just no parameters or predictions?
The quality of results depends on the quality and combination of your input data. If, for example, the input data did not provide sufficient information for generating training genes, then no AUGUSTUS parameters will be optimized for your species, and no predictions will be made. In the case of the gene prediction web server application, it is also possible that your submitted genome sequence does not contain any protein coding genes.
Why is the server busy?
Training AUGUSTUS is a very resource- and time-consuming process. We use a grid engine queuing system with a limited number of waiting slots. If we estimate that the time from job submission to computation start might be very long, our web server might display a message that our server is busy. The submission of new jobs is then disabled (both prediction and training submission). Please wait one or two weeks before you try a new submission. If the problem persists for longer than a month, or if your job is urgent, please contact augustus-web@uni-greifswald.de.
What is the species name?
The species name is the name of the species for whose genome you want to train AUGUSTUS. The species name is an obligatory parameter. Because AUGUSTUS training is so time-consuming, we would like to know the names of the species for which AUGUSTUS was trained, so that we can make the trained parameters available to the public and others who are interested in the same species do not have to rerun the training process. (We will only explicitly publish your parameter set with the next AUGUSTUS release after confirming via e-mail that you agree to this.)
However, if you do not want to reveal the true species name, you may use any other string shorter than 30 characters as a species name.
The species name is not allowed to contain spaces!
Why should I give my e-mail address?
Unlike many other bioinformatics web services, the AUGUSTUS web server application is not an implementation of a fail-safe procedure. Particularly the assembly of a training gene set from extrinsic data (ESTs and protein sequences) and a genome sequence may not always work perfectly. Our pipeline may issue warnings or errors, and sometimes, we need to get some feedback from you via e-mail in order to figure out what is the problem with your particular input data set.
In addition, training and running AUGUSTUS are rather time-consuming processes that may take up to several weeks (depending on the input data). It may be more convenient to receive an e-mail notification when your job has finished than to check the status page over and over again.
Therefore, we strongly recommend that you enter an e-mail address.
If supplied, we use your e-mail address for the following purposes:
- Confirming your job submission
- Confirming successful file upload (for large files via ftp/http)
- Notifying you about your job having finished
- Informing you about any problems that might occur during your particular job and asking questions about that job in order to solve those problems
We do not use your e-mail address to send you any spam, e.g. about web service updates. We do not share your e-mail address with any third parties. Please read our Data Privacy Protection declaration.
Job submission without giving an e-mail address is possible but discouraged.
If you provide an e-mail address, we kindly ask you to check the confirmation checkbox to confirm that you agree to the following terms:
"If I provide an e-mail address, I agree that it will be stored on the server until the computations of my job have finished. I agree to receive e-mails that are related to the particular WebAUGUSTUS job that I submitted."
File upload versus web link
The AUGUSTUS training and prediction web server application offers in some cases two possibilities for transferring files to the server: uploading a file or specifying a web link to the file.
- For small files, please click on the Browse button and select a file on your hard drive.
If you experience a Connection timeout (because your file was too large for this type of upload), please use the option for large files!
- Large files can be retrieved from a public web link. Specify a valid ftp or http URL to your sequence file. Our server will fetch the file from the given address. (WebAUGUSTUS does not accept dropbox links!)
You cannot do both at the same time! For each file type (e.g. the genome file), you must either select a file on your hard drive or give a web link!
Instructions for fasta headers
We observed that most problems with generating training genes for training AUGUSTUS are caused by fasta headers in the sequence files. Some of the tools in our pipeline will truncate fasta headers if they are too long or contain spaces or special characters. This leads to a lot of warning messages in the AutoAug.err file, and it may also lead to non-unique fasta entry names, which will crash the pipeline. We therefore strongly recommend that you adhere to the following rules for fasta headers when using our web services:
- no whitespace in the headers
- no special characters in the headers (e.g. !#@&|;)
- make the headers as short as possible
- start headers with a letter, not with a number
- use letters and numbers only
In the following we give some header examples that will not cause problems:
>entry1
>contig1000
>est20
>scaffold239
The following kinds of headers will cause at least warning messages but probably also a pipeline crash:
>contig1 length=1000 Arabidopsis thaliana
>gi|123344545|some_protein|some_species
>Drosophila melanogaster scaffold 10000
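The rules above can be checked automatically before submission. The following is a minimal Python sketch, not part of WebAUGUSTUS; the function name is_safe_header is hypothetical. It applies the rules literally (a leading letter followed by letters and digits only), so it is deliberately strict and will also reject headers containing a dot.

```python
import re

# Assumption: the listed rules are expressed as one pattern, namely a
# leading letter followed by letters and digits only.
SAFE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")


def is_safe_header(header_line):
    """Return True if a fasta header line (starting with '>') follows the rules."""
    if not header_line.startswith(">"):
        return False
    name = header_line[1:].rstrip("\r\n")
    return bool(SAFE_NAME.match(name))
```

Applied to the examples in this section, the sketch accepts headers such as >contig1000 and rejects headers such as >contig1 length=1000 Arabidopsis thaliana.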
If you have a fasta file with unsuitable headers and you do not know how to modify them automatically, you may use the Perl script simplifyFastaHeaders.pl. After saving it on your local Unix system, first check whether the location of Perl in the first line of the script is correct for your system (#!/usr/bin/perl). If Perl is installed in another location, you need to modify that line! Then, execute the script with the following parameters:
perl simplifyFastaHeaders.pl in.fa nameStem out.fa header.map
- in.fa is the input fasta file, it must already be in valid fasta format
- nameStem is a character descriptor that will be used as a start for all simplified headers, e.g. est, or contig, or protein, etc. Be aware that fasta headers must always be unique, so choose different nameStems for genome and cDNA and protein file!
- out.fa is the output fasta file with simplified fasta headers, this file can be processed by our web service.
- header.map is a map that contains the simplified header and the original header in a tab-separated format.
Why is the simplification of fasta headers not a built in function of the web service? The reason is that we think you should be able to recognize the predictions later on! Gene predictions will be made available in gff format, which contains the sequence name in the first column. Therefore, you should modify the fasta headers yourself, before submitting data to the web service!
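If Perl is not available, the same idea can be sketched in Python. This is not a port of simplifyFastaHeaders.pl: the naming scheme (name stem plus a running number) and the exact layout of the map file (simplified header, tab, original header without the leading >) are assumptions made for illustration.

```python
def simplify_fasta_headers(in_path, name_stem, out_path, map_path):
    """Replace every fasta header by name_stem plus a running number.

    Writes the renamed fasta file to out_path and a tab-separated map
    (simplified header, original header) to map_path.
    Returns the number of renamed entries.
    """
    count = 0
    with open(in_path) as fin, open(out_path, "w") as fout, open(map_path, "w") as fmap:
        for line in fin:
            if line.startswith(">"):
                count += 1
                new_name = f"{name_stem}{count}"
                fmap.write(f"{new_name}\t{line[1:].rstrip()}\n")
                fout.write(f">{new_name}\n")
            elif line.strip():
                # Sequence line: copy it unchanged; empty lines are dropped.
                fout.write(line.rstrip("\r\n") + "\n")
    return count
```

A call such as simplify_fasta_headers("in.fa", "contig", "out.fa", "header.map") mirrors the four parameters described above. Keep the map file so that you can translate sequence names in the gff results back to the original headers.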
Which files must or can I submit for training AUGUSTUS?
You need to specify
- a genome file and
- at least one out of the following files: cDNA file, training gene structure file, and protein file.
Please consider that training AUGUSTUS is a time and resource consuming process. For optimal results, you should specify as much information as possible for a single training run instead of starting the AUGUSTUS training multiple times with different file combinations! If you have a lot of EST data, we recommend that you submit ESTs instead of protein sequences since ESTs will likely allow the generation of a UTR training set.
Which files are required for predicting genes in a new genome?
For predicting genes in a new genome with already trained parameters, you need to specify
- a genome file and
- a parameter archive. Instead of uploading the archive, you may also enter a valid project identifier if you trained AUGUSTUS on this web server and the training has already finished; or you may select a pre-trained parameter set from the drop-down menu.
You may in addition specify an EST/cDNA file and/or a hints file that will be used as extrinsic evidence for predicting genes.
Genome file
The genome file is an obligatory file for training AUGUSTUS and for making predictions with pre-trained parameters in a new genome. It must contain the genome sequence in (multiple) fasta format. Every header begins with a >. The sequence must be DNA. Allowed sequence characters: A a T t G g C c H h X x R r Y y W w S s M m K k B b V v D d N n. (Internally, AUGUSTUS will interpret everything that is not A a T t C c G g as an N!) Empty lines are not allowed. If they occur, they will automatically be removed by the webserver applications.
WebAUGUSTUS has a strict limit on the number of characters per line in FASTA files. Exceeding this limit might cause memory issues on our server. We recommend formatting sequences in FASTA files submitted to WebAUGUSTUS with a Unix line break after 80 characters.
Headers must be unique within a file! We recommend that you use short fasta headers. Headers like
>gi|382483733|gb|GZ667513.1|GW667513 SSH_BP_47 Some species Wicked root cDNA library Some species cDNA clone SSH_BP_47 similar to Putative NADH-cytochrome B5 reductase, mRNA sequence
are likely to cause a lot of warning messages. An example for a short header created from the too long header above:
>GZ667513.1
Correct file format example:
>Chr.1
CCTCCTCCTGTTTTTCCCTCAATACAACCTCATTGGATTATTCAATTCAC
CATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTC
TCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGAC
AACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTC
CAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGC
TCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
>Chr.2
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAG
CGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACT
ACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGT
TAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAG
AGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGG
AGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCT
TACACGGAAATCAACGGCGGTGTCATAAGCGAG
>Chr.3
.....
The maximal number of scaffolds allowed in a genome file is 250000. If your file contains more scaffolds, please remove all short scaffolds. For training AUGUSTUS short scaffolds are worthless because no complete training genes can be generated from them. In terms of prediction, it is possible to predict genes in short scaffolds. However, those genes will in most cases be incomplete and probably unreliable.
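The requirements for a genome file can be checked locally before upload. The following Python sketch is an illustration only and not the validation code of the web server. The function name check_genome_fasta is hypothetical, and using the recommended 80 characters as the line-length threshold is an assumption, since the server's actual limit is not stated here.

```python
# Allowed DNA characters from the list above (compared case-insensitively).
ALLOWED_DNA = set("ATGCHXRYWSMKBVDN")


def check_genome_fasta(lines, max_line_length=80, max_scaffolds=250000):
    """Return a list of human-readable problems found in the fasta lines."""
    problems = []
    headers = set()
    entries = 0
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            problems.append(f"line {number}: empty line")
        elif line.startswith(">"):
            entries += 1
            if line in headers:
                problems.append(f"line {number}: duplicate header {line}")
            headers.add(line)
        else:
            unexpected = sorted(set(line.upper()) - ALLOWED_DNA)
            if unexpected:
                problems.append(f"line {number}: unexpected characters {''.join(unexpected)}")
            if len(line) > max_line_length:
                problems.append(f"line {number}: longer than {max_line_length} characters")
    if entries > max_scaffolds:
        problems.append(f"{entries} sequences exceed the limit of {max_scaffolds}")
    return problems
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example check_genome_fasta(open("genome.fa")). An empty list means that none of the checked rules was violated.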
cDNA file
The cDNA file is a multiple fasta DNA file that contains e.g. ESTs or full-length cDNA sequences. Allowed sequence characters: A a T t G g C c H h X x R r Y y W w S s M m K k B b V v D d N n U u. Empty lines are not allowed and will be removed from the submitted file by the web server application. See Genome file for a format example. Uploading a cDNA file to our web server application will invoke the software BLAT [2], which on our web server application is available only for academic, personal and non-profit use.
Protein file
The protein file is a multiple fasta file that contains protein sequences as supporting evidence for genes. Allowed sequence characters: A a R r N n D d C c E e Q q G g H h I i L l K k M m F f P p S s T t W w Y y V v B b Z z J j X x. Empty lines are not allowed but will simply be removed from the file by the webserver application.
Correct file format example:
>protein1
maaaafgqlnleepppiwgsrsvdcfekleqigegtygqvymakeiktgeivalkkirmd
neregfpitaireikilkklhhenvihlkeivtspgrdrddqgkpdnnkykggiymvfey
mdhdltgladrpglrftvpqikcymkqlltglhychvnqvlhrdikgsnllidnegnlkl
adfglarsyshdhtgnltnrvitlwyrppelllgatkygp
>protein2
neregfpitaireikilkklhhenvihlkeivtspgrdrddqgkpdnnkykggiymvfey
mdhdltgladrpglrftvpqikcymkqlltglhychvnqv
>protein3
...
Submitting a protein file to our AUGUSTUS training web server application will invoke Scipio [3], which uses BLAT [2]. Therefore, protein file upload is only available for academic, personal and non-profit use on our web server application.
Training gene structure file
You can submit your own, externally created training gene structures to the AUGUSTUS training web server application. Regardless of the format, gene structure files are not allowed to contain java metacharacters like "*" or "?".
Training gene structure files can be submitted in two different formats: Genbank format or gff format.
Training gene structure file in genbank format
Gene structures in genbank format must contain the coding sequence parts and flanking regions. Flanking regions are important because AUGUSTUS is supposed to differentiate between genes and intergenic regions. The length of flanking regions depends on the length of genes in the target genome. In our pipeline, flanking regions are set to the average gene length, but restricted to the range from 1000 to 10000 nt. It is very important to make sure that the flanking regions do not contain any other protein coding gene parts, i.e. we recommend trimming flanking regions so that they exclude other CDS parts.
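The rule for the flanking region length can be written down as a small calculation. This Python sketch is one reading of the sentence above, not code from the pipeline; the function name is hypothetical, and truncating the average to a whole number of nucleotides is an assumption.

```python
def flanking_region_length(gene_lengths, lower=1000, upper=10000):
    """Average gene length in nt, restricted to the range lower..upper."""
    if not gene_lengths:
        raise ValueError("at least one gene length is required")
    average = sum(gene_lengths) / len(gene_lengths)
    # Assumption: the result is truncated to an integer number of nucleotides.
    return int(min(max(average, lower), upper))
```

For genes of length 2000 and 4000 nt the sketch yields 3000 nt; very short or very long average gene lengths are replaced by the limits of 1000 and 10000 nt.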
It is important for our pipeline that the LOCUS names within a submitted training gene structure file are unique, i.e. you should not use the same LOCUS name more than one time!
Correct file format example (condensed view, the three dots represent further lines of sequences):
LOCUS       Chr.1_1-159458   159458 bp  DNA
FEATURES             Location/Qualifiers
     source          1..159458
     CDS             complement(join(2421..2655,3858..4005,4080..4235,5569..5857
                     ,10316..10534,155240..155458))
                     /gene="1474336"
BASE COUNT     49195 a 29117 c 28985 g 49950 t 2211 n
ORIGIN
        1 aaaatacatc acaatacatt taattcactt tccatcatcg agattaacga aaattattta
       61 aaatatcgaa gatgaaaata tcctcaagat gatactgaac ggctaagaaa aatacatcac
      121 acaactttaa ttcattttcc atcatcgaga ttaacgaaaa gaaaaaattt taactcccta
...
   159301 atacgccacc aggtatttcg cctgattgtt cctcgaatat cttctctctc tctatatata
   159361 tatatattac ttggcacgat aatcgtcgaa tcgttattta taaattgctt catctatcgc
   159421 gatatttttg caacaactct cgcttttctc tttccatt
//
LOCUS       Chr.1_313992-323129   9138 bp  DNA
FEATURES             Location/Qualifiers
     source          1..9138
     CDS             join(4001..4048,4989..5138)
                     /gene="194551"
BASE COUNT     2829 a 1502 c 1750 g 2948 t 109 n
ORIGIN
        1 ttttccttct ttcttttttt tttatttaca ttaatgagaa ttttcgcaaa tatttcatcg
       61 ctgccatcct tttttttcct cgacgtcaat cacgcgacac atttgttaga gaaatggatt
      121 ttaatcttga aaaaagaaaa atacaaatgc caacgcattt caaatccttt cctattatta
...
     9001 tcaacgaaac aaataattgc ttcacaaaat atcgcacgta acaacaatat agacttcaat
     9061 attcaacaat tcttttcctt tatacacaaa gatacacaaa atataaaagt tttaatactt
     9121 caacttcaac gaaacagg
//
If you want to train UTRs, you have to additionally incorporate mRNA information in your genbank file.
Correct file format example (including UTR training):
LOCUS       scf7180001240730_g20   526 bp  DNA
FEATURES             Location/Qualifiers
     source          1..526
     mRNA            99..125
     CDS             99..99
BASE COUNT     164 a 99 c 68 g 195 t
ORIGIN
        1 gtgacggagc ccaaggacga gcccgtgccc tcagagccca cgtccgacgt gaggcccgcg
       61 ccagcgcccc tcccgccgcc cgtcgcagcc actgcttaga ctttactaat ataaacattg
      121 aaaatatttt gtgttttatt tccaatcatt gaattataat cctattataa tataactaac
      181 attcgtaatt ttacaaaata actatgcaaa ttattttgta ttttcgtttt aaattatact
      241 tttcatataa atttctacaa atcttattca agaccataag tatccgctcg ctctacttcg
      301 ggcatttcct ttatttatat cttatttgac ttattttgat tatttaggct tatgttttcg
      361 atactattga aaacagaaaa taatttcata taattaataa tatattttca attaatatat
      421 ttaacaaata tttgtatagt tcaagcggac aaatccgttc ccatagtatt tatataaatt
      481 ttaatttaga gtaataacag tttgctgtat tgttgtagtc aaatac
//
LOCUS       scf7180001240751_g30   876 bp  DNA
FEATURES             Location/Qualifiers
     source          1..876
     mRNA            complement(401..777)
     CDS             complement(777..777)
BASE COUNT     300 a 136 c 116 g 324 t
ORIGIN
        1 aatgtaggaa aatgaaatat ttatttaaat tgttattatc acttcttcgc tctagtgtct
       61 tggcaaagcg cggcgttgag ttcagcctct cacacgcaat gcctccagaa ttcggcgaaa
      121 tgtgggggac agagtgtatt aacactaagt tccctcagcc acgactggtg aaattatata
      181 ttcagtttgt atactattac tcatgcaaac acttcatcat actttcactc aatcagtaaa
      241 gcataatatt ttatttaata ttgtttatca atactatttc cttgttgtta aatattattt
      301 tatttattat attaaattaa aatgtcaaaa ttaaaagtag gtgatgattt attactatct
      361 tttctatcca agaaaaaaaa gacacactga aacaattgta atttttgtta tgtttttatt
      421 acttaatatt attataaaaa tttgtaaata cgaaataaaa tagatagacg taataatatt
      481 tatttgttag ttaataataa taatgataat tacgaaagat acaagaaata tgcataaatg
      541 agtgttatat tatgtatttt atgagaatat aaatataaaa actgtcattg attatatttt
      601 ctaaatactt tcattttatg gcttgctggc ttttcaattt ccttatgttt cagcttttca
      661 ctcaatagag cgaaaccttc atcgacatgt aagccaatag aacaattaca aactaacttt
      721 attacatcag tcttttcatt tctttaagct tcaggcaaat atcatctaaa tgcctttcaa
      781 ctcgctacta acatcgcgtc gttatataaa tcagtgtata cggaattaaa cctgtcatgt
      841 ctcttgcaag acgtgtctgc tgttgtcacg cacaca
//
Training gene structure file in gff format
Training gene structure in gff format must comply with the fasta entry names of the genome file.
In general, gff format must contain the following columns (The columns are separated by tabulators):
- The sequence names must be found in the fasta headers of sequences in the genome file.
- The source tells with which software/process the gene structure was generated (you can fill in whatever you like).
- The feature may for AUGUSTUS training be CDS, 5'-UTR or 3'-UTR.
- Start is the beginning position of the line's feature, counting the first position of a sequence as position 1.
- Stop position, must be at least as large as start position.
- The score must be a number but the number is irrelevant to our web server applications.
- The strand denotes whether the gene is located on the forward (+) or on the reverse (-) strand.
- Frame is the reading frame, which can be denoted as '.' if unknown or irrelevant. For exonpart and exon it is defined as follows: On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) the next codon boundary comes (0, 1 or 2).
- Attribute contains a transcript identifier. All gff-entries belonging to one transcript must contain the same transcript identifier in the last column.
Correct file format example (without UTR):
Chr.1	mySource	CDS	1767	1846	1.000	-	0	transcript_id "1597_1"
Chr.1	mySource	CDS	1666	1709	1.000	-	1	transcript_id "1597_1"
Chr.1	mySource	CDS	1486	1605	1.000	-	2	transcript_id "1597_1"
Chr.1	mySource	CDS	1367	1427	1.000	-	2	transcript_id "1597_1"
Chr.1	mySource	CDS	1266	1319	1.000	-	1	transcript_id "1597_1"
Chr.1	mySource	CDS	1145	1181	1.000	-	1	transcript_id "1597_1"
Chr.1	mySource	CDS	847	1047	1.000	-	0	transcript_id "1597_1"
Chr.2	mySource	CDS	9471	9532	1.000	+	0	transcript_id "1399_2"
Chr.2	mySource	CDS	9591	9832	1.000	+	1	transcript_id "1399_2"
Chr.2	mySource	CDS	9885	10307	1.000	+	2	transcript_id "1399_2"
Chr.2	mySource	CDS	10358	10507	1.000	+	2	transcript_id "1399_2"
Chr.2	mySource	CDS	10564	10643	1.000	+	2	transcript_id "1399_2"
Correct file format example (with UTR):
Chr.1	mySource	5'-UTR	277153	277220	45	+	.	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	277221	277238	1	+	0	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	278100	278213	1	+	0	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	278977	279169	1	+	0	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	279630	279648	0.94	+	2	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	279734	279768	0.94	+	1	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	280307	280344	1	+	2	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	3'-UTR	280345	280405	78	+	.	transcript_id "g22472.t1"; gene_id "g22472";
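The column rules for training gene structures in gff format can be checked line by line before submission. The following Python sketch is an illustration under assumptions and not the server's own validation: the function name check_training_gff_line is hypothetical, and the transcript identifier is assumed to be written as transcript_id "..." as in the examples above.

```python
import re

# Feature names accepted for training, as listed in the column description.
TRAINING_FEATURES = {"CDS", "5'-UTR", "3'-UTR"}
# Assumption: the identifier is written as transcript_id "..." (see examples).
TRANSCRIPT_ID = re.compile(r'transcript_id "[^"]+"')


def check_training_gff_line(line, genome_names):
    """Return a list of problems found in one gff line (empty if none)."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) != 9:
        return [f"expected 9 tab-separated columns, found {len(columns)}"]
    seqname, _source, feature, start, stop, score, strand, frame, attribute = columns
    problems = []
    if seqname not in genome_names:
        problems.append(f"sequence name {seqname} not found in genome file")
    if feature not in TRAINING_FEATURES:
        problems.append(f"unexpected feature {feature}")
    if not (start.isdigit() and stop.isdigit()):
        problems.append("start and stop must be positive integers")
    elif int(start) < 1 or int(stop) < int(start):
        problems.append("stop must be at least as large as start, counting from 1")
    try:
        float(score)
    except ValueError:
        problems.append("score must be a number")
    if strand not in {"+", "-"}:
        problems.append("strand must be + or -")
    if frame not in {"0", "1", "2", "."}:
        problems.append("frame must be 0, 1, 2 or .")
    if not TRANSCRIPT_ID.search(attribute):
        problems.append("attribute lacks a transcript identifier")
    return problems
```

Here genome_names is the set of sequence names taken from the fasta headers of the genome file, for example {"Chr.1", "Chr.2"}. The sketch does not verify that the frame values are consistent with the CDS coordinates.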
Hints file
For the gene prediction web server application, it is possible to submit an externally created file that contains extrinsic evidence for gene structures in gff format.
Comment lines are not allowed.
It makes no sense to upload gene prediction files (e.g. augustus.gff from AUGUSTUS gene prediction) as hints to WebAUGUSTUS. This is therefore not allowed.
In general, gff format must contain the following columns (The columns are separated by tabulators):
- The sequence names must be found in the fasta headers of sequences in the genome file.
- The source tells with which software/process the gene structure was generated (you can fill in whatever you like).
- The feature, which for AUGUSTUS gene prediction may be one of the following:
- start - translation start, specifies an interval that contains the start codon. The interval can be larger than 3 nucleotides, in which case every ATG in the interval gets a bonus.
- stop - translation end (stop codon)
- tss - transcription start site
- tts - transcription termination site
- ass - acceptor (3') splice site, the last intron position
- dss - donor (5') splice site, the first intron position
- exonpart - part of an exon in the biological sense.
- exon - complete exon in the biological sense.
- intronpart - introns both between coding and non-coding exons.
- intron - complete intron in the biological sense
- CDSpart - part of the coding part of an exon. (CDS = coding sequence)
- CDS - coding part of an exon with exact boundaries. For internal exons of a multi exon gene this is identical to the biological boundaries of the exon. For the first and the last coding exon the boundaries are the boundaries of the coding sequence (start, stop).
- UTRpart - The hint interval must be included in the UTR part of an exon.
- UTR - exact boundaries of a UTR exon or the untranslated part of a partially coding exon.
- irpart - intergenic region part. The bonus applies to every base of the intergenic region. If UTR prediction is turned on (--UTR=on) then UTR is considered genic.
- nonexonpart - intergenic region or intron.
- genicpart - everything that is not intergenic region, i.e. intron or exon or UTR if applicable.
- Start is the beginning position of the line's feature, counting the first position of a sequence as position 1.
- Stop position, must be at least as large as start position.
- The score must be a number but the number is irrelevant to our web server applications.
- The strand denotes whether the gene is located on the forward (+) or on the reverse (-) strand.
- Frame is the reading frame, which can be denoted as '.' if unknown or irrelevant. For exonpart and exon it is defined as follows: On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) the next codon boundary comes (0, 1 or 2).
- For usage as a hint, Attribute must contain the string source=M (for manual). Other sources, such as EST or protein, are possible, but only in the command line version of AUGUSTUS. Source types other than M are ignored by the AUGUSTUS web server applications.
Correct format example:
Chr.1	anchor	exonpart	500	506	0	-	.	source=M
Chr.1	anchor	exon	966	1017	0	+	0	source=M
Chr.1	anchor	start	966	968	0	+	0	source=M
Chr.1	anchor	dss	2199	2199	0	+	.	source=M
Chr.1	anchor	stop	7631	7633	0	+	0	source=M
Chr.1	anchor	intronpart	7631	7633	0	+	0	source=M
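Hint lines in this format can also be generated by a script. The Python sketch below is a hypothetical helper (manual_hint is not part of AUGUSTUS) that assembles one tab-separated line with source=M and rejects feature names that are not in the list above. The default source name anchor and the default score 0 are taken from the example.

```python
# Feature names from the list of hint types above.
HINT_FEATURES = {
    "start", "stop", "tss", "tts", "ass", "dss", "exonpart", "exon",
    "intronpart", "intron", "CDSpart", "CDS", "UTRpart", "UTR",
    "irpart", "nonexonpart", "genicpart",
}


def manual_hint(seqname, feature, start, stop, strand, frame=".", source="anchor", score=0):
    """Build one tab-separated manual hint line (without a trailing newline)."""
    if feature not in HINT_FEATURES:
        raise ValueError(f"unknown hint feature: {feature}")
    if start < 1 or stop < start:
        raise ValueError("stop must be at least as large as start, counting from 1")
    if strand not in {"+", "-"}:
        raise ValueError("strand must be + or -")
    fields = [seqname, source, feature, str(start), str(stop),
              str(score), strand, str(frame), "source=M"]
    return "\t".join(fields)
```

For instance, manual_hint("Chr.1", "exon", 966, 1017, "+", frame=0) reproduces the second line of the example above.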
Parameter archive
A *.tar.gz archive with a folder containing the following files is required for predicting genes in a new genome with pre-trained parameters:
- species/species_parameters.cfg
- species/species_metapars.cfg
- species/species_metapars.utr.cfg
- species/species_exon_probs.pbl.withoutCRF
- species/species_exon_probs.pbl
- species/species_weightmatrix.txt
- species/species_intron_probs.pbl
- species/species_intron_probs.pbl.withoutCRF
- species/species_igenic_probs.pbl
- species/species_igenic_probs.pbl.withoutCRF
where species is replaced by the name of the species you trained AUGUSTUS for (e.g. carrot would result in carrot/carrot_parameters.cfg). The additional species before the slash means that all those files must reside in a directory that is called species (or in our example: carrot) before you tar and gzip it. If you simply tar and gzip the folder that contains the parameters of an AUGUSTUS training run, everything should work fine.
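Packing the species folder can also be done with a few lines of Python. This is a minimal sketch using the standard tarfile module; the function name pack_parameter_archive and the paths in the usage note are hypothetical. The folder name is kept as the top-level directory inside the archive, as required above.

```python
import tarfile
from pathlib import Path


def pack_parameter_archive(species_dir, archive_path):
    """Pack species_dir into a gzip-compressed tar archive.

    The folder name (e.g. carrot) stays the top-level directory inside
    the archive, so members look like carrot/carrot_parameters.cfg.
    """
    species_dir = Path(species_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(species_dir, arcname=species_dir.name)
    return archive_path
```

A call such as pack_parameter_archive("config/species/carrot", "carrot.tar.gz") would produce an archive whose members start with carrot/. The sketch does not check that all required parameter files are present.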
What is the project identifier?
If you trained AUGUSTUS on this web server, you may simply specify the project identifier of that training run instead of uploading a parameter archive. You can find the project identifier, for example, in the subject line of your training confirmation e-mail, where it says Your AUGUSTUS training job project_id. Project identifiers typically consist of the letters pred or train, followed by a random string of 8 characters, for example train345kljD4.
What does my job status mean?
In the beginning, the status page will display that your job has been submitted. This means that the web server application is currently uploading your files and validating file formats. After a while, the status will change to waiting for execution. This means that all file formats have been confirmed and an AUGUSTUS training job has been submitted to our grid engine, but the job is still pending. Depending on the length of the waiting queue, this status may persist for a while. Please contact us if your job is pending for more than one month. Later, the job status will change to computing. This means that the job is currently running. When the page displays finished, all computations have finished and a website with your job's results has been generated.
You will receive an e-mail with the link to the results of your job when computations are finished if you specified an email address.
UTR prediction: yes or no?
Predicting UTRs takes significantly more time. However, in addition to reporting UTRs, prediction with UTRs is usually also a little more accurate on the coding regions when ESTs are given as extrinsic evidence.
UTR prediction is only possible if UTR parameter files exist for your species. Even if UTR parameter files exist for a species, you should make sure that they are species specific, i.e. have actually been optimized for your target species. It is a waste of time to predict UTRs with general (template) parameters.
UTR prediction is only supported in combination with the following two gene structure constraints:
- predict any number of (possibly partial) genes
- only predict complete genes
UTR prediction is not possible in combination with the following constraints:
- only predict complete genes - at least one
- predict exactly one gene
- ignore conflicts with other strand
If no UTR parameter files exist for your species but you enabled UTR prediction in the form, the web server application will overrule your choice and simply not predict any UTRs.
Species for which UTR parameters are available:
- Acyrthosiphon pisum (pea_aphid)
- Amphimedon queenslandica (amphimedon)
- Apis mellifera (honeybee1)
- Bombus terrestris (bombus_terrestris2)
- Caenorhabditis elegans (caenorhabditis)
- Drosophila melanogaster (fly)
- Homo sapiens (human)
- Trichinella spiralis (trichinella)
- Toxoplasma gondii (toxoplasma)
- Arabidopsis thaliana (arabidopsis)
- Chlamydomonas reinhardtii (chlamy2011)
- Galdieria sulphuraria (galdieria)
- Solanum lycopersicum (tomato)
Allowed gene structure
Predict any number of (possibly partial) genes: This option is set by default. AUGUSTUS may predict no gene at all, one or more genes. The genes at the boundaries of the input sequence may be partial. Partial here means that not all of the exons of a gene are contained in the input sequence, but it is assumed that the sequence starts or ends in a non-coding region.
Predict only complete genes: AUGUSTUS assumes that the input sequence does not start or end within a gene. Zero or more complete genes are predicted.
Predict only complete genes - at least one: As the previous option. But AUGUSTUS predicts at least one gene (if possible).
Predict exactly one complete gene: AUGUSTUS assumes that the sequence contains exactly one complete gene. Note: This feature does not work properly in combination with alternative transcripts.
Ignore conflicts with other strand: By default AUGUSTUS assumes that no genes - even on opposite strands - overlap. Indeed, this usually is the case but sometimes an intron contains a gene on the opposite strand. In this case, or when AUGUSTUS makes a false prediction on the one strand because it falsely thinks there is a conflicting gene on the other strand, AUGUSTUS should be run with this option set. It then predicts the genes on each strand separately and independently. This may lead to more false positive predictions, though.
What about that data duplication?
We are trying to avoid data duplication. If you submitted some data that was already submitted before, by you or by somebody else, we will display a link to the previously submitted job.
Why is the prediction accuracy in the genome of my species not as good as I expected?
Gene prediction accuracy of AUGUSTUS in the genome of a certain species depends on the quality of training genes that were used for optimizing species specific parameters. The pipeline behind our AUGUSTUS training web server application offers a fully automated way of generating training genes, but it does not replace manual quality checks on the training genes that are often needed for improving the training gene set quality.
In order to improve accuracy, you could manually inspect the generated training genes and select a trustworthy subset and try retraining AUGUSTUS with this subset. It also helps to compare the training gene set to other sources of evidence that are not supported by our web server application, e.g. RNA-seq data.
What about data privacy and security?
Please read our data privacy protection declaration (German language, only).
The results of your job submission are publicly available. For the training web server application, these are the log files, trained parameters, training genes, ab initio gene predictions and gene predictions with hints; for the prediction web server application, this is the AUGUSTUS prediction archive. The link to your job status contains a long, pseudo-random string (UUID), and one needs to guess this string in order to get access to the results - but this is not particularly secure!
Other users who submit exactly the same input files as a previously submitted job will be redirected to the results page of that job. They do not need to guess the link.
Files that you upload to our server, e.g. sequence files, are not directly made available to anyone. However, if you choose to upload a file via http/ftp link, the link to your file is displayed on the job status page.
We are interested in redistributing high quality parameter sets for novel species with the AUGUSTUS release. We will not do so without your explicit permission. Please contact us if you would like to add your parameters to the AUGUSTUS repository.
Our server logs e-mail addresses, IP addresses and all job submission details. We store this data for up to 180 days in order to be able to trace back errors. E-mail addresses are stored until your job has finished. By submitting a job, you agree that we log this data.
Please contact augustus-web@uni-greifswald.de if your particular job requires a more secure environment, e.g. as part of a collaboration.
Prediction results
After job computations have finished, you will receive an e-mail (if you supplied an e-mail address).
The job status web page should then contain the file augustus.tar.gz. Right-click the link and select "Save As" (or similar) to save the file to your local hard drive.
augustus.tar.gz is a gene prediction archive, and its content depends on the combination of input files. You can unpack the archive by typing tar -xzvf *.tar.gz into your shell. (You can find more information about the software tar at the GNU tar website.)
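If you prefer to unpack the archive from a script, the Python standard library offers an equivalent to the tar command. This is a minimal sketch; the function name and the target directory are hypothetical, and it should only be used on archives from a source you trust.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, target_dir="."):
    """Extract a .tar.gz archive and return the names of its members."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target)
    return names


if __name__ == "__main__":
    # Assumes augustus.tar.gz was saved to the current directory.
    for name in unpack_archive("augustus.tar.gz", "augustus_results"):
        print(name)
```

The printed member names show which of the files described in this section your archive actually contains.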
Files that are always contained in gene prediction archives:
- *.gff - gene predictions in gff format
Format example AUGUSTUS prediction gff file:
# This output was generated with AUGUSTUS (version 2.6).
# AUGUSTUS is a gene prediction tool for eukaryotes written by Mario Stanke (mario.stanke@uni-greifswald.de)
# and Oliver Keller (keller@cs.uni-goettingen.de).
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# reading in the file /var/tmp/augustus/AUG-1855139717/hints.gff ...
# Setting 1group1gene for E.
# Sources of extrinsic information: M E
# Have extrinsic information about 1 sequences (in the specified range).
# Initializing the parameters ...
# human version. Use default transition matrix.
# Looks like /var/tmp/augustus/AUG-1855139717/input.fa is in fasta format.
# We have hints for 1 sequence and for 1 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 6483, name = Chr.1) -----
#
# Delete group HintGroup , 5803-5803, mult= 1, priority= -1 1 features
# Forced unstranded hint group to the only possible strand for 3 groups.
# Deleted 1 groups because some hint was not satisfiable.
# Constraints/Hints:
Chr.1	anchor	start	182	184	0	+	.	src=M
Chr.1	anchor	stop	3058	3060	0	+	.	src=M
Chr.1	anchor	dss	4211	4211	0	+	.	src=M
Chr.1	b2h	ep	1701	2075	0	.	.	grp=154723761;pri=4;src=E
Chr.1	b2h	ep	1716	2300	0	+	.	grp=13907559;pri=4;src=E
Chr.1	b2h	ep	1908	2300	0	+	.	grp=154736078;pri=4;src=E
Chr.1	b2h	ep	3592	3593	0	+	.	grp=13907559;pri=4;src=E
Chr.1	b2h	ep	3836	3940	0	+	.	grp=154736078;pri=4;src=E
Chr.1	b2h	ep	5326	5499	0	+	.	grp=27937842;pri=4;src=E
Chr.1	b2h	ep	5805	6157	0	+	.	grp=27937842;pri=4;src=E
Chr.1	b2h	exon	3142	3224	0	+	.	grp=13907559;pri=4;src=E
Chr.1	b2h	exon	3142	3224	0	+	.	grp=154736078;pri=4;src=E
Chr.1	b2h	exon	3592	3748	0	+	.	grp=154736078;pri=4;src=E
Chr.1	anchor	intronpart	5000	5100	0	+	.	src=M
Chr.1	b2h	intron	2301	3141	0	+	.	grp=13907559;pri=4;src=E
Chr.1	b2h	intron	2301	3141	0	+	.	grp=154736078;pri=4;src=E
Chr.1	b2h	intron	3225	3591	0	+	.	grp=13907559;pri=4;src=E
Chr.1	b2h	intron	3225	3591	0	+	.	grp=154736078;pri=4;src=E
Chr.1	b2h	intron	3749	3835	0	+	.	grp=154736078;pri=4;src=E
Chr.1	b2h	intron	5500	5804	0	+	.	grp=27937842;pri=4;src=E
Chr.1	anchor	CDS	6194	6316	0	-	0	src=M
Chr.1	anchor	CDSpart	5900	6000	0	+	.	src=M
# Predicted genes for sequence number 1 on both strands
# start gene g1
Chr.1	AUGUSTUS	gene	182	3060	0.63	+	.	g1
Chr.1	AUGUSTUS	transcript	182	3060	0.63	+	.	g1.t1
Chr.1	AUGUSTUS	start_codon	182	184	.	+	0	transcript_id "g1.t1"; gene_id "g1";
Chr.1	AUGUSTUS	initial	182	225	1	+	0	transcript_id "g1.t1"; gene_id "g1";
Chr.1	AUGUSTUS	internal	1691	2300	0.86	+	1	transcript_id "g1.t1"; gene_id "g1";
Chr.1	AUGUSTUS	terminal	3049	3060	0.74	+	0	transcript_id "g1.t1"; gene_id "g1";
Chr.1	AUGUSTUS	CDS	182	225	1	+	0	transcript_id "g1.t1"; gene_id "g1";
Chr.1	AUGUSTUS	CDS	1691	2300	0.86	+	1	transcript_id "g1.t1"; gene_id "g1";
Chr.1	AUGUSTUS	CDS	3049	3060	0.74	+	0	transcript_id "g1.t1"; gene_id "g1";
Chr.1	AUGUSTUS	stop_codon	3058	3060	.	+	0	transcript_id "g1.t1"; gene_id "g1";
# coding sequence = [atgatgaaaccctgtctctaccaaaaagacaaaaaattagccagctcaagcaagcactactcttcctcccgcagtggag
# gaggaggaggaggaggaggatgtggaggaggaggaggagtgtcatccctaagaatttctagcagcaaaggctcccttggtggaggatttagctcaggg
# gggttcagtggtggctcttttagccgtgggagctctggtgggggatgctttgggggctcatcaggtggctatggaggattaggaggttttggtggagg
# tagctttcatggaagctatggaagtagcagctttggtgggagttatggaggcagctttggagggggcaatttcggaggtggcagctttggtgggggca
# gctttggtggaggcggctttggtggaggcggctttggaggaggctttggtggtggatttggaggagatggtggccttctctctggaaatgaaaaagta
# accatgcagaatctgaatgaccgcctggcttcctacttggacaaagttcgggctctggaagaatcaaactatgagctggaaggcaaaatcaaggagtg
# gtatgaaaagcatggcaactcacatcagggggagcctcgtgactacagcaaatactacaaaaccatcgatgaccttaaaaatcagagaacaacataa]
# protein sequence = [MMKPCLYQKDKKLASSSKHYSSSRSGGGGGGGGCGGGGGVSSLRISSSKGSLGGGFSSGGFSGGSFSRGSSGGGCFGG
# SSGGYGGLGGFGGGSFHGSYGSSSFGGSYGGSFGGGNFGGGSFGGGSFGGGGFGGGGFGGGFGGGFGGDGGLLSGNEKVTMQNLNDRLASYLDKVRAL
# EESNYELEGKIKEWYEKHGNSHQGEPRDYSKYYKTIDDLKNQRTT]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 20
# CDS exons: 1/3
#     E: 1
# CDS introns: 0/2
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 5
#     E: 3 (gi|154723761,gi|13907559,gi|154736078)
#     M: 2
# end gene g1
###
Different kinds of information are printed after the hash signs, e.g. the applied AUGUSTUS version and parameter set, the predicted coding sequence and the amino acid sequence. Predictions and hints are given in tab-separated gff format: the first column contains the target sequence, the second column the source of the feature, the third column the feature, the fourth column the feature start, the fifth column the feature end, the sixth column a score (if applicable), the seventh column the strand and the eighth column the reading frame. The ninth column contains the grouping and source information for hints, or the gene/transcript identifier for prediction lines.
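The nine columns described above can be read with a few lines of Python. The following sketch is not part of AUGUSTUS; the record type and the function name are hypothetical, and the code assumes that data lines are tab-separated and that comment lines start with a hash sign.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str            # column 1: target sequence
    source: str             # column 2: source of the feature
    feature: str            # column 3: feature type
    start: int              # column 4: feature start
    end: int                # column 5: feature end
    score: Optional[float]  # column 6: score, None if given as "."
    strand: str             # column 7: strand
    frame: str              # column 8: reading frame
    attributes: str         # column 9: grouping/source info or identifiers


def parse_gff_line(line):
    """Return a GffRecord, or None for comment and blank lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    score = None if cols[5] == "." else float(cols[5])
    return GffRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                     score, cols[6], cols[7], cols[8])


if __name__ == "__main__":
    example = ('Chr.1\tAUGUSTUS\tinternal\t1691\t2300\t0.86\t+\t1\t'
               'transcript_id "g1.t1"; gene_id "g1";')
    print(parse_gff_line(example))
```

Because hint lines and prediction lines share the same column layout, the same function reads both; only the content of the ninth column differs.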
Files that may optionally be contained in gene prediction archives:
- *.gtf - gene predictions in gtf format
- *.aa - gene predictions as protein fasta sequences
- *.codingseq - gene predictions as CDS DNA fasta sequences
- *.cdsexons - predicted exons in DNA fasta sequences
- *.mrna - predicted mRNA sequences (with UTRs) in DNA fasta sequences
- *.gbrowse - gene prediction track for the GBrowse genome browser
Click here to view a real AUGUSTUS prediction web service output!
It is important that you check the results of an AUGUSTUS gene prediction run. Do not trust predictions blindly! Prediction accuracy depends on the quality of the input sequence, on the quality of the hints and on whether a given parameter set fits the species of the supplied genomic sequence.
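As one possible starting point for such a check, the sketch below lists transcripts whose hint support is low. It assumes that the gff file contains tab-separated transcript lines and "% of transcript supported by hints" comment lines as in the format example in this section. The function names and the threshold of 50 percent are arbitrary, hypothetical choices, and such a filter does not replace a manual inspection.

```python
import re

SUPPORT = re.compile(
    r"^#\s*% of transcript supported by hints \(any source\):\s*([\d.]+)"
)


def hint_support(lines):
    """Map each transcript id to the percentage supported by hints."""
    support = {}
    current = None
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "transcript":
            # On transcript lines the ninth column holds the identifier.
            current = cols[8]
            continue
        match = SUPPORT.match(line)
        if match and current is not None:
            support[current] = float(match.group(1))
    return support


def weakly_supported(lines, threshold=50.0):
    """Return the sorted ids of transcripts with support below threshold."""
    return sorted(t for t, pct in hint_support(lines).items() if pct < threshold)


if __name__ == "__main__":
    # Assumes a prediction file named augustus.gff in the current directory.
    with open("augustus.gff") as handle:
        for transcript in weakly_supported(handle):
            print(transcript)
```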
Training results
You can find a detailed description of training results by clicking here.
I am not from academia/non-profit. What can I do?
Users who are neither from academia nor from a non-profit organisation, and who do not use our web application exclusively for personal purposes, have the following options:
- Run the training web server application with a genome file and an externally created training gene file
- Run AUGUSTUS predictions ab initio or with an externally created hint file
- Purchase a BLAT license from http://www.kentinformatics.com/ and run the autoAug Pipeline locally
No submission of personalized human sequence data!
AUGUSTUS is a tool for predicting genes in prokaryotic and eukaryotic sequences. AUGUSTUS has been trained for human (excellent parameter set, please don't retrain on your own), and genes in the human reference assembly have already been predicted with human parameters. Our web server does not adhere to the data security standards that are required by law for processing personalized human genome data. We ask all submitters to confirm the following by a checkbox:
"I am not submitting personalized human sequence data (mandatory)."
Why do I see a running dog when pressing the submission button?
As Loriot said (freely translated): Life without a dog is possible, but pointless. ... the animation is simply displayed to make the waiting time during job submission more pleasant ;-)