WebAUGUSTUS Help


This website contains short instructions and some frequently asked questions concerning

  • the training of AUGUSTUS and
  • predicting genes in a new genomic sequence with pre-trained parameters.

For more detailed instructions, please read Training Tutorial and Prediction Tutorial.

Contents

Why do I not get any results?
Why is the server busy?
What is the species name?
Why should I give my e-mail address?
File upload versus web link
Instructions for fasta headers
Which files must or can I submit for training AUGUSTUS?
Which files are required for predicting genes in a new genome?
Genome file
cDNA file
Protein file
Training gene structure file
Hints file
Parameter archive
What is the project identifier?
What does my job status mean?
UTR prediction: yes or no?
Allowed gene structure
What about that data duplication?
Why is the prediction accuracy in the genome of my species not as good as I expected?
What about data privacy and security?
Gene prediction results
Training results
I am not from academia/not non-profit. What can I do?
No submission of personalized human sequence data!
Why do I see a running dog after pressing the submission button?



Why do I not get any results?

  • Did an obvious error occur?
    Please contact augustus-web@uni-greifswald.de if you are not sure what the error message is telling you.

    One frequently occurring error in the AutoAug.err file is the following:

    The file with UTR parameters for train****** does not seem to exist. This likely means that the UTR model has not been trained yet for train******.

    This error message tells you that no UTR parameters were trained for your species. If no other error messages appear above the first UTR error message, the general results of your job are fine; you simply did not get UTR parameters and therefore no predictions with UTRs.

    Another frequently occurring error is the following:

    Illegal division by zero at scripts/autoAugTrain.pl line 241.
    failed to execute: No such file or directory

    This error occurs when no training gene structures were generated or available. This may be caused by one of the following circumstances:

    • You supplied a genome and protein file. In this case, Scipio was not able to generate any complete gene structures from the data set. In most cases, some incomplete gene structures were produced, but since they frequently cause crashes in the AUGUSTUS training routine, we do not use them within the web service.
    • The files that you supplied had long and complex fasta headers. This causes problems with PASA and Scipio. Make sure that the fasta headers in all your files are unique and short, and that they do not contain whitespace or special characters.
  • Did you submit your job a long time ago and it seems to be "stuck" at the status of "computing"?
    Please contact augustus-web@uni-greifswald.de to inquire whether your job is really still running.
  • Did your job finish but there are just no parameters or predictions?
    The quality of results depends on the quality and combination of your input data. If, for example, the input data did not provide sufficient information for generating training genes, then no AUGUSTUS parameters will be optimized for your species, and no predictions will be made. In the case of the gene prediction web server application, it is also possible that your submitted genome sequence does not contain any protein coding genes.




Why is the server busy?

Training AUGUSTUS is a very resource- and time-consuming process. We use a grid engine queuing system with a limited number of waiting slots. If we estimate that the time from job submission to computation start might be very long, our web server might display a message that our server is busy. The submission of new jobs is then disabled (both prediction and training submission). Please wait one or two weeks before you try a new submission. If the problem persists for longer than a month, or if your job is urgent, please contact augustus-web@uni-greifswald.de.




What is the species name?

The species name is the name of the species for whose genome you want to train AUGUSTUS. The species name is an obligatory parameter. Because AUGUSTUS training is such a time-consuming process, we want to know the names of species for which AUGUSTUS was trained so that we can make the trained parameters available to the public, and others who are interested in the same species do not have to rerun the training process. (We will only explicitly publish your parameter set with the next AUGUSTUS release after confirming via e-mail that you agree to this.)

However, if you do not want to reveal the true species name, you may use any other string shorter than 30 characters as a species name.

The species name is not allowed to contain spaces!




Why should I give my e-mail address?

Unlike many other bioinformatics web services, the AUGUSTUS web server application is not an implementation of a fail-safe procedure. In particular, the assembly of a training gene set from extrinsic data (ESTs and protein sequences) and a genome sequence may not always work perfectly. Our pipeline may issue warnings or errors, and sometimes we need to get some feedback from you via e-mail in order to figure out what the problem with your particular input data set is.

In addition, training and running AUGUSTUS are rather time-consuming processes that may take up to several weeks (depending on the input data). It may be more convenient to receive an e-mail notification when your job has finished than to check the status page over and over again.

Therefore, we strongly recommend that you enter an e-mail address.

If supplied, we use your e-mail address for the following purposes:

  • Confirming your job submission
  • Confirming successful file upload (for large files via ftp/http)
  • Notifying you about your job having finished
  • Informing you about any problems that might occur during your particular job and asking questions about that job in order to solve those problems

We do not use your e-mail address to send you any spam, e.g. about web service updates. We do not share your e-mail address with any third parties. Please read our Data Privacy Protection declaration.

Job submission without giving an email address is possible but discouraged.

If you provide an e-mail address, we kindly ask you to check the confirmation checkbox to indicate that you agree to the following terms:
"If I provide an e-mail address, I agree that it will be stored on the server until the computations of my job have finished. I agree to receive e-mails that are related to the particular WebAUGUSTUS job that I submitted."




File upload versus web link

The AUGUSTUS training and prediction web server application offers, in some cases, two ways of transferring files to the server: uploading a file or specifying a web link to a file.

  • For small files, please click on the Browse button and select a file on your hard drive.
    If you experience a connection timeout (because your file was too large for this type of upload), please use the option for large files!
  • Large files can be retrieved from a public web link. Specify a valid ftp or http URL to your sequence file. Our server will fetch the file from the given address. (WebAUGUSTUS does not accept Dropbox links!)

You cannot do both at the same time! For each file type (e.g. the genome file), you must either select a file on your hard drive or give a web link!




Instructions for fasta headers

We observed that most problems with generating training genes for training AUGUSTUS are caused by fasta headers in the sequence files. Some of the tools in our pipeline will truncate fasta headers if they are too long, contain spaces, or contain special characters. This leads to a lot of warning messages in the AutoAug.err file, and it may also lead to non-unique fasta entry names, which will cause the pipeline to crash. We therefore strongly recommend that you adhere to the following rules for fasta headers when using our web services:

  • no whitespace in the headers
  • no special characters in the headers (e.g. !#@&|;)
  • make the headers as short as possible
  • start headers with a letter, not with a number
  • use only letters and numbers in headers

In the following we give some header examples that will not cause problems:

>entry1
>contig1000
>est20
>scaffold239

The following kinds of headers will cause at least warning messages but probably also a pipeline crash:

>contig1 length=1000 Arabidopsis thaliana
>gi|123344545|some_protein|some_species
>Drosophila melanogaster scaffold 10000
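
The rules above can be checked locally before submission. The following Python sketch is one possible check and is not part of WebAUGUSTUS. The function names header_problems and duplicate_headers and the 20-character cap are assumptions made for illustration; the rules above only ask for headers that are as short as possible.

```python
import re

# Assumed cap for illustration; the rules above only say "as short as possible".
MAX_HEADER_LENGTH = 20


def header_problems(header_line):
    """Return the rule violations found in one fasta header line."""
    name = header_line.rstrip("\n")
    if name.startswith(">"):
        name = name[1:]
    if not name:
        return ["is empty"]
    problems = []
    if re.search(r"\s", name):
        problems.append("contains whitespace")
    if re.search(r"[^A-Za-z0-9\s]", name):
        problems.append("contains special characters")
    if not re.match(r"[A-Za-z]", name):
        problems.append("does not start with a letter")
    if len(name) > MAX_HEADER_LENGTH:
        problems.append("is longer than %d characters" % MAX_HEADER_LENGTH)
    return problems


def duplicate_headers(header_lines):
    """Return headers that occur more than once (headers must be unique)."""
    seen, duplicates = set(), []
    for line in header_lines:
        if line in seen and line not in duplicates:
            duplicates.append(line)
        seen.add(line)
    return duplicates
```

Calling header_problems(">contig1000") returns an empty list, while the three problematic headers above are each reported with at least one violation.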

If you have a fasta file with unsuitable headers and you do not know how to modify them automatically, you may use the Perl script simplifyFastaHeaders.pl. After saving it on your local Unix system, first check whether the location of Perl in the first line of the script is correct for your system (#!/usr/bin/perl). If Perl is installed in another location, you need to modify that line! Then, execute the script with the following parameters:

perl simplifyFastaHeaders.pl in.fa nameStem out.fa header.map

  • in.fa is the input fasta file; it must already be in valid fasta format.
  • nameStem is a character descriptor that is used as the start of all simplified headers, e.g. est, contig, or protein. Be aware that fasta headers must always be unique, so choose different nameStems for the genome, cDNA and protein files!
  • out.fa is the output fasta file with simplified fasta headers; this file can be processed by our web service.
  • header.map is a map that contains the simplified header and the original header in a tab-separated format.
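
For readers who prefer not to run the Perl script, the same idea can be sketched in a few lines of Python. This is not a port of simplifyFastaHeaders.pl. The function name simplify_headers and the naming scheme (nameStem followed by a running number) are assumptions for illustration, and the real script may name entries differently.

```python
def simplify_headers(fasta_lines, name_stem):
    """Rename fasta entries to name_stem + running number.

    Returns (new_lines, header_map), where header_map is a list of
    (simplified_header, original_header) pairs.
    """
    new_lines = []
    header_map = []
    counter = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            counter += 1
            # Assumed naming scheme: stem plus counter, e.g. contig1, contig2.
            simple = "%s%d" % (name_stem, counter)
            header_map.append((simple, line[1:]))
            new_lines.append(">" + simple)
        elif line:
            # Empty lines are dropped; sequence lines are kept unchanged.
            new_lines.append(line)
    return new_lines, header_map
```

Writing each pair of header_map as one tab-separated line gives a file comparable to header.map, which lets you translate sequence names in the gff results back to the original headers.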

Why is the simplification of fasta headers not a built-in function of the web service? The reason is that you should be able to recognize the predictions later on! Gene predictions will be made available in gff format, which contains the sequence name in the first column. Therefore, you should modify the fasta headers yourself before submitting data to the web service!




Which files must or can I submit for training AUGUSTUS?

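You need to specify

  • a genome file and
  • at least one of the following files: cDNA file, protein file, training gene structure file.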

Please consider that training AUGUSTUS is a time and resource consuming process. For optimal results, you should specify as much information as possible for a single training run instead of starting the AUGUSTUS training multiple times with different file combinations! If you have a lot of EST data, we recommend that you submit ESTs instead of protein sequences since ESTs will likely allow the generation of a UTR training set.




Which files are required for predicting genes in a new genome?

For predicting genes in a new genome with already trained parameters, you need to specify

  • a genome file and
  • a parameter archive. Instead of uploading the archive, you may also enter a valid project identifier if you trained AUGUSTUS on this web server and the training has already finished, or you may select a pre-trained parameter set from the drop-down menu.

You may in addition specify an EST/cDNA file and/or a hints file that will be used as extrinsic evidence for predicting genes.




Genome file

The genome file is an obligatory file for training AUGUSTUS and for making predictions with pre-trained parameters in a new genome. It must contain the genome sequence in (multiple) fasta format. Every header begins with a >. The sequence must be DNA. Allowed sequence characters: A a T t G g C c H h X x R r Y y W w S s M m K k B b V v D d N n. (Internally, AUGUSTUS will interpret everything that is not A a T t C c G g as an N!) Empty lines are not allowed. If they occur, they will automatically be removed by the webserver applications.

WebAUGUSTUS has a strict limit on the number of characters per line in FASTA format files. Exceeding this limit might cause memory issues on our server. We recommend formatting sequences in FASTA files submitted to WebAUGUSTUS with a Unix line break after 80 characters.
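
As an illustration of the two requirements above (allowed characters and line length), here is a minimal Python sketch. The names ALLOWED_DNA, invalid_characters and wrap_sequence are hypothetical helpers and not part of WebAUGUSTUS; the character set is copied from the list above.

```python
# Allowed genome sequence characters, copied from the list above.
ALLOWED_DNA = set("AaTtGgCcHhXxRrYyWwSsMmKkBbVvDdNn")


def invalid_characters(sequence):
    """Return the sorted characters of `sequence` that are not allowed."""
    return sorted(set(sequence) - ALLOWED_DNA)


def wrap_sequence(sequence, width=80):
    """Split one sequence string into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]
```

Joining the result of wrap_sequence with "\n" yields sequence lines with a Unix line break after 80 characters.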

Headers must be unique within a file! We recommend that you use short fasta headers. Headers like

>gi|382483733|gb|GZ667513.1|GW667513 SSH_BP_47 Some species
Wicked root cDNA library Some species cDNA clone SSH_BP_47 
similar to Putative NADH-cytochrome B5 reductase, mRNA sequence

are likely to cause a lot of warning messages. An example of a short header created from the overly long header above:

>GZ667513.1


Correct file format example:
>Chr.1
CCTCCTCCTGTTTTTCCCTCAATACAACCTCATTGGATTATTCAATTCAC
CATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTC
TCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGAC
AACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTC
CAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGC
TCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
>Chr.2
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAG
CGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACT
ACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGT
TAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAG
AGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGG
AGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCT
TACACGGAAATCAACGGCGGTGTCATAAGCGAG
>Chr.3
.....
            

The maximal number of scaffolds allowed in a genome file is 250000. If your file contains more scaffolds, please remove all short scaffolds. For training AUGUSTUS short scaffolds are worthless because no complete training genes can be generated from them. In terms of prediction, it is possible to predict genes in short scaffolds. However, those genes will in most cases be incomplete and probably unreliable.
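
One way to reduce a genome file to the allowed number of scaffolds is to keep only the longest ones. The sketch below assumes you have already collected scaffold lengths in a dictionary; the function name select_longest is hypothetical, and keeping the longest scaffolds is just one reasonable reading of "remove all short scaffolds".

```python
MAX_SCAFFOLDS = 250000  # limit stated above


def select_longest(lengths_by_name, limit=MAX_SCAFFOLDS):
    """Return the names of the `limit` longest scaffolds in input order.

    lengths_by_name maps scaffold name -> sequence length.
    """
    if len(lengths_by_name) <= limit:
        return list(lengths_by_name)
    ranked = sorted(lengths_by_name, key=lambda name: lengths_by_name[name],
                    reverse=True)
    keep = set(ranked[:limit])
    # Preserve the original file order of the retained scaffolds.
    return [name for name in lengths_by_name if name in keep]
```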




cDNA file

The cDNA file is a multiple fasta DNA file that contains e.g. ESTs or full-length cDNA sequences. Allowed sequence characters: A a T t G g C c H h X x R r Y y W w S s M m K k B b V v D d N n U u. Empty lines are not allowed and will be removed from the submitted file by the webserver application. See Genome file for a format example. Upload of a cDNA file to our web server application will invoke the software BLAT [2], which is on our webserver application only available for academic, personal and non-profit use.




Protein file

The protein file is a multiple fasta file that contains protein sequences as supporting evidence for genes. Allowed sequence characters: A a R r N n D d C c E e Q q G g H h I i L l K k M m F f P p S s T t W w Y y V v B b Z z J j X x. Empty lines are not allowed but will simply be removed from the file by the webserver application.

Correct file format example:

>protein1
maaaafgqlnleepppiwgsrsvdcfekleqigegtygqvymakeiktgeivalkkirmd
neregfpitaireikilkklhhenvihlkeivtspgrdrddqgkpdnnkykggiymvfey
mdhdltgladrpglrftvpqikcymkqlltglhychvnqvlhrdikgsnllidnegnlkl
adfglarsyshdhtgnltnrvitlwyrppelllgatkygp
>protein2
neregfpitaireikilkklhhenvihlkeivtspgrdrddqgkpdnnkykggiymvfey
mdhdltgladrpglrftvpqikcymkqlltglhychvnqv
>protein3
...
            

Submitting a protein file to our AUGUSTUS training web server application will invoke Scipio [3], which uses BLAT [2]. Therefore, protein file upload is only available for academic, personal and non-profit use on our web server application.




Training gene structure file

You can submit your own, externally created training gene structures to the AUGUSTUS training web server application. Regardless of the format, gene structure files must not contain Java metacharacters like "*" or "?".

Training gene structure files can be submitted in two different formats: Genbank format or gff format.

Training gene structure file in genbank format

Gene structures in genbank format must contain the coding sequence parts and flanking regions. Flanking regions are important because AUGUSTUS is supposed to differentiate between genes and intergenic regions. The length of flanking regions depends on the length of genes in the target genome. In our pipeline, flanking regions are set to the average gene length, but to no less than 1000 nt and no more than 10000 nt. It is very important to make sure that the flanking regions do not contain parts of any other protein coding gene, i.e. we recommend trimming flanking regions in a way that excludes other CDS parts.
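
The flanking region rule can be written as a small calculation. The sketch below is an illustration only; the function name flanking_length and the truncation to a whole number are assumptions, while the limits of 1000 and 10000 nt are taken from the paragraph above.

```python
def flanking_length(gene_lengths, minimum=1000, maximum=10000):
    """Average gene length, restricted to the range [minimum, maximum]."""
    average = sum(gene_lengths) / len(gene_lengths)
    # Truncating to a whole number of nucleotides is an assumption.
    return int(min(max(average, minimum), maximum))
```

For example, genes of 2000 and 4000 nt give flanking regions of 3000 nt, whereas very short or very long genes are capped at 1000 and 10000 nt, respectively.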

It is important for our pipeline that the LOCUS names within a submitted training gene structure file are unique, i.e. you should not use the same LOCUS name more than one time!

Correct file format example (condensed view, the three dots represent further lines of sequences):

LOCUS       Chr.1_1-159458   159458 bp  DNA
FEATURES             Location/Qualifiers
     source          1..159458
     CDS             complement(join(2421..2655,3858..4005,4080..4235,5569..5857
                     ,10316..10534,155240..155458))
                     /gene="1474336"
BASE COUNT     49195 a   29117 c  28985 g   49950 t   2211 n
ORIGIN
        1 aaaatacatc acaatacatt taattcactt tccatcatcg agattaacga aaattattta
       61 aaatatcgaa gatgaaaata tcctcaagat gatactgaac ggctaagaaa aatacatcac
      121 acaactttaa ttcattttcc atcatcgaga ttaacgaaaa gaaaaaattt taactcccta
...
   159301 atacgccacc aggtatttcg cctgattgtt cctcgaatat cttctctctc tctatatata
   159361 tatatattac ttggcacgat aatcgtcgaa tcgttattta taaattgctt catctatcgc
   159421 gatatttttg caacaactct cgcttttctc tttccatt
//
LOCUS       Chr.1_313992-323129   9138 bp  DNA
FEATURES             Location/Qualifiers
     source          1..9138
     CDS             join(4001..4048,4989..5138)
                     /gene="194551"
BASE COUNT     2829 a   1502 c  1750 g   2948 t   109 n
ORIGIN
        1 ttttccttct ttcttttttt tttatttaca ttaatgagaa ttttcgcaaa tatttcatcg
       61 ctgccatcct tttttttcct cgacgtcaat cacgcgacac atttgttaga gaaatggatt
      121 ttaatcttga aaaaagaaaa atacaaatgc caacgcattt caaatccttt cctattatta
...
     9001 tcaacgaaac aaataattgc ttcacaaaat atcgcacgta acaacaatat agacttcaat
     9061 attcaacaat tcttttcctt tatacacaaa gatacacaaa atataaaagt tttaatactt
     9121 caacttcaac gaaacagg
//
            

If you want to train UTRs, you have to additionally incorporate mRNA information in your genbank file.

Correct file format example (including UTR training):

LOCUS       scf7180001240730_g20   526 bp  DNA
FEATURES             Location/Qualifiers
     source          1..526
     mRNA            99..125
     CDS             99..99
BASE COUNT     164 a   99 c  68 g   195 t
ORIGIN
        1 gtgacggagc ccaaggacga gcccgtgccc tcagagccca cgtccgacgt gaggcccgcg
       61 ccagcgcccc tcccgccgcc cgtcgcagcc actgcttaga ctttactaat ataaacattg
      121 aaaatatttt gtgttttatt tccaatcatt gaattataat cctattataa tataactaac
      181 attcgtaatt ttacaaaata actatgcaaa ttattttgta ttttcgtttt aaattatact
      241 tttcatataa atttctacaa atcttattca agaccataag tatccgctcg ctctacttcg
      301 ggcatttcct ttatttatat cttatttgac ttattttgat tatttaggct tatgttttcg
      361 atactattga aaacagaaaa taatttcata taattaataa tatattttca attaatatat
      421 ttaacaaata tttgtatagt tcaagcggac aaatccgttc ccatagtatt tatataaatt
      481 ttaatttaga gtaataacag tttgctgtat tgttgtagtc aaatac
//
LOCUS       scf7180001240751_g30   876 bp  DNA
FEATURES             Location/Qualifiers
     source          1..876
     mRNA            complement(401..777)
     CDS             complement(777..777)
BASE COUNT     300 a   136 c  116 g   324 t
ORIGIN
        1 aatgtaggaa aatgaaatat ttatttaaat tgttattatc acttcttcgc tctagtgtct
       61 tggcaaagcg cggcgttgag ttcagcctct cacacgcaat gcctccagaa ttcggcgaaa
      121 tgtgggggac agagtgtatt aacactaagt tccctcagcc acgactggtg aaattatata
      181 ttcagtttgt atactattac tcatgcaaac acttcatcat actttcactc aatcagtaaa
      241 gcataatatt ttatttaata ttgtttatca atactatttc cttgttgtta aatattattt
      301 tatttattat attaaattaa aatgtcaaaa ttaaaagtag gtgatgattt attactatct
      361 tttctatcca agaaaaaaaa gacacactga aacaattgta atttttgtta tgtttttatt
      421 acttaatatt attataaaaa tttgtaaata cgaaataaaa tagatagacg taataatatt
      481 tatttgttag ttaataataa taatgataat tacgaaagat acaagaaata tgcataaatg
      541 agtgttatat tatgtatttt atgagaatat aaatataaaa actgtcattg attatatttt
      601 ctaaatactt tcattttatg gcttgctggc ttttcaattt ccttatgttt cagcttttca
      661 ctcaatagag cgaaaccttc atcgacatgt aagccaatag aacaattaca aactaacttt
      721 attacatcag tcttttcatt tctttaagct tcaggcaaat atcatctaaa tgcctttcaa
      781 ctcgctacta acatcgcgtc gttatataaa tcagtgtata cggaattaaa cctgtcatgt
      841 ctcttgcaag acgtgtctgc tgttgtcacg cacaca
//
            

Training gene structure file in gff format

The sequence names in a training gene structure file in gff format must match the fasta entry names of the genome file.

In general, gff format must contain the following columns (the columns are separated by tabs):

  1. The sequence names must be found in the fasta headers of sequences in the genome file.
  2. The source tells with which software/process the gene structure was generated (you can fill in whatever you like).
  3. The feature may for AUGUSTUS training be CDS, 5'-UTR or 3'-UTR.
  4. Start is the beginning position of the line's feature, counting the first position of a sequence as position 1.
  5. Stop position, must be at least as large as start position.
  6. The score must be a number but the number is irrelevant to our web server applications.
  7. The strand denotes whether the gene is located on the forward (+) or on the reverse (-) strand.
  8. Frame is the reading frame; it can be denoted as '.' if unknown or irrelevant. For exonpart and exon it is defined as follows: On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) the next codon boundary comes (0, 1 or 2).
  9. Attribute contains a transcript identifier. All gff-entries belonging to one transcript must contain the same transcript identifier in the last column.
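
The column rules above can be checked line by line before submission. The following Python sketch is one possible check and not the validation that the web server itself performs. The function name training_gff_problems is hypothetical, and the test for the substring transcript_id assumes that identifiers are written as in the examples in this section.

```python
TRAINING_FEATURES = {"CDS", "5'-UTR", "3'-UTR"}


def training_gff_problems(line, genome_names):
    """Return a list of problems found in one training gff line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return ["expected 9 tab-separated columns, found %d" % len(fields)]
    seqname, _source, feature, start, stop, score, strand, frame, attribute = fields
    problems = []
    if seqname not in genome_names:
        problems.append("sequence name not found in genome file")
    if feature not in TRAINING_FEATURES:
        problems.append("feature is not CDS, 5'-UTR or 3'-UTR")
    if not (start.isdecimal() and stop.isdecimal()):
        problems.append("start and stop must be positive integers")
    elif int(start) < 1 or int(stop) < int(start):
        problems.append("need 1 <= start <= stop")
    try:
        float(score)
    except ValueError:
        problems.append("score is not a number")
    if strand not in ("+", "-"):
        problems.append("strand must be + or -")
    if frame not in ("0", "1", "2", "."):
        problems.append("frame must be 0, 1, 2 or .")
    # Assumption: the identifier is written as transcript_id "...".
    if "transcript_id" not in attribute:
        problems.append("attribute lacks a transcript identifier")
    return problems
```

Here genome_names is the set of fasta entry names from the genome file, for example {"Chr.1", "Chr.2"}.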

Correct file format example (without UTR):

Chr.1 mySource CDS   1767  1846  1.000 -  0  transcript_id "1597_1"
Chr.1 mySource CDS   1666  1709  1.000 -  1  transcript_id "1597_1"
Chr.1 mySource CDS   1486  1605  1.000 -  2  transcript_id "1597_1"
Chr.1 mySource CDS   1367  1427  1.000 -  2  transcript_id "1597_1"
Chr.1 mySource CDS   1266  1319  1.000 -  1  transcript_id "1597_1"
Chr.1 mySource CDS   1145  1181  1.000 -  1  transcript_id "1597_1"
Chr.1 mySource CDS   847   1047  1.000 -  0  transcript_id "1597_1"
Chr.2 mySource CDS   9471  9532  1.000 +  0  transcript_id "1399_2"
Chr.2 mySource CDS   9591  9832  1.000 +  1  transcript_id "1399_2"
Chr.2 mySource CDS   9885  10307 1.000 +  2  transcript_id "1399_2"
Chr.2 mySource CDS   10358 10507 1.000 +  2  transcript_id "1399_2"
Chr.2 mySource CDS   10564 10643 1.000 +  2  transcript_id "1399_2"
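
The frame column in the example above follows directly from the CDS lengths. As a worked illustration, the Python sketch below recomputes the frames of a transcript; the function name cds_frames is hypothetical, and the sketch assumes a complete transcript whose first CDS begins with a full codon (frame 0).

```python
def cds_frames(cds_intervals, strand):
    """Return (start, stop, frame) per CDS in transcript order.

    cds_intervals holds 1-based, inclusive (start, stop) pairs.
    Assumption: the first CDS of the transcript has frame 0.
    """
    # On the reverse strand the transcript starts at the highest coordinate.
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    result = []
    frame = 0
    for start, stop in ordered:
        result.append((start, stop, frame))
        length = stop - start + 1
        # Bases of this CDS beyond the frame offset fill whole codons;
        # the remainder determines the offset of the next CDS.
        frame = (frame - length) % 3
    return result
```

Applied to the five CDS lines of transcript 1399_2 with strand "+", this gives the frames 0, 1, 2, 2, 2, and for the seven CDS lines of transcript 1597_1 with strand "-" it gives 0, 1, 2, 2, 1, 1, 0, matching the eighth column of the example.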

Correct file format example (with UTR):

Chr.1 mySource 5'-UTR   277153   277220   45 +  .  transcript_id "g22472.t1"; gene_id "g22472";
Chr.1 mySource CDS   277221   277238   1  +  0  transcript_id "g22472.t1"; gene_id "g22472";
Chr.1 mySource CDS   278100   278213   1  +  0  transcript_id "g22472.t1"; gene_id "g22472";
Chr.1 mySource CDS   278977   279169   1  +  0  transcript_id "g22472.t1"; gene_id "g22472";
Chr.1 mySource CDS   279630   279648   0.94  +  2  transcript_id "g22472.t1"; gene_id "g22472";
Chr.1 mySource CDS   279734   279768   0.94  +  1  transcript_id "g22472.t1"; gene_id "g22472";
Chr.1 mySource CDS   280307   280344   1  +  2  transcript_id "g22472.t1"; gene_id "g22472";
Chr.1 mySource 3'-UTR   280345   280405   78 +  .  transcript_id "g22472.t1"; gene_id "g22472";




Hints file

For the gene prediction web server application, it is possible to submit an externally created file that contains extrinsic evidence for gene structures in gff format.

Comment lines are not allowed.

It makes no sense to upload gene prediction files (e.g. augustus.gff from AUGUSTUS gene prediction) as hints to WebAUGUSTUS. This is therefore not allowed.

In general, gff format must contain the following columns (the columns are separated by tabs):

  1. The sequence names must be found in the fasta headers of sequences in the genome file.
  2. The source tells with which software/process the gene structure was generated (you can fill in whatever you like).
  3. The feature may for AUGUSTUS gene prediction be
    • start - translation start, specifies an interval that contains the start codon. The interval can be larger than 3 nucleotides, in which case every ATG in the interval gets a bonus.
    • stop - translation end (stop codon)
    • tss - transcription start site
    • tts - transcription termination site
    • ass - acceptor (3') splice site, the last intron position
    • dss - donor (5') splice site, the first intron position
    • exonpart - part of an exon in the biological sense.
    • exon - complete exon in the biological sense.
    • intronpart - introns both between coding and non-coding exons.
    • intron - complete intron in the biological sense
    • CDSpart - part of the coding part of an exon. (CDS = coding sequence)
    • CDS - coding part of an exon with exact boundaries. For internal exons of a multi exon gene this is identical to the biological boundaries of the exon. For the first and the last coding exon the boundaries are the boundaries of the coding sequence (start, stop).
    • UTRpart - The hint interval must be included in the UTR part of an exon.
    • UTR - exact boundaries of a UTR exon or the untranslated part of a partially coding exon.
    • irpart - intergenic region part. The bonus applies to every base of the intergenic region. If UTR prediction is turned on (--UTR=on) then UTR is considered genic.
    • nonexonpart - intergenic region or intron.
    • genicpart - everything that is not intergenic region, i.e. intron or exon or UTR if applicable.
  4. Start is the beginning position of the line's feature, counting the first position of a sequence as position 1.
  5. Stop position, must be at least as large as start position.
  6. The score must be a number but the number is irrelevant to our web server applications.
  7. The strand denotes whether the gene is located on the forward (+) or on the reverse (-) strand.
  8. Frame is the reading frame; it can be denoted as '.' if unknown or irrelevant. For exonpart and exon it is defined as follows: On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) the next codon boundary comes (0, 1 or 2).
  9. For usage as a hint, Attribute must contain the string source=M (for manual). Other sources, such as EST or protein, are possible, but only in the command line version of AUGUSTUS. Source types other than M are ignored by the AUGUSTUS web server applications.


Correct format example:

Chr.1 anchor  exonpart        500     506     0       -       .       source=M
Chr.1 anchor  exon            966     1017    0       +       0       source=M
Chr.1 anchor  start           966     968     0       +       0       source=M
Chr.1 anchor  dss             2199    2199    0       +       .       source=M
Chr.1 anchor  stop            7631    7633    0       +       0       source=M
Chr.1 anchor  intronpart      7631    7633    0       +       0       source=M
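
The hint-specific rules (no comment lines, a known feature type, and source=M in the attribute column) can be checked with a short script. The Python sketch below is only an illustration; the function name hint_problems is hypothetical, and it assumes that comment lines start with "#" and that multiple attributes are separated by ";".

```python
# Feature types listed above for hints.
HINT_FEATURES = {
    "start", "stop", "tss", "tts", "ass", "dss", "exonpart", "exon",
    "intronpart", "intron", "CDSpart", "CDS", "UTRpart", "UTR",
    "irpart", "nonexonpart", "genicpart",
}


def hint_problems(lines):
    """Return (line_number, message) pairs for lines that break the rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Assumption: comment lines are those starting with '#'.
        if text.startswith("#"):
            problems.append((number, "comment lines are not allowed"))
            continue
        fields = text.split("\t")
        if len(fields) != 9:
            problems.append((number, "expected 9 tab-separated columns"))
            continue
        if fields[2] not in HINT_FEATURES:
            problems.append((number, "unknown hint feature: " + fields[2]))
        # Assumption: attributes are separated by ';'.
        attributes = [part.strip() for part in fields[8].split(";")]
        if "source=M" not in attributes:
            problems.append((number, "attribute lacks source=M"))
    return problems
```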
            



Parameter archive

A *.tar.gz archive with a folder containing the following files is required for predicting genes in a new genome with pre-trained parameters:

  • species/species_parameters.cfg
  • species/species_metapars.cfg
  • species/species_metapars.utr.cfg
  • species/species_exon_probs.pbl.withoutCRF
  • species/species_exon_probs.pbl
  • species/species_weightmatrix.txt
  • species/species_intron_probs.pbl
  • species/species_intron_probs.pbl.withoutCRF
  • species/species_igenic_probs.pbl
  • species/species_igenic_probs.pbl.withoutCRF

where species is replaced by the name of the species you trained AUGUSTUS for (e.g. carrot would result in carrot/carrot_parameters.cfg). The additional species before the slash means that all those files must reside in a directory that is called species (or in our example: carrot) before you tar and gzip it. If you simply tar and gzip the folder that contains the parameters of an AUGUSTUS training run, everything should work fine.
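
If you prefer to build the archive with a script, the following Python sketch checks that the listed files are present and then packs the folder with the standard tarfile module. The function names missing_parameter_files and pack_parameters are hypothetical; the file list is copied from above, and the species name is assumed to be the name of the folder.

```python
import os
import tarfile

# File name endings copied from the list above.
REQUIRED_SUFFIXES = [
    "_parameters.cfg", "_metapars.cfg", "_metapars.utr.cfg",
    "_exon_probs.pbl.withoutCRF", "_exon_probs.pbl", "_weightmatrix.txt",
    "_intron_probs.pbl", "_intron_probs.pbl.withoutCRF",
    "_igenic_probs.pbl", "_igenic_probs.pbl.withoutCRF",
]


def missing_parameter_files(species_dir):
    """Return the expected file names that are absent from species_dir."""
    # Assumption: the folder name is the species name, e.g. carrot.
    species = os.path.basename(os.path.normpath(species_dir))
    return [species + suffix for suffix in REQUIRED_SUFFIXES
            if not os.path.isfile(os.path.join(species_dir, species + suffix))]


def pack_parameters(species_dir, archive_path):
    """Write species_dir into a *.tar.gz archive, keeping the folder level."""
    species = os.path.basename(os.path.normpath(species_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps the leading species/ directory inside the archive.
        archive.add(species_dir, arcname=species)
    return archive_path
```

For a folder named carrot, pack_parameters("carrot", "carrot.tar.gz") produces an archive whose entries start with carrot/, for example carrot/carrot_parameters.cfg.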




What is the project identifier?

If you trained AUGUSTUS on this web server, you may simply specify the project identifier of that training run instead of uploading a parameter archive. You find the project identifier, for example, in the subject line of your training confirmation e-mail, where it says Your AUGUSTUS training job project_id. Project identifiers typically consist of the letters pred or train, followed by a random string of 8 characters, resulting in, for example, train345kljD4.
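
A quick plausibility check for an identifier copied from an e-mail could look like the Python sketch below. The pattern is an assumption derived from the description and the single example above (pred or train plus 8 letters or digits); the server may accept or issue identifiers that differ from it.

```python
import re

# Assumed pattern: "pred" or "train" followed by 8 letters or digits.
PROJECT_ID = re.compile(r"^(pred|train)[A-Za-z0-9]{8}$")


def looks_like_project_id(text):
    """Return True if text has the typical shape of a project identifier."""
    return bool(PROJECT_ID.match(text.strip()))
```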




What does my job status mean?

In the beginning, the status page will display that your job has been submitted. This means that the web server application is currently uploading your files and validating file formats. After a while, the status will change to waiting for execution. This means that all file formats have been confirmed and an AUGUSTUS training job has been submitted to our grid engine, but the job is still pending. Depending on the length of the waiting queue, this status may persist for a while. Please contact us if your job is pending for more than one month. Later, the job status will change to computing. This means that the job is currently running. When the page displays finished, all computations have been completed and a website with your job's results has been generated.

If you specified an e-mail address, you will receive an e-mail with the link to the results of your job when computations are finished.




UTR prediction: yes or no?

Predicting UTRs takes significantly more time. In addition to reporting UTRs, however, prediction with UTRs is usually also a little more accurate on the coding regions when ESTs are given as extrinsic evidence.

UTR prediction is only possible if UTR parameter files exist for your species. Even if UTR parameter files exist for a species, you should make sure that they are species-specific, i.e. that they have actually been optimized for your target species. It is a waste of time to predict UTRs with general (template) parameters.

UTR prediction is only supported in combination with the following two gene structure constraints:

  • predict any number of (possibly partial) genes
  • only predict complete genes

UTR prediction is not possible in combination with the following constraints:

  • only predict complete genes - at least one
  • predict exactly one gene
  • ignore conflicts with other strand

If no UTR parameter files exist for your species but you enable UTR prediction in the form, the web server application will overrule the choice to predict UTRs by simply not predicting any UTRs.

Species for which UTR parameters are available:

  • Acyrthosiphon pisum (pea_aphid)
  • Amphimedon queenslandica (amphimedon)
  • Apis mellifera (honeybee1)
  • Bombus terrestris (bombus_terrestris2)
  • Caenorhabditis elegans (caenorhabditis)
  • Drosophila melanogaster (fly)
  • Homo sapiens (human)
  • Trichinella spiralis (trichinella)
  • Toxoplasma gondii (toxoplasma)
  • Arabidopsis thaliana (arabidopsis)
  • Chlamydomonas reinhardtii (chlamy2011)
  • Galdieria sulphuraria (galdieria)
  • Solanum lycopersicum (tomato)




Allowed gene structure

Predict any number of (possibly partial) genes: This option is set by default. AUGUSTUS may predict no gene at all, one or more genes. The genes at the boundaries of the input sequence may be partial. Partial here means that not all of the exons of a gene are contained in the input sequence, but it is assumed that the sequence starts or ends in a non-coding region.

Predict only complete genes: AUGUSTUS assumes that the input sequence does not start or end within a gene. Zero or more complete genes are predicted.

Predict only complete genes - at least one: Same as the previous option, but AUGUSTUS predicts at least one gene (if possible).

Predict exactly one complete gene: AUGUSTUS assumes that the sequence contains exactly one complete gene. Note: This feature does not work properly in combination with alternative transcripts.

Ignore conflicts with other strand: By default AUGUSTUS assumes that no genes - even on opposite strands - overlap. Indeed, this usually is the case but sometimes an intron contains a gene on the opposite strand. In this case, or when AUGUSTUS makes a false prediction on the one strand because it falsely thinks there is a conflicting gene on the other strand, AUGUSTUS should be run with this option set. It then predicts the genes on each strand separately and independently. This may lead to more false positive predictions, though.
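
If you run AUGUSTUS locally instead of using the web form, the five options above roughly correspond to the command-line options --genemodel and --singlestrand. The following sketch builds such a command line. The option keys and the function name are hypothetical, and it is an assumption that the web server issues exactly these calls.

```python
# Assumed correspondence between the form options and AUGUSTUS
# command-line arguments; the dictionary keys are hypothetical labels.
GENE_STRUCTURE_ARGS = {
    "any_number_partial": ["--genemodel=partial"],
    "only_complete": ["--genemodel=complete"],
    "at_least_one_complete": ["--genemodel=atleastone"],
    "exactly_one_complete": ["--genemodel=exactlyone"],
    "ignore_other_strand": ["--singlestrand=true"],
}


def augustus_command(species, genome_fasta, option="any_number_partial"):
    """Return the argument list for a local AUGUSTUS run."""
    if option not in GENE_STRUCTURE_ARGS:
        raise ValueError(f"unknown gene structure option: {option}")
    return ["augustus", f"--species={species}",
            *GENE_STRUCTURE_ARGS[option], genome_fasta]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(augustus_command("human", "input.fa", "only_complete")))
```

The list can be passed to subprocess.run once AUGUSTUS is installed and on your PATH.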




What about that data duplication?

We try to avoid data duplication. If you submit data that has already been submitted before, by you or by somebody else, we will display a link to the previously submitted job.




Why is the prediction accuracy in the genome of my species not as good as I expected?

The gene prediction accuracy of AUGUSTUS in the genome of a certain species depends on the quality of the training genes that were used for optimizing the species-specific parameters. The pipeline behind our AUGUSTUS training web server application offers a fully automated way of generating training genes, but it does not replace the manual quality checks that are often needed to improve the quality of the training gene set.

In order to improve accuracy, you could manually inspect the generated training genes and select a trustworthy subset and try retraining AUGUSTUS with this subset. It also helps to compare the training gene set to other sources of evidence that are not supported by our web server application, e.g. RNA-seq data.




What about data privacy and security?

Please read our data privacy protection declaration (available in German only).

The results of your job submission (i.e. in case of the training web server application that means log files, trained parameters, training genes, ab initio gene predictions and gene prediction with hints; or in case of the prediction web server application the augustus prediction archive) are publicly available. The link to your job status contains a long, pseudo-random string (uuid), and one needs to guess the string in order to get access to the results - but this is not particularly secure!

Other users who submit exactly the same input files as a previous job will be redirected to the results page of that job. They do not need to guess the link.

Files that you upload to our server, e.g. sequence files, are not directly made available to anyone. However, if you choose to provide a file via an http/ftp link, the link to your file is displayed on the job status page.

We are interested in redistributing high quality parameter sets for novel species with the AUGUSTUS release. We will not do so without your explicit permission. Please contact us if you would like to add your parameters to the AUGUSTUS repository.

Our server logs e-mail addresses, IP addresses and all job submission details. We store this data for up to 180 days in order to be able to trace back errors. E-mail addresses are stored until your job has finished. By submitting a job, you agree that we log this data.

Please contact augustus-web@uni-greifswald.de if your particular job requires a more secure environment, e.g. as part of a collaboration.




Prediction results

After job computations have finished, you will receive an e-mail (if you supplied an e-mail address). The job status web page may at this point in time look similar to this:

image of results example

This page should contain the file augustus.tar.gz. Please right-click the link and select "Save As" (or similar) to save the file to your local hard drive.

augustus.tar.gz is a gene prediction archive and its content depends on the combination of input files. You can unpack the archive by typing tar -xzvf *.tar.gz into your shell. (You can find more information about the software tar at the GNU tar website.)
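
If no tar program is available, the archive can also be unpacked with Python's standard tarfile module. This is a minimal sketch; the function name and the target directory are only illustrative.

```python
import tarfile


def unpack_prediction_archive(archive_path, target_dir="."):
    """Extract a .tar.gz archive and return the names of its members."""
    # Only unpack archives that you downloaded yourself and trust.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return names


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for name in unpack_prediction_archive(sys.argv[1]):
            print(name)
```

Called as a script with the path to augustus.tar.gz, it prints the extracted file names, similar to the -v flag of tar.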

Files that are always contained in gene prediction archives:

  • *.gff - gene predictions in gff format

Format example AUGUSTUS prediction gff file:
# This output was generated with AUGUSTUS (version 2.6).
# AUGUSTUS is a gene prediction tool for eukaryotes written by Mario Stanke (mario.stanke@uni-greifswald.de)
# and Oliver Keller (keller@cs.uni-goettingen.de).
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# reading in the file /var/tmp/augustus/AUG-1855139717/hints.gff ...
# Setting 1group1gene for E.
# Sources of extrinsic information: M E 
# Have extrinsic information about 1 sequences (in the specified range). 
# Initializing the parameters ...
# human version. Use default transition matrix.
# Looks like /var/tmp/augustus/AUG-1855139717/input.fa is in fasta format.
# We have hints for 1 sequence and for 1 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 6483, name = Chr.1) -----
#
# Delete group HintGroup , 5803-5803, mult= 1, priority= -1 1 features
# Forced unstranded hint group to the only possible strand for 3 groups.
# Deleted 1 groups because some hint was not satisfiable.
# Constraints/Hints:
Chr.1 anchor   start 182   184   0  +  .  src=M
Chr.1 anchor   stop  3058  3060  0  +  .  src=M
Chr.1 anchor   dss   4211  4211  0  +  .  src=M
Chr.1 b2h   ep 1701  2075  0  .  .  grp=154723761;pri=4;src=E
Chr.1 b2h   ep 1716  2300  0  +  .  grp=13907559;pri=4;src=E
Chr.1 b2h   ep 1908  2300  0  +  .  grp=154736078;pri=4;src=E
Chr.1 b2h   ep 3592  3593  0  +  .  grp=13907559;pri=4;src=E
Chr.1 b2h   ep 3836  3940  0  +  .  grp=154736078;pri=4;src=E
Chr.1 b2h   ep 5326  5499  0  +  .  grp=27937842;pri=4;src=E
Chr.1 b2h   ep 5805  6157  0  +  .  grp=27937842;pri=4;src=E
Chr.1 b2h   exon  3142  3224  0  +  .  grp=13907559;pri=4;src=E
Chr.1 b2h   exon  3142  3224  0  +  .  grp=154736078;pri=4;src=E
Chr.1 b2h   exon  3592  3748  0  +  .  grp=154736078;pri=4;src=E
Chr.1 anchor   intronpart  5000  5100  0  +  .  src=M
Chr.1 b2h   intron   2301  3141  0  +  .  grp=13907559;pri=4;src=E
Chr.1 b2h   intron   2301  3141  0  +  .  grp=154736078;pri=4;src=E
Chr.1 b2h   intron   3225  3591  0  +  .  grp=13907559;pri=4;src=E
Chr.1 b2h   intron   3225  3591  0  +  .  grp=154736078;pri=4;src=E
Chr.1 b2h   intron   3749  3835  0  +  .  grp=154736078;pri=4;src=E
Chr.1 b2h   intron   5500  5804  0  +  .  grp=27937842;pri=4;src=E
Chr.1 anchor   CDS   6194  6316  0  -  0  src=M
Chr.1 anchor   CDSpart  5900  6000  0  +  .  src=M
# Predicted genes for sequence number 1 on both strands
# start gene g1
Chr.1 AUGUSTUS gene  182   3060  0.63  +  .  g1
Chr.1 AUGUSTUS transcript  182   3060  0.63  +  .  g1.t1
Chr.1 AUGUSTUS start_codon 182   184   .  +  0  transcript_id "g1.t1"; gene_id "g1";
Chr.1 AUGUSTUS initial  182   225   1  +  0  transcript_id "g1.t1"; gene_id "g1";
Chr.1 AUGUSTUS internal 1691  2300  0.86  +  1  transcript_id "g1.t1"; gene_id "g1";
Chr.1 AUGUSTUS terminal 3049  3060  0.74  +  0  transcript_id "g1.t1"; gene_id "g1";
Chr.1 AUGUSTUS CDS   182   225   1  +  0  transcript_id "g1.t1"; gene_id "g1";
Chr.1 AUGUSTUS CDS   1691  2300  0.86  +  1  transcript_id "g1.t1"; gene_id "g1";
Chr.1 AUGUSTUS CDS   3049  3060  0.74  +  0  transcript_id "g1.t1"; gene_id "g1";
Chr.1 AUGUSTUS stop_codon  3058  3060  .  +  0  transcript_id "g1.t1"; gene_id "g1";
# coding sequence = [atgatgaaaccctgtctctaccaaaaagacaaaaaattagccagctcaagcaagcactactcttcctcccgcagtggag
# gaggaggaggaggaggaggatgtggaggaggaggaggagtgtcatccctaagaatttctagcagcaaaggctcccttggtggaggatttagctcaggg
# gggttcagtggtggctcttttagccgtgggagctctggtgggggatgctttgggggctcatcaggtggctatggaggattaggaggttttggtggagg
# tagctttcatggaagctatggaagtagcagctttggtgggagttatggaggcagctttggagggggcaatttcggaggtggcagctttggtgggggca
# gctttggtggaggcggctttggtggaggcggctttggaggaggctttggtggtggatttggaggagatggtggccttctctctggaaatgaaaaagta
# accatgcagaatctgaatgaccgcctggcttcctacttggacaaagttcgggctctggaagaatcaaactatgagctggaaggcaaaatcaaggagtg
# gtatgaaaagcatggcaactcacatcagggggagcctcgtgactacagcaaatactacaaaaccatcgatgaccttaaaaatcagagaacaacataa]
# protein sequence = [MMKPCLYQKDKKLASSSKHYSSSRSGGGGGGGGCGGGGGVSSLRISSSKGSLGGGFSSGGFSGGSFSRGSSGGGCFGG
# SSGGYGGLGGFGGGSFHGSYGSSSFGGSYGGSFGGGNFGGGSFGGGSFGGGGFGGGGFGGGFGGGFGGDGGLLSGNEKVTMQNLNDRLASYLDKVRAL
# EESNYELEGKIKEWYEKHGNSHQGEPRDYSKYYKTIDDLKNQRTT]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 20
# CDS exons: 1/3
#      E:   1 
# CDS introns: 0/2
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 5
#      E:   3 (gi|154723761,gi|13907559,gi|154736078)
#      M:   2 
# end gene g1
###         

Different kinds of information are printed after the hash signs, e.g. the AUGUSTUS version and parameter set that were used, the predicted coding sequence and the amino acid sequence. Predictions and hints are given in tab-separated gff format with nine columns:

  • column 1: target sequence
  • column 2: source of the feature
  • column 3: feature
  • column 4: feature start
  • column 5: feature end
  • column 6: score (if applicable)
  • column 7: strand
  • column 8: reading frame
  • column 9: for hints, the grouping and source information; for prediction lines, the gene/transcript identifier
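
The nine-column layout can be read with a few lines of Python. The following is a sketch rather than a full gff parser: the names GffRecord and parse_gff_line are hypothetical, and it assumes that the first eight columns contain no whitespace, so that it works on tab-separated files as well as on the space-aligned example above.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None if the score column is "."
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line):
    """Parse one hint or prediction line; return None for comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Split on any whitespace, at most 8 times, so that the ninth
    # column keeps its internal spaces.
    fields = line.split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
    score = None if fields[5] == "." else float(fields[5])
    return GffRecord(fields[0], fields[1], fields[2], int(fields[3]),
                     int(fields[4]), score, fields[6], fields[7], fields[8])


if __name__ == "__main__":
    example = 'Chr.1\tAUGUSTUS\tCDS\t1691\t2300\t0.86\t+\t1\ttranscript_id "g1.t1"; gene_id "g1";'
    print(parse_gff_line(example))
```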

Files that may optionally be contained in gene prediction archives:

  • *.gtf - gene predictions in gtf format
  • *.aa - gene predictions as protein fasta sequences
  • *.codingseq - gene predictions as CDS DNA fasta sequences
  • *.cdsexons - predicted exons in DNA fasta sequences
  • *.mrna - predicted mRNA sequences (with UTRs) in DNA fasta sequences
  • *.gbrowse - gene prediction track for the GBrowse genome browser

Click here to view a real AUGUSTUS prediction web service output!

It is important that you check the results of an AUGUSTUS gene prediction run. Do not trust predictions blindly! Prediction accuracy depends on the quality of the input sequence, on the quality of the hints and on whether the chosen parameter set fits the species of the supplied genomic sequence.
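
One simple starting point for such a check, when hints were supplied, is the line "% of transcript supported by hints (any source)" that is printed for each transcript, as in the format example above. The sketch below collects these values per gene. The function name is hypothetical, and a low value does not by itself mean that a prediction is wrong.

```python
import re

GENE_START = re.compile(r"^#\s*start gene (\S+)")
HINT_SUPPORT = re.compile(
    r"^#\s*% of transcript supported by hints \(any source\):\s*([\d.]+)")


def hint_support_by_gene(gff_text):
    """Map each gene id to the hint-support percentages of its transcripts."""
    support = {}
    current_gene = None
    for line in gff_text.splitlines():
        match = GENE_START.match(line)
        if match:
            current_gene = match.group(1)
            support.setdefault(current_gene, [])
            continue
        match = HINT_SUPPORT.match(line)
        if match and current_gene is not None:
            support[current_gene].append(float(match.group(1)))
    return support


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for gene, values in hint_support_by_gene(handle.read()).items():
                print(gene, values)
```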




Training results

You find a detailed description of training results by clicking here.




I am not from academia/non-profit. What can I do?

Users who are not from academia or a non-profit organisation, and who are not using our web application for purely personal purposes, have the following options:

  • Run the training web server application with a genome file and an externally created training gene file
  • Run AUGUSTUS predictions ab initio or with an externally created hint file
  • Purchase a BLAT license from http://www.kentinformatics.com/ and run the autoAug Pipeline locally




No submission of personalized human sequence data!

AUGUSTUS is a tool for predicting genes in prokaryotic and eukaryotic sequences. AUGUSTUS has been trained for human (excellent parameter set, please don't retrain on your own), and genes in the human reference assembly have already been predicted with human parameters. Our web server does not adhere to the data security standards that are required by law for processing personalized human genome data. We ask all submitters to confirm the following by a checkbox:
"I am not submitting personalized human sequence data (mandatory)."




Why do I see a running dog when pressing the submission button?

As Loriot said (freely translated): Life without a dog is possible, but pointless. The animation is simply displayed to make the waiting time during job submission more pleasant ;-)
