Help

This website contains short instructions and answers to some frequently asked questions concerning the AUGUSTUS training and prediction web server applications.


For more detailed instructions, please read Training Tutorial and Prediction Tutorial.



Contents

Why do I not get any results?
Why is the server busy?
What is the species name?
Why should I give my e-mail address?
File upload versus web link
Instructions for fasta headers
Which files must or can I submit for training AUGUSTUS?
Which files are required for predicting genes in a new genome?
Genome file
cDNA file
Protein file
Training gene structure file
Hints file
Parameter archive
What is the project identifier?
What does my job status mean?
UTR prediction: yes or no?
Allowed gene structure
What about that data duplication?
Why is the prediction accuracy in the genome of my species not as good as I expected?
What about data privacy and security?
Gene prediction results
Training results
I am not from academia/not non-profit. What can I do?
Why do I see a running dog after pressing the submission button?



Why do I not get any results?




Why is the server busy?

Training AUGUSTUS is a very resource- and time-consuming process. We use a grid engine queuing system with a limited number of waiting slots. If we estimate that the time from job submission to computation start might be very long, our web server might display a message that our server is busy. The submission of new jobs is then disabled (both prediction and training submission). Please wait one or two weeks before you try a new submission. If the problem persists longer than a month, or if your job is urgent, please contact augustus-web@uni-greifswald.de.




What is the species name?

The species name is the name of the species for whose genome you want to train AUGUSTUS. The species name is an obligatory parameter. Since AUGUSTUS training is such a time-consuming process, we would like to know the names of the species for which AUGUSTUS was trained, so that we can make the trained parameters available to the public and others who are interested in the same species do not have to rerun the training process. (We will only publish your parameter set with the next AUGUSTUS release after you have explicitly confirmed via e-mail that you agree to this.)

However, if you do not want to reveal the true species name, you may use any other string shorter than 30 characters as a species name.

The species name is not allowed to contain spaces!




Why should I give my e-mail address?

Unlike many other bioinformatics web services, the AUGUSTUS web server application is not an implementation of a fail-safe procedure. Particularly the assembly of a training gene set from extrinsic data (ESTs and protein sequences) and a genome sequence may not always work perfectly. Our pipeline may issue warnings or errors, and sometimes, we need to get some feedback from you via e-mail in order to figure out what is the problem with your particular input data set.

In addition, training and running AUGUSTUS are rather time-consuming processes that may take up to several weeks (depending on the input data). It may be more convenient to receive an e-mail notification when your job has finished than to check the status page over and over again.

Therefore, we strongly recommend that you enter an e-mail address.

If supplied, we use your e-mail address for the following purposes:

We do not use your e-mail address to send you any spam, e.g. about web service updates. We do not share your e-mail address with any third parties.

Job submission without giving an e-mail address is possible but discouraged.




File upload versus web link

The AUGUSTUS training and prediction web server application offers in some cases two possibilities for transferring files to the server: uploading a file, or specifying a web link to the file.

You cannot do both at the same time! For each file type (e.g. the genome file), you must either select a file on your hard drive or give a web link!




Instructions for fasta headers

We observed that most problems with generating training genes for training AUGUSTUS are caused by fasta headers in the sequence files. Some of the tools in our pipeline will truncate fasta headers if they are too long, contain spaces, or contain special characters. This leads to a lot of warning messages in the AutoAug.err file, and it may also lead to non-unique fasta entry names, which will crash the pipeline. We therefore strongly recommend that you adhere to the following rules for fasta headers when using our web services:

In the following we give some header examples that will not cause problems:

>entry1
>contig1000
>est20
>scaffold239

The following kinds of headers will cause at least warning messages but probably also a pipeline crash:

>contig1 length=1000 Arabidopsis thaliana
>gi|123344545|some_protein|some_species
>Drosophila melanogaster scaffold 10000

If you have a fasta file with unsuitable headers and you do not know how to modify them automatically, you may use the Perl script simplifyFastaHeaders.pl. After saving it on your local Unix system, first check whether the location of Perl in the first line of the script is correct for your system (#!/usr/bin/perl). If Perl is installed in another location, you need to modify that line! Then, execute the script with the following parameters:

perl simplifyFastaHeaders.pl in.fa nameStem out.fa header.map

Why is the simplification of fasta headers not a built in function of the web service? The reason is that we think you should be able to recognize the predictions later on! Gene predictions will be made available in gff format, which contains the sequence name in the first column. Therefore, you should modify the fasta headers yourself, before submitting data to the web service!
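For readers who prefer Python, the renaming step can be sketched as follows. This is not the Perl script simplifyFastaHeaders.pl itself; the function name, the naming scheme (name stem plus a running number) and the tab-separated layout of the map file are assumptions made for illustration.

```python
def simplify_fasta_headers(in_path, name_stem, out_path, map_path):
    """Rename every fasta entry to name_stem + running number.

    Hypothetical helper: writes the renamed sequences to out_path and a
    tab-separated table (new name, original header) to map_path, so that
    predictions can be traced back to the original entries later on.
    Returns the number of entries that were renamed.
    """
    count = 0
    with open(in_path) as fin, open(out_path, "w") as fout, open(map_path, "w") as fmap:
        for line in fin:
            if line.startswith(">"):
                count += 1
                new_name = f"{name_stem}{count}"
                fout.write(f">{new_name}\n")
                fmap.write(f"{new_name}\t{line[1:].rstrip()}\n")
            elif line.strip():
                # Sequence line: copy unchanged, dropping empty lines.
                fout.write(line)
    return count
```

Calling simplify_fasta_headers("in.fa", "contig", "out.fa", "header.map") would turn a header such as ">contig1 length=1000 Arabidopsis thaliana" into ">contig1" and keep the original text in header.map.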




Which files must or can I submit for training AUGUSTUS?

You need to specify

Please consider that training AUGUSTUS is a time- and resource-consuming process. For optimal results, you should specify as much information as possible for a single training run instead of starting the AUGUSTUS training multiple times with different file combinations! If you have a lot of EST data, we recommend that you submit ESTs instead of protein sequences since ESTs will likely allow the generation of a UTR training set.




Which files are required for predicting genes in a new genome?

For predicting genes in a new genome with already trained parameters, you need to specify

You may in addition specify an EST/cDNA file and/or a hints file that will be used as extrinsic evidence for predicting genes.




Genome file

The genome file is an obligatory file for training AUGUSTUS and for making predictions with pre-trained parameters in a new genome. It must contain the genome sequence in (multiple) fasta format. Every header begins with a >. The sequence must be DNA. Allowed sequence characters: A a T t G g C c H h X x R r Y y W w S s M m K k B b V v D d N n. (Internally, AUGUSTUS will interpret everything that is not A a T t C c G g as an N!) Empty lines are not allowed. If they occur, they will automatically be removed by the webserver applications.

Headers must be unique within a file! We recommend that you use short fasta headers. Headers like

>gi|382483733|gb|GZ667513.1|GW667513 SSH_BP_47 Some species
Wicked root cDNA library Some species cDNA clone SSH_BP_47 
similar to Putative NADH-cytochrome B5 reductase, mRNA sequence

are likely to cause a lot of warning messages. An example for a short header created from the too long header above:

>GZ667513.1


Correct file format example:
>Chr.1
CCTCCTCCTGTTTTTCCCTCAATACAACCTCATTGGATTATTCAATTCAC
CATCCTGCCCTTGTTCCTTCCATTATACAGCTGTCTTTGCCCTCTCCTTC
TCTCGCTGGACTGTTCACCAACTCTCAGCCCGCGATCCCAATTTCCAGAC
AACCCATCTTATCAGCTTGGCCACGGCCTCGACCCGAACAGACCGGCGTC
CAGCGAGAAGAGCGTCGCCTCGACGCCTCTGCTTGACCGCACCTTGATGC
TCAAGACTTATCGCGATGCCAAGAAGCGTCTCATCATGTTCGACTACGA
>Chr.2
CGAAACGGGCACCTATACAACGATTGAAACCATTATTCAAGCTCAGCAAG
CGTCTATGCTAGCGGTTATTGCGAGCACTTCAGCGGTTGCTACTACGACT
ACTACTTGATAAATGAAACGGCTATAAAAGAGGCTGGGGCAAAAGTATGT
TAGTTGAAGGGTGACCTGAACGATGAATCGGTCGAATTTTTTATTGGCAG
AGGGAAGGTAGGTTTACTCAATTTAGTTACTTCTAGCCGTTGATTGGAGG
AGCGCAAGCGACGAGGAGGCTCATCGGCCGCCCGCGGAAAGCGTAGTCT
TACACGGAAATCAACGGCGGTGTCATAAGCGAG
>Chr.3
.....
            

The maximal number of scaffolds allowed in a genome file is 250000. If your file contains more scaffolds, please remove all short scaffolds. For training AUGUSTUS short scaffolds are worthless because no complete training genes can be generated from them. In terms of prediction, it is possible to predict genes in short scaffolds. However, those genes will in most cases be incomplete and probably unreliable.
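Before uploading, the rules of this section can be checked locally. The following is a minimal sketch, assuming the genome file is read line by line; the function name and the wording of the messages are hypothetical, and the check covers only what is stated above (allowed characters, unique headers, empty lines, scaffold limit).

```python
GENOME_CHARS = set("ATGCHXRYWSMKBVDN")  # allowed letters, compared case-insensitively
MAX_SCAFFOLDS = 250000                  # limit stated for the web server


def check_genome_fasta(lines):
    """Return a list of problems found in an iterable of fasta lines."""
    problems = []
    seen = set()
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append(f"line {number}: empty line")
        elif line.startswith(">"):
            name = line[1:]
            if name in seen:
                problems.append(f"line {number}: duplicate header '{name}'")
            seen.add(name)
        else:
            bad = set(line.upper()) - GENOME_CHARS
            if bad:
                problems.append(
                    f"line {number}: characters not allowed: {''.join(sorted(bad))}"
                )
    if len(seen) > MAX_SCAFFOLDS:
        problems.append(f"{len(seen)} scaffolds exceed the limit of {MAX_SCAFFOLDS}")
    return problems
```

For example, check_genome_fasta(open("genome.fa")) returns an empty list for a file that follows the rules; genome.fa is a placeholder file name.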




cDNA file

The cDNA file is a multiple fasta DNA file that contains e.g. ESTs or full-length cDNA sequences. Allowed sequence characters: A a T t G g C c H h X x R r Y y W w S s M m K k B b V v D d N n U u. Empty lines are not allowed and will be removed from the submitted file by the webserver application. See Genome file for a format example. Upload of a cDNA file to our web server application will invoke the software BLAT [2], which is on our webserver application only available for academic, personal and non-profit use.




Protein file

The protein file is a multiple fasta file that contains protein sequences as supporting evidence for genes. Allowed sequence characters: A a R r N n D d C c E e Q q G g H h I i L l K k M m F f P p S s T t W w Y y V v B b Z z J j X x. Empty lines are not allowed but will simply be removed from the file by the webserver application.

Correct file format example:

>protein1
maaaafgqlnleepppiwgsrsvdcfekleqigegtygqvymakeiktgeivalkkirmd
neregfpitaireikilkklhhenvihlkeivtspgrdrddqgkpdnnkykggiymvfey
mdhdltgladrpglrftvpqikcymkqlltglhychvnqvlhrdikgsnllidnegnlkl
adfglarsyshdhtgnltnrvitlwyrppelllgatkygp
>protein2
neregfpitaireikilkklhhenvihlkeivtspgrdrddqgkpdnnkykggiymvfey
mdhdltgladrpglrftvpqikcymkqlltglhychvnqv
>protein3
...
            

Submitting a protein file to our AUGUSTUS training web server application will invoke Scipio [3], which uses BLAT [2]. Therefore, protein file upload is only available for academic, personal and non-profit use on our web server application.




Training gene structure file

You can submit your own, externally created training gene structures to the AUGUSTUS training web server application. Regardless of the format, gene structure files are not allowed to contain Java metacharacters like "*" or "?".

Training gene structure files can be submitted in two different formats: Genbank format or gff format.

Training gene structure file in genbank format

Gene structures in genbank format must contain the coding sequence parts and flanking regions. Flanking regions are important because AUGUSTUS is supposed to differentiate between genes and intergenic regions. The length of flanking regions depends on the length of genes in the target genome. In our pipeline, flanking regions are set to the average gene length, with a lower limit of 1000 nt and an upper limit of 10000 nt. It is very important to make sure that the flanking regions do not contain any parts of other protein coding genes, i.e. we recommend trimming flanking regions in a way that excludes other CDS parts.

It is important for our pipeline that the LOCUS names within a submitted training gene structure file are unique, i.e. you should not use the same LOCUS name more than once!
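The flanking region rule can be written as a one-line calculation. This is a sketch of one reading of the rule above (average gene length, limited to the range 1000 to 10000 nt); the function name is hypothetical and the pipeline's exact definition of gene length is not given here.

```python
def flanking_length(gene_lengths, lower=1000, upper=10000):
    """Average gene length in nt, limited to the range [lower, upper]."""
    average = sum(gene_lengths) // len(gene_lengths)
    return max(lower, min(upper, average))
```

With this reading, a gene set with an average length of 4000 nt gets 4000 nt flanking regions, while very short or very long averages fall back to 1000 nt or 10000 nt.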

Correct file format example (condensed view, the three dots represent further lines of sequences):

LOCUS       Chr.1_1-159458   159458 bp  DNA
FEATURES             Location/Qualifiers
     source          1..159458
     CDS             complement(join(2421..2655,3858..4005,4080..4235,5569..5857
                     ,10316..10534,155240..155458))
                     /gene="1474336"
BASE COUNT     49195 a   29117 c  28985 g   49950 t   2211 n
ORIGIN
        1 aaaatacatc acaatacatt taattcactt tccatcatcg agattaacga aaattattta
       61 aaatatcgaa gatgaaaata tcctcaagat gatactgaac ggctaagaaa aatacatcac
      121 acaactttaa ttcattttcc atcatcgaga ttaacgaaaa gaaaaaattt taactcccta
...
   159301 atacgccacc aggtatttcg cctgattgtt cctcgaatat cttctctctc tctatatata
   159361 tatatattac ttggcacgat aatcgtcgaa tcgttattta taaattgctt catctatcgc
   159421 gatatttttg caacaactct cgcttttctc tttccatt
//
LOCUS       Chr.1_313992-323129   9138 bp  DNA
FEATURES             Location/Qualifiers
     source          1..9138
     CDS             join(4001..4048,4989..5138)
                     /gene="194551"
BASE COUNT     2829 a   1502 c  1750 g   2948 t   109 n
ORIGIN
        1 ttttccttct ttcttttttt tttatttaca ttaatgagaa ttttcgcaaa tatttcatcg
       61 ctgccatcct tttttttcct cgacgtcaat cacgcgacac atttgttaga gaaatggatt
      121 ttaatcttga aaaaagaaaa atacaaatgc caacgcattt caaatccttt cctattatta
...
     9001 tcaacgaaac aaataattgc ttcacaaaat atcgcacgta acaacaatat agacttcaat
     9061 attcaacaat tcttttcctt tatacacaaa gatacacaaa atataaaagt tttaatactt
     9121 caacttcaac gaaacagg
//
            

If you want to train UTRs, you have to additionally incorporate mRNA information in your genbank file.

Correct file format example (including UTR training):

LOCUS       scf7180001240730_g20   526 bp  DNA
FEATURES             Location/Qualifiers
     source          1..526
     mRNA            99..125
     CDS             99..99
BASE COUNT     164 a   99 c  68 g   195 t
ORIGIN
        1 gtgacggagc ccaaggacga gcccgtgccc tcagagccca cgtccgacgt gaggcccgcg
       61 ccagcgcccc tcccgccgcc cgtcgcagcc actgcttaga ctttactaat ataaacattg
      121 aaaatatttt gtgttttatt tccaatcatt gaattataat cctattataa tataactaac
      181 attcgtaatt ttacaaaata actatgcaaa ttattttgta ttttcgtttt aaattatact
      241 tttcatataa atttctacaa atcttattca agaccataag tatccgctcg ctctacttcg
      301 ggcatttcct ttatttatat cttatttgac ttattttgat tatttaggct tatgttttcg
      361 atactattga aaacagaaaa taatttcata taattaataa tatattttca attaatatat
      421 ttaacaaata tttgtatagt tcaagcggac aaatccgttc ccatagtatt tatataaatt
      481 ttaatttaga gtaataacag tttgctgtat tgttgtagtc aaatac
//
LOCUS       scf7180001240751_g30   876 bp  DNA
FEATURES             Location/Qualifiers
     source          1..876
     mRNA            complement(401..777)
     CDS             complement(777..777)
BASE COUNT     300 a   136 c  116 g   324 t
ORIGIN
        1 aatgtaggaa aatgaaatat ttatttaaat tgttattatc acttcttcgc tctagtgtct
       61 tggcaaagcg cggcgttgag ttcagcctct cacacgcaat gcctccagaa ttcggcgaaa
      121 tgtgggggac agagtgtatt aacactaagt tccctcagcc acgactggtg aaattatata
      181 ttcagtttgt atactattac tcatgcaaac acttcatcat actttcactc aatcagtaaa
      241 gcataatatt ttatttaata ttgtttatca atactatttc cttgttgtta aatattattt
      301 tatttattat attaaattaa aatgtcaaaa ttaaaagtag gtgatgattt attactatct
      361 tttctatcca agaaaaaaaa gacacactga aacaattgta atttttgtta tgtttttatt
      421 acttaatatt attataaaaa tttgtaaata cgaaataaaa tagatagacg taataatatt
      481 tatttgttag ttaataataa taatgataat tacgaaagat acaagaaata tgcataaatg
      541 agtgttatat tatgtatttt atgagaatat aaatataaaa actgtcattg attatatttt
      601 ctaaatactt tcattttatg gcttgctggc ttttcaattt ccttatgttt cagcttttca
      661 ctcaatagag cgaaaccttc atcgacatgt aagccaatag aacaattaca aactaacttt
      721 attacatcag tcttttcatt tctttaagct tcaggcaaat atcatctaaa tgcctttcaa
      781 ctcgctacta acatcgcgtc gttatataaa tcagtgtata cggaattaaa cctgtcatgt
      841 ctcttgcaag acgtgtctgc tgttgtcacg cacaca
//
            

Training gene structure file in gff format

Training gene structure in gff format must comply with the fasta entry names of the genome file.

In general, gff format must contain the following columns (the columns are separated by tabs):

  1. The sequence names must be found in the fasta headers of sequences in the genome file.
  2. The source tells with which software/process the gene structure was generated (you can fill in whatever you like).
  3. The feature may for AUGUSTUS training be CDS, 5'-UTR or 3'-UTR.
  4. Start is the beginning position of the line's feature, counting the first position of a sequence as position 1.
  5. Stop position, must be at least as large as start position.
  6. The score must be a number but the number is irrelevant to our web server applications.
  7. The strand denotes whether the gene is located on the forward (+) or on the reverse (-) strand.
  8. Frame is the reading frame; it can be denoted as '.' if unknown or irrelevant. For exonpart and exon it is defined as follows: On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) until the next codon boundary comes (0, 1 or 2).
  9. Attribute contains a transcript identifier. All gff-entries belonging to one transcript must contain the same transcript identifier in the last column.

Correct file format example (without UTR):

Chr.1	mySource	CDS	1767	1846	1.000	-	0	transcript_id "1597_1"
Chr.1	mySource	CDS	1666	1709	1.000	-	1	transcript_id "1597_1"
Chr.1	mySource	CDS	1486	1605	1.000	-	2	transcript_id "1597_1"
Chr.1	mySource	CDS	1367	1427	1.000	-	2	transcript_id "1597_1"
Chr.1	mySource	CDS	1266	1319	1.000	-	1	transcript_id "1597_1"
Chr.1	mySource	CDS	1145	1181	1.000	-	1	transcript_id "1597_1"
Chr.1	mySource	CDS	847	1047	1.000	-	0	transcript_id "1597_1"
Chr.2	mySource	CDS	9471	9532	1.000	+	0	transcript_id "1399_2"
Chr.2	mySource	CDS	9591	9832	1.000	+	1	transcript_id "1399_2"
Chr.2	mySource	CDS	9885	10307	1.000	+	2	transcript_id "1399_2"
Chr.2	mySource	CDS	10358	10507	1.000	+	2	transcript_id "1399_2"
Chr.2	mySource	CDS	10564	10643	1.000	+	2	transcript_id "1399_2"
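The frame column of the example above can be reproduced with a short calculation. The sketch below assumes that all CDS segments of one transcript are given and that the first segment (in transcription order) starts with a complete codon; the function name is hypothetical.

```python
def cds_frames(segments, strand):
    """Compute the frame of each CDS segment of one transcript.

    segments: list of (start, stop) pairs, 1-based and inclusive.
    strand:   "+" or "-".
    Returns a dict {(start, stop): frame} with frame in {0, 1, 2}.
    """
    # Transcription order: ascending coordinates on +, descending on -.
    ordered = sorted(segments, reverse=(strand == "-"))
    frames = {}
    coding_bases = 0  # coding bases seen before the current segment
    for start, stop in ordered:
        # Bases to skip until the next codon boundary.
        frames[(start, stop)] = (3 - coding_bases % 3) % 3
        coding_bases += stop - start + 1
    return frames
```

For transcript 1399_2 on Chr.2, the first segment 9471..9532 has 62 bases, so the next segment 9591..9832 starts 1 base before a codon boundary, which matches the frame 1 given in the example.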

Correct file format example (with UTR):

Chr.1	mySource	5'-UTR	277153	277220	45	+	.	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	277221	277238	1	+	0	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	278100	278213	1	+	0	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	278977	279169	1	+	0	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	279630	279648	0.94	+	2	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	279734	279768	0.94	+	1	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	CDS	280307	280344	1	+	2	transcript_id "g22472.t1"; gene_id "g22472";
Chr.1	mySource	3'-UTR	280345	280405	78	+	.	transcript_id "g22472.t1"; gene_id "g22472";
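A training gene structure file in gff format can be checked line by line against the column rules listed above. This is a minimal sketch; the function name is hypothetical, and genome_names stands for the set of fasta entry names of your genome file.

```python
TRAINING_FEATURES = {"CDS", "5'-UTR", "3'-UTR"}


def check_training_gff_line(line, genome_names):
    """Return a list of problems for one gff line (empty list if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]
    seq, _source, feature, start, stop, score, strand, frame, attribute = cols
    problems = []
    if seq not in genome_names:
        problems.append(f"sequence name '{seq}' not found in genome file")
    if feature not in TRAINING_FEATURES:
        problems.append(f"feature '{feature}' is not CDS, 5'-UTR or 3'-UTR")
    if not (start.isdigit() and stop.isdigit() and 1 <= int(start) <= int(stop)):
        problems.append("start and stop must be integers with 1 <= start <= stop")
    try:
        float(score)
    except ValueError:
        problems.append("score must be a number")
    if strand not in {"+", "-"}:
        problems.append("strand must be + or -")
    if frame not in {"0", "1", "2", "."}:
        problems.append("frame must be 0, 1, 2 or '.'")
    if "transcript_id" not in attribute:
        problems.append("attribute must contain a transcript identifier")
    return problems
```

Running the function over every line of the file and printing the non-empty results gives a quick overview of formatting problems before submission.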




Hints file

For the gene prediction web server application, it is possible to submit an externally created file that contains extrinsic evidence for gene structures in gff format.

In general, gff format must contain the following columns (the columns are separated by tabs):

  1. The sequence names must be found in the fasta headers of sequences in the genome file.
  2. The source tells with which software/process the gene structure was generated (you can fill in whatever you like).
  3. The feature may for AUGUSTUS gene prediction be
    • start - translation start, specifies an interval that contains the start codon. The interval can be larger than 3 nucleotides, in which case every ATG in the interval gets a bonus.
    • stop - translation end (stop codon)
    • tss - transcription start site
    • tts - transcription termination site
    • ass - acceptor (3') splice site, the last intron position
    • dss - donor (5') splice site, the first intron position
    • exonpart - part of an exon in the biological sense.
    • exon - complete exon in the biological sense.
    • intronpart - introns both between coding and non-coding exons.
    • intron - complete intron in the biological sense
    • CDSpart - part of the coding part of an exon. (CDS = coding sequence)
    • CDS - coding part of an exon with exact boundaries. For internal exons of a multi exon gene this is identical to the biological boundaries of the exon. For the first and the last coding exon the boundaries are the boundaries of the coding sequence (start, stop).
    • UTRpart - The hint interval must be included in the UTR part of an exon.
    • UTR - exact boundaries of a UTR exon or the untranslated part of a partially coding exon.
    • irpart - intergenic region part. The bonus applies to every base of the intergenic region. If UTR prediction is turned on (--UTR=on) then UTR is considered genic.
    • nonexonpart - intergenic region or intron.
    • genicpart - everything that is not intergenic region, i.e. intron or exon or UTR if applicable.
  4. Start is the beginning position of the line's feature, counting the first position of a sequence as position 1.
  5. Stop position, must be at least as large as start position.
  6. The score must be a number but the number is irrelevant to our web server applications.
  7. The strand denotes whether the gene is located on the forward (+) or on the reverse (-) strand.
  8. Frame is the reading frame; it can be denoted as '.' if unknown or irrelevant. For exonpart and exon it is defined as follows: On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) until the next codon boundary comes (0, 1 or 2).
  9. For usage as hint, Attribute must contain the string source=M (for manual). Other sources, such as EST or protein, are possible, but only in the command line version of AUGUSTUS. Source types other than M are ignored by AUGUSTUS web server applications.


Correct format example:

HS04636 anchor  exonpart        500     506     0       -       .       source=M
HS04636 anchor  exon            966     1017    0       +       0       source=M
HS04636 anchor  start           966     968     0       +       0       source=M
HS04636 anchor  dss             2199    2199    0       +       .       source=M
HS04636 anchor  stop            7631    7633    0       +       0       source=M
HS04636 anchor  intronpart      7631    7633    0       +       0       source=M
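Since the web server ignores hints whose source is not M, it can be useful to see in advance which lines of a hints file will actually be used. The following sketch separates usable lines from the rest; the function name is hypothetical, and the check is limited to the feature names and the source=M rule described above.

```python
HINT_FEATURES = {
    "start", "stop", "tss", "tts", "ass", "dss",
    "exonpart", "exon", "intronpart", "intron",
    "CDSpart", "CDS", "UTRpart", "UTR",
    "irpart", "nonexonpart", "genicpart",
}


def split_hints(lines):
    """Split hint lines into (kept, ignored).

    A line is kept if it has 9 tab-separated columns, a known feature
    in column 3 and source=M in the attribute column.
    """
    kept, ignored = [], []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        usable = (
            len(cols) == 9
            and cols[2] in HINT_FEATURES
            and "source=M" in [part.strip() for part in cols[8].split(";")]
        )
        (kept if usable else ignored).append(line)
    return kept, ignored
```

Lines that end up in the second list would need source=M (and a valid feature name) to have any effect on the web server.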
            



Parameter archive

A *.tar.gz archive with a folder containing the following files is required for predicting genes in a new genome with pre-trained parameters:

where species is replaced by the name of the species you trained AUGUSTUS for (e.g. carrot would result in carrot/carrot_parameters.cfg). The additional species before the slash means that all those files must reside in a directory that is called species (or in our example: carrot) before you tar and gzip it. If you simply tar and gzip the folder that contains the parameters of an AUGUSTUS training run, everything should work fine.
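The required folder layout can also be produced with Python's tarfile module. This is a minimal sketch; the function name and the archive file name are hypothetical, and the species directory is assumed to already contain the parameter files of a training run.

```python
import os
import tarfile


def pack_parameters(species_dir, archive_path):
    """Pack species_dir into a .tar.gz with the species folder on top level.

    Returns the species name taken from the directory name, e.g. 'carrot'
    for a directory '/some/path/carrot'.
    """
    species = os.path.basename(os.path.normpath(species_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the species folder, not the full local path.
        tar.add(species_dir, arcname=species)
    return species
```

For the carrot example, pack_parameters("/some/path/carrot", "parameters.tar.gz") creates an archive in which the file appears as carrot/carrot_parameters.cfg.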




What is the project identifier?

If you trained AUGUSTUS on this webserver, you may, instead of uploading a parameter archive, simply specify the project identifier of this training run. You find the project identifier, for example, in the subject line of your training confirmation e-mail, where it says Your AUGUSTUS training job project_id. Project identifiers typically consist of the letters pred or train, followed by a random string of 8 characters, resulting in for example train345kljD4.




What does my job status mean?

In the beginning, the status page will display that your job has been submitted. This means that the web server application is currently uploading your files and validating file formats. After a while, the status will change to waiting for execution. This means that all file formats have been confirmed and an AUGUSTUS training job has been submitted to our grid engine, but the job is still pending. Depending on the length of the waiting queue, this status may persist for a while. Please contact us in case your job is pending for more than one month. Later, the job status will change to computing. This means the job is currently being computed. When the page displays finished, all computations have been finished and a website with your job's results has been generated.

If you specified an e-mail address, you will receive an e-mail with the link to the results of your job when computations are finished.




UTR prediction: yes or no?

Predicting UTRs takes significantly more time, but in addition to reporting UTRs, AUGUSTUS is usually also a little more accurate on the coding regions when ESTs are given as extrinsic evidence.

UTR prediction is only possible if UTR parameter files exist for your species. Even if UTR parameter files exist for a species, you should make sure that they are species specific, i.e. have actually been optimized for your target species. It is a waste of time to predict UTRs with general (template) parameters.

UTR prediction is only supported in combination with the following two gene structure constraints:

UTR prediction is not possible in combination with the gene structure constraints:

If no UTR parameter files exist for your species but you enable UTR prediction in the form, the web server application will overrule this choice and simply not predict any UTRs.

Species for which UTR parameters are available:




Allowed gene structure

Predict any number of (possibly partial) genes: This option is set by default. AUGUSTUS may predict no gene at all, one or more genes. The genes at the boundaries of the input sequence may be partial. Partial here means that not all of the exons of a gene are contained in the input sequence, but it is assumed that the sequence starts or ends in a non-coding region.

Predict only complete genes: AUGUSTUS assumes that the input sequence does not start or end within a gene. Zero or more complete genes are predicted.

Predict only complete genes - at least one: As the previous option, but AUGUSTUS predicts at least one gene (if possible).

Predict exactly one complete gene: AUGUSTUS assumes that the sequence contains exactly one complete gene. Note: This feature does not work properly in combination with alternative transcripts.

Ignore conflicts with other strand: By default AUGUSTUS assumes that no genes - even on opposite strands - overlap. Indeed, this usually is the case but sometimes an intron contains a gene on the opposite strand. In this case, or when AUGUSTUS makes a false prediction on the one strand because it falsely thinks there is a conflicting gene on the other strand, AUGUSTUS should be run with this option set. It then predicts the genes on each strand separately and independently. This may lead to more false positive predictions, though.




What about that data duplication?

We are trying to avoid data duplication. If you submitted some data that was already submitted before, by you or by somebody else, we will display a link to the previously submitted job.




Why is the prediction accuracy in the genome of my species not as good as I expected?

Gene prediction accuracy of AUGUSTUS in the genome of a certain species depends on the quality of training genes that were used for optimizing species specific parameters. The pipeline behind our AUGUSTUS training web server application offers a fully automated way of generating training genes, but it does not replace manual quality checks on the training genes that are often needed for improving the training gene set quality.

In order to improve accuracy, you could manually inspect the generated training genes and select a trustworthy subset and try retraining AUGUSTUS with this subset. It also helps to compare the training gene set to other sources of evidence that are not supported by our web server application, e.g. RNA-seq data.




What about data privacy and security?

The results of your job submission (i.e. in case of the training web server application that means log files, trained parameters, training genes, ab initio gene predictions and gene prediction with hints; or in case of the prediction web server application the augustus prediction archive) are publicly available. The link to your job status contains a long, pseudo-random string (uuid), and one needs to guess the string in order to get access to the results - but this is not particularly secure!

Other users who submit exactly the same input files as have been submitted before, will be redirected to the results page of the previously submitted job. They do not need to guess the link.

Files that you upload to our server, e.g. sequence files, are not directly made available to anyone. However, if you choose to upload a file via http/ftp link, the link to your file is displayed on the job status page.

We are interested in redistributing high quality parameter sets for novel species with the AUGUSTUS release. We will not do so without your explicit permission.

Our server logs e-mail addresses, IP addresses and all job submission details. We store this data for a limited time in order to be able to trace back errors or, for example, to contact you about permission to publish parameter sets. By submitting a job, you agree that we log this data.

Please contact augustus-web@uni-greifswald.de if your particular job requires a more secure environment, e.g. as part of a collaboration.




Prediction results

After job computations have finished, you will receive an e-mail (if you supplied an e-mail address). The job status web page may at this point in time look similar to this:

[Image: example of a results page]

This page should contain the file augustus.tar.gz. Please right-click on the link and select "Save As" (or similar) to save the file on your local hard drive.

augustus.tar.gz is a gene prediction archive and its content depends on the input file combination. You can unpack the archive by typing tar -xzvf *.tar.gz into your shell. (You find more information about the software tar at the GNU tar website.)

Files that are always contained in gene prediction archives:


Format example AUGUSTUS prediction gff file:
# This output was generated with AUGUSTUS (version 2.6).
# AUGUSTUS is a gene prediction tool for eukaryotes written by Mario Stanke (mario.stanke@uni-greifswald.de)
# and Oliver Keller (keller@cs.uni-goettingen.de).
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# reading in the file /var/tmp/augustus/AUG-1855139717/hints.gff ...
# Setting 1group1gene for E.
# Sources of extrinsic information: M E 
# Have extrinsic information about 1 sequences (in the specified range). 
# Initialising the parameters ...
# human version. Use default transition matrix.
# Looks like /var/tmp/augustus/AUG-1855139717/input.fa is in fasta format.
# We have hints for 1 sequence and for 1 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 6483, name = HSACKI10) -----
#
# Delete group HintGroup , 5803-5803, mult= 1, priority= -1 1 features
# Forced unstranded hint group to the only possible strand for 3 groups.
# Deleted 1 groups because some hint was not satisfiable.
# Constraints/Hints:
HSACKI10	anchor	start	182	184	0	+	.	src=M
HSACKI10	anchor	stop	3058	3060	0	+	.	src=M
HSACKI10	anchor	dss	4211	4211	0	+	.	src=M
HSACKI10	b2h	ep	1701	2075	0	.	.	grp=154723761;pri=4;src=E
HSACKI10	b2h	ep	1716	2300	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	ep	1908	2300	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	ep	3592	3593	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	ep	3836	3940	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	ep	5326	5499	0	+	.	grp=27937842;pri=4;src=E
HSACKI10	b2h	ep	5805	6157	0	+	.	grp=27937842;pri=4;src=E
HSACKI10	b2h	exon	3142	3224	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	exon	3142	3224	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	exon	3592	3748	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	anchor	intronpart	5000	5100	0	+	.	src=M
HSACKI10	b2h	intron	2301	3141	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	intron	2301	3141	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	intron	3225	3591	0	+	.	grp=13907559;pri=4;src=E
HSACKI10	b2h	intron	3225	3591	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	intron	3749	3835	0	+	.	grp=154736078;pri=4;src=E
HSACKI10	b2h	intron	5500	5804	0	+	.	grp=27937842;pri=4;src=E
HSACKI10	anchor	CDS	6194	6316	0	-	0	src=M
HSACKI10	anchor	CDSpart	5900	6000	0	+	.	src=M
# Predicted genes for sequence number 1 on both strands
# start gene g1
HSACKI10	AUGUSTUS	gene	182	3060	0.63	+	.	g1
HSACKI10	AUGUSTUS	transcript	182	3060	0.63	+	.	g1.t1
HSACKI10	AUGUSTUS	start_codon	182	184	.	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	initial	182	225	1	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	internal	1691	2300	0.86	+	1	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	terminal	3049	3060	0.74	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	CDS	182	225	1	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	CDS	1691	2300	0.86	+	1	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	CDS	3049	3060	0.74	+	0	transcript_id "g1.t1"; gene_id "g1";
HSACKI10	AUGUSTUS	stop_codon	3058	3060	.	+	0	transcript_id "g1.t1"; gene_id "g1";
# coding sequence = [atgatgaaaccctgtctctaccaaaaagacaaaaaattagccagctcaagcaagcactactcttcctcccgcagtggag
# gaggaggaggaggaggaggatgtggaggaggaggaggagtgtcatccctaagaatttctagcagcaaaggctcccttggtggaggatttagctcaggg
# gggttcagtggtggctcttttagccgtgggagctctggtgggggatgctttgggggctcatcaggtggctatggaggattaggaggttttggtggagg
# tagctttcatggaagctatggaagtagcagctttggtgggagttatggaggcagctttggagggggcaatttcggaggtggcagctttggtgggggca
# gctttggtggaggcggctttggtggaggcggctttggaggaggctttggtggtggatttggaggagatggtggccttctctctggaaatgaaaaagta
# accatgcagaatctgaatgaccgcctggcttcctacttggacaaagttcgggctctggaagaatcaaactatgagctggaaggcaaaatcaaggagtg
# gtatgaaaagcatggcaactcacatcagggggagcctcgtgactacagcaaatactacaaaaccatcgatgaccttaaaaatcagagaacaacataa]
# protein sequence = [MMKPCLYQKDKKLASSSKHYSSSRSGGGGGGGGCGGGGGVSSLRISSSKGSLGGGFSSGGFSGGSFSRGSSGGGCFGG
# SSGGYGGLGGFGGGSFHGSYGSSSFGGSYGGSFGGGNFGGGSFGGGSFGGGGFGGGGFGGGFGGGFGGDGGLLSGNEKVTMQNLNDRLASYLDKVRAL
# EESNYELEGKIKEWYEKHGNSHQGEPRDYSKYYKTIDDLKNQRTT]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 20
# CDS exons: 1/3
#      E:   1 
# CDS introns: 0/2
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 5
#      E:   3 (gi|154723761,gi|13907559,gi|154736078)
#      M:   2 
# end gene g1
###         

Different kinds of information are printed after the hash signs, e.g. the AUGUSTUS version and parameter set that were used, and the predicted coding sequence and amino acid sequence. Predictions and hints are given in tab-separated gff format: the first column contains the target sequence, the second column the source of the feature, the third column the feature, the fourth column the feature start, the fifth column the feature end, the sixth column a score (if applicable), the seventh column the strand, and the eighth column the reading frame. The ninth column contains the grouping and source information for hints, or the gene/transcript identifier for prediction lines.
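The nine-column layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not part of AUGUSTUS itself; the names GffRecord, parse_gff and is_prediction are hypothetical. It assumes that prediction lines can be recognised by the value AUGUSTUS in the source column, as in the example output, while hint lines carry other values such as anchor or b2h.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str            # column 1: target sequence
    source: str             # column 2: source of the feature
    feature: str            # column 3: feature
    start: int              # column 4: feature start
    end: int                # column 5: feature end
    score: Optional[float]  # column 6: score, None if given as "."
    strand: str             # column 7: strand
    frame: str              # column 8: reading frame
    attributes: str         # column 9: grouping/source or gene/transcript id


def parse_gff(text):
    """Return one GffRecord per nine-column line, skipping '#' comment lines."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # ignore anything that is not a nine-column gff line
        score = None if cols[5] == "." else float(cols[5])
        records.append(GffRecord(cols[0], cols[1], cols[2], int(cols[3]),
                                 int(cols[4]), score, cols[6], cols[7], cols[8]))
    return records


def is_prediction(record):
    """Assumption: prediction lines have 'AUGUSTUS' in the source column."""
    return record.source == "AUGUSTUS"


# Two data lines and one comment line taken from the example output above.
sample = (
    "# start gene g1\n"
    "HSACKI10\tb2h\tep\t1701\t2075\t0\t.\t.\tgrp=154723761;pri=4;src=E\n"
    "HSACKI10\tAUGUSTUS\tCDS\t1691\t2300\t0.86\t+\t1\t"
    "transcript_id \"g1.t1\"; gene_id \"g1\";\n"
)
for rec in parse_gff(sample):
    print(rec.feature, rec.start, rec.end, rec.strand, is_prediction(rec))
```

Keeping the ninth column as an unparsed string is deliberate, because hints and predictions use different notations there.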

Files that may optionally be contained in gene prediction archives:

Click here to view a real AUGUSTUS prediction web service output!

It is important that you check the results of an AUGUSTUS gene prediction run. Do not trust predictions blindly! Prediction accuracy depends on the quality of the input sequence, on the quality of the hints, and on whether the chosen parameter set fits the species of the supplied genomic sequence.
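One rough automated check, among many possible ones, is to look at the coding sequences that are printed in the comment lines of the gff file. The following Python sketch is only an illustration and does not replace a biological evaluation of the predictions. The function names are hypothetical, and the checks assume complete genes and the standard genetic code (stop codons taa, tag, tga); predictions that are incomplete at a sequence boundary may be reported even though they are legitimate.

```python
import re

STOP_CODONS = {"taa", "tag", "tga"}  # assumption: standard genetic code


def extract_bracketed(text, label):
    """Collect each '# <label> = [...]' block, joining its continuation lines."""
    pattern = re.compile(r"^# " + re.escape(label) + r" = \[(.*?)\]",
                         re.DOTALL | re.MULTILINE)
    # Continuation lines start with '# ', so hashes and whitespace are removed.
    return [re.sub(r"[#\s]", "", m.group(1)) for m in pattern.finditer(text)]


def check_coding_sequence(cds):
    """Return a list of problems found; an empty list means none were found."""
    cds = cds.lower()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("atg"):
        problems.append("does not start with atg")
    if cds[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(cds[i:i + 3] in STOP_CODONS for i in range(0, len(cds) - 3, 3)):
        problems.append("contains an in-frame stop codon before the end")
    return problems


# Usage with a shortened, made-up comment block in the same layout:
example = "# coding sequence = [atgaaa\n# ccctaa]\n# protein sequence = [MKP]\n"
for cds in extract_bracketed(example, "coding sequence"):
    print(len(cds), check_coding_sequence(cds))
```

A transcript that passes these checks can still be wrong, so comparing the predictions with independent evidence remains necessary.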




Training results

You can find a detailed description of the training results by clicking here. To view a sample output, click here!




I am not from academia/non-profit. What can I do?

Users who are not from academia or a non-profit organisation, and who are not using our web application for personal purposes only, have the following options:




Why do I see a running dog when pressing the submission button?

As Loriot said (freely translated): "Life without a dog is possible, but pointless." The animation is simply displayed to make the waiting time during job submission more pleasant ;-)




CONTACT
Institute for Mathematics and Computer Sciences
Bioinformatics Group
Walther-Rathenau-Straße 47
17487 Greifswald
Germany
Tel.: +49 (0)3834 86 - 46 24
Fax: +49 (0)3834 86 - 46 40

augustus-web@uni-greifswald.de