Input Sequence Format:

The input is one or more DNA sequences in (multiple) FASTA format. Example:

>name of sequence 1
agtgctgcatgctagctagct
cttgcatgctactgtgcata
>name of sequence 2
gtgctngcatgctagctagctggtgtnntgaaaaatt
...

Every letter other than a, c, g, t, A, C, G and T is interpreted as an unknown base. Digits and whitespace are ignored. The number of characters per line is not restricted. There is no restriction on the number of sequences, but the total sequence size must be below 8MB.
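The rules above can be sketched in a few lines of Python. This is only an illustration, not the server's own parser: the function names are hypothetical, writing unknown bases as 'n' and lower-casing known bases are choices made here, and "8MB" is read as 8 * 2**20 characters, which is an assumption.

```python
MAX_TOTAL_BASES = 8 * 1024 * 1024  # assumption: "8MB" read as 8 * 2**20 characters


def parse_fasta(text):
    """Parse (multiple) FASTA text into a dict mapping name -> normalized sequence."""
    sequences = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            sequences[name] = []
        elif name is not None:
            for ch in line:
                if ch.isdigit() or ch.isspace():
                    continue  # digits and whitespace are ignored
                # any letter other than a, c, g, t (either case) is an unknown base
                sequences[name].append(ch.lower() if ch in "acgtACGT" else "n")
    return {n: "".join(chars) for n, chars in sequences.items()}


def total_size_ok(sequences):
    """True if the summed sequence length is below the assumed 8MB limit."""
    return sum(len(s) for s in sequences.values()) < MAX_TOTAL_BASES
```

For instance, a line such as "agt gc1" contributes the five bases agtgc, and a letter such as X is stored as n.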

Expert Options:

cDNA sequences (ESTs, mRNAs):

In this section you can upload a set of ESTs and/or mRNA sequences in FASTA format that might match the query DNA sequence. Sanger ESTs and pyrosequencing (454) reads are both OK. cDNA from another species is OK, too, but the species should be so closely related that the sequence identity in coding regions is above 90 percent. The cDNA sequences are first aligned to the query sequence(s) using BLAT, and from those alignments hints are generated for AUGUSTUS. AUGUSTUS will use the information to predict a gene structure in agreement with the cDNA alignment data. It will try to obey the hints and will also tend to predict fewer exons that are not supported by cDNA. This feature is only for personal, academic, and non-profit use, as required by the BLAT license.

Note: The quality and abundance of cDNA data vary a lot. Therefore, one parameter set for the hints cannot fit all application settings well. For example, when using full-length mRNA the hints should be more 'convincing' than when using ESTs. Consider downloading AUGUSTUS and adjusting the hints parameters for your application setting.

constraints that anchor the prediction:

This option allows the user to force AUGUSTUS to predict an exon, a splice site, a translation start or a translation end point at a certain position in the sequence. We call this a constraint or an anchor. The number and the order of the constraints are arbitrary.
The supported constraints are:

constraint type	meaning
start	The start codon at the translation start (requires ATG)
stop	The stop codon at the translation stop (requires TAA, TAG or TGA)
ass	Acceptor splice site, the last (most 3') position of an intron (requires AG consensus)
dss	Donor splice site, the first (most 5') position of an intron (requires GT consensus)
exonpart	An interval that is coding, i.e. it is contained in an exon (may exactly be an exon). May also be a single base.
exon	An interval that is an exon (an initial, internal, terminal or single exon).
intronpart	An interval that is contained in an intron. May be a single base.

These constraints can either be uploaded in a file or typed or pasted into the input area. The constraints must be in GFF format, as in the following example:

HS04636	anchor	exonpart	500	506	0	-	.	source=M
HS04636	anchor	exon	        966	1017	0	+	0	source=M
HS04636	anchor	start	        966	968	0	+	0	source=M
HS04636	anchor	dss	        2199	2199	0	+	.	source=M
HS04636	anchor	stop	        7631	7633	0	+	0	source=M
HS04636	anchor	intronpart	7631	7633	0	+	0	source=M

The columns are separated by tabs. As some browsers do not allow entering a tab in a web form, you can substitute commas for the tabs when you choose to enter the constraints manually in the text area, provided JavaScript is activated. E.g.

HS04636, me, exonpart, 500, 506, 0, -, ., source=M
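If you prefer to prepare a tab-separated file yourself, the substitution is easy to reproduce. The following is a minimal sketch with a hypothetical function name; it is not the JavaScript used by the web form, which is not shown here.

```python
def commas_to_tabs(line):
    """Turn one comma-separated constraint line into a tab-separated GFF line."""
    # Split into at most 9 columns so that commas inside the attribute survive.
    fields = [f.strip() for f in line.split(",", 8)]
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    return "\t".join(fields)
```

Applied to the line above, this yields the nine columns HS04636, me, exonpart, 500, 506, 0, -, . and source=M separated by single tabs.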

The columns of this format have the following meaning (from left to right).

1. The name of the sequence as specified in the FASTA format sequence input file. It is possible to input several sequences at the same time. Then each can have its own constraints.
2. A name for the origin of the constraint. Use 'anchor' to enable display of constraints in Gbrowse together with predictions.
3. Constraint type. One of start, stop, exonpart, exon, dss, ass, intronpart.
4. Begin position. Start counting with 1 at the first position of the input sequence. For dss and ass the begin and end positions are equal and must be the intron position directly adjacent to the splice boundary. For start and stop the begin and end positions specify the first and last base of the start or stop codon, respectively, i.e. end = begin + 2.
5. End position. Must be at least as large as the begin position. See also begin position.
6. Score. A number; it is irrelevant here.
7. Strand. Must be '+' or '-'. For exon, exonpart and intronpart you can set a '.' if you want to allow both the forward and reverse strand.
8. Reading frame. Can be a '.' if unknown or irrelevant. For exonpart and exon this is as defined in the GFF format. On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) until the next codon boundary comes (0, 1 or 2).
9. Attribute. Must contain the string source=M. M stands for manual. Other sources, such as EST or protein homology, are possible, but only in the command-line version of AUGUSTUS; hints from those sources may be ignored.
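Before submitting, it can help to check a constraint file against the column rules listed above. The sketch below is a hypothetical helper, not part of AUGUSTUS; it only encodes the rules stated on this page and strips padding spaces around each column.

```python
TYPES = {"start", "stop", "exonpart", "exon", "dss", "ass", "intronpart"}


def check_constraint(line):
    """Return a list of problems found in one tab-separated constraint line."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 9:
        return ["expected 9 tab-separated columns, got %d" % len(fields)]
    seqname, source, ctype, begin, end, score, strand, frame, attribute = fields
    problems = []
    if ctype not in TYPES:
        problems.append("unknown constraint type: " + ctype)
    try:
        begin, end = int(begin), int(end)
    except ValueError:
        return problems + ["begin and end must be integers"]
    if begin < 1:
        problems.append("positions are counted from 1")
    if end < begin:
        problems.append("end must be at least as large as begin")
    if ctype in ("dss", "ass") and begin != end:
        problems.append("dss/ass need begin == end")
    if ctype in ("start", "stop") and end != begin + 2:
        problems.append("start/stop need end == begin + 2")
    # '.' (both strands) is only allowed for exon, exonpart and intronpart
    both_ok = ctype in ("exon", "exonpart", "intronpart")
    if strand not in (("+", "-", ".") if both_ok else ("+", "-")):
        problems.append("invalid strand for " + ctype + ": " + strand)
    if frame not in (".", "0", "1", "2"):
        problems.append("frame must be '.', 0, 1 or 2")
    if "source=M" not in attribute:
        problems.append("attribute must contain source=M")
    return problems
```

An empty list means that no rule from the column description is violated; it does not guarantee that AUGUSTUS can obey the constraint.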

Examples:

The constraint

HS04636	anchor	exonpart	500	500	0	.	.	source=M

enforces a gene structure on HS04636 in which the 500th base is coding. Either a forward strand or a reverse strand exon will contain that base.

The constraint

HS04636	anchor	exonpart	500	510	0	+	0	source=M

enforces a gene structure on HS04636 in which the 11 bases from 500 to 510 are coding on the forward strand. There is a codon boundary between positions 499 and 500.
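The reading frame column determines where the codon boundary lies. A small sketch of the definition given in the column description follows; the function names are hypothetical, and frame_from_codon_start assumes that the known codon start lies in the same forward-strand exon as the begin position.

```python
def codon_boundary(begin, end, strand, frame):
    """Return the two adjacent positions between which the codon boundary lies."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    if strand == "+":
        # 'frame' bases after (begin - 1) come before the boundary
        return (begin - 1 + frame, begin + frame)
    if strand == "-":
        # 'frame' bases before (end + 1) come before the boundary
        return (end - frame, end + 1 - frame)
    raise ValueError("strand must be '+' or '-'")


def frame_from_codon_start(begin, codon_start):
    """Forward-strand frame of an interval starting at 'begin', given any
    position known to be the first base of a codon in the same exon."""
    return (codon_start - begin) % 3
```

For the constraint above, codon_boundary(500, 510, "+", 0) gives (499, 500), in agreement with the text.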

The constraint set

HS04636	anchor	dss	        1000	1000	0	+	.	source=M
HS04636	anchor	intronpart	1000	2000	0	+	.	source=M
HS04636	anchor	ass	        2000	2000	0	+	.	source=M

enforces a gene structure on HS04636 with an intron on the forward strand from position 1000 to 2000.

The constraint set

HS04636	anchor	start	        1000	1002	0	+	.	source=M
HS04636	anchor	exonpart	1000	1100	0	+	.	source=M
HS04636	anchor	intronpart	1200	1200	0	+	.	source=M

enforces a gene structure on HS04636 where a gene starts at 1000 with an exon that is at least 101 bp long but at most 200 bp long. Position 1200 must be in a forward strand intron.

The constraint set

seqname	anchor	dss	        1001	1001	0	+	.	source=M
seqname	anchor	intronpart      1001	1199	0	+	.	source=M
seqname	anchor	exon	        1200	1300	0	+	.	source=M
seqname	anchor	intronpart      1301	1399	0	+	.	source=M
seqname	anchor	exon	        1400	1500	0	+	.	source=M
seqname	anchor	intronpart      1501	1699	0	+	.	source=M
seqname	anchor	ass	        1699	1699	0	+	.	source=M

pins down an internal part of the gene structure. The predicted gene structure must contain an exon sequence a..1000, 1200..1300, 1400..1500, 1700..b for some variable positions a and b.
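How the exon chain follows from such a constraint set can be made explicit with a short sketch. It is a hypothetical helper that handles only this forward-strand pattern (a dss, then exons separated by intronparts, then an ass); the open ends are represented by the placeholder strings "a" and "b".

```python
def implied_exon_chain(constraints):
    """constraints: list of (type, begin, end) on the forward strand.
    Returns the exon intervals implied by dss, exon and ass constraints."""
    chain = []
    for ctype, begin, end in sorted(constraints, key=lambda c: (c[1], c[2])):
        if ctype == "dss":
            chain.append(("a", begin - 1))  # the exon ends just before the donor site
        elif ctype == "exon":
            chain.append((begin, end))
        elif ctype == "ass":
            chain.append((end + 1, "b"))  # the exon starts just after the acceptor site
    return chain


example = [
    ("dss", 1001, 1001), ("intronpart", 1001, 1199), ("exon", 1200, 1300),
    ("intronpart", 1301, 1399), ("exon", 1400, 1500),
    ("intronpart", 1501, 1699), ("ass", 1699, 1699),
]
print(implied_exon_chain(example))
```

The intronpart constraints do not add exons to the chain; they only ensure that the gaps between the listed exons are intronic.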

With the term gene structure I refer to any meaningful sequence of exons, introns, and intergenic regions. This includes the possibilities of having no genes at all and of having multiple genes. AUGUSTUS tries to predict a gene structure that

is (biologically) consistent in the following way:
- No exon contains an in-frame stop codon.
- The splice sites obey the GT-AG consensus. All complete genes start with ATG and end with a stop codon.
- Each gene ends before another gene starts.
- The lengths of single exons and introns exceed a minimal length (species dependent).
and
obeys all user constraints.

Among all gene structures that are consistent and that obey all user constraints, AUGUSTUS finds the most likely gene structure. A user constraint may contradict the biological consistency, for example a donor splice site where there is no GT in the sequence. Another example is an exonpart hint that cannot be obeyed because there are stop codons in all 3 frames before the next possible exon boundaries. If no consistent gene structure is possible that classifies a given base as coding, then the constraint is ignored. Also, if two or more user constraints contradict each other, then AUGUSTUS obeys only the constraint that fits the model better.
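Some contradictions of this kind can be spotted before submission by comparing start, stop, dss and ass constraints with the sequence itself. The following is a rough sketch under stated assumptions: the function names are hypothetical, and the reverse-strand handling reflects one reading of the position conventions on this page (positions are always given in forward-strand coordinates, and the motif is read on the reverse complement).

```python
STOP_CODONS = {"taa", "tag", "tga"}
_COMPLEMENT = str.maketrans("acgtn", "tgcan")


def revcomp(seq):
    """Reverse complement of a DNA string (result is lower case)."""
    return seq.lower().translate(_COMPLEMENT)[::-1]


def site_matches_sequence(seq, ctype, begin, end, strand):
    """True if the sequence shows the motif that the constraint type requires.
    Coordinates are 1-based and refer to the forward strand."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-' for " + ctype)
    if ctype in ("start", "stop"):
        lo, hi = begin, end
    elif (ctype, strand) in (("dss", "+"), ("ass", "-")):
        lo, hi = begin, begin + 1  # the intron extends to the right of 'begin'
    elif (ctype, strand) in (("ass", "+"), ("dss", "-")):
        lo, hi = begin - 1, begin  # the intron extends to the left of 'begin'
    else:
        raise ValueError("only start, stop, dss and ass are checked")
    if lo < 1 or hi > len(seq):
        return False
    motif = seq[lo - 1:hi].lower()
    if strand == "-":
        motif = revcomp(motif)
    expected = {"start": {"atg"}, "stop": STOP_CODONS, "dss": {"gt"}, "ass": {"ag"}}
    return motif in expected[ctype]
```

A result of False indicates a constraint that contradicts the required motif and therefore cannot be obeyed. Exon and exonpart constraints are not covered, since checking them would require searching for stop codons in all frames.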

allowed gene structure:

predict any number of (possibly partial) genes: This option is set by default. AUGUSTUS may predict no gene at all, one or more genes. The genes at the boundaries of the input sequence may be partial. Partial here means that not all of the exons of a gene are contained in the input sequence, but it is assumed that the sequence starts or ends in a non-coding region.

predict only complete genes: AUGUSTUS assumes that the input sequence does not start or end within a gene. Zero or more complete genes are predicted.

predict only complete genes - at least one: Like the previous option, but AUGUSTUS predicts at least one gene (if possible).

predict exactly one complete gene: AUGUSTUS assumes that the sequence contains exactly one complete gene. Note: This feature does not work properly in combination with alternative transcripts.

ignore conflicts with other strand: By default AUGUSTUS assumes that no genes overlap, not even genes on opposite strands. This usually is the case, but sometimes an intron contains a gene on the opposite strand. In this case, or when AUGUSTUS makes a false prediction on one strand because it falsely thinks there is a conflicting gene on the other strand, AUGUSTUS should be run with this option set. It then predicts the genes on each strand separately and independently. This may lead to more false positive predictions, though.

Go to the submission page.
