Input Sequence Format: The input is one or more DNA sequences in (multiple) FASTA format. Example:
>name of sequence 1
>name of sequence 2
Every letter other than a,c,g,t,A,C,G and T is interpreted as an unknown base. Digits and white spaces are ignored. The number of characters per line is not restricted. There is no restriction on the number of sequences but the total sequence size must be below 8MB.
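The input rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the code the server runs. The function name `read_fasta`, the use of `N` for unknown bases, and the reading of "8MB" as 8 * 1024 * 1024 sequence characters are assumptions made for this example.

```python
from typing import Dict, Iterable, List

# Assumption: "8MB" is read here as 8 * 1024 * 1024 sequence characters.
MAX_TOTAL_BASES = 8 * 1024 * 1024


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse multi-FASTA text and normalize it according to the rules above."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
            continue
        if name is None:
            continue  # assumption: text before the first header is skipped
        for ch in line:
            if ch.isdigit() or ch.isspace():
                continue  # digits and white space are ignored
            # Anything other than a, c, g, t (either case) becomes an unknown
            # base; writing it as 'N' is a choice made for this sketch.
            parts[name].append(ch if ch in "acgtACGT" else "N")
    sequences = {n: "".join(p) for n, p in parts.items()}
    total = sum(len(s) for s in sequences.values())
    if total >= MAX_TOTAL_BASES:
        raise ValueError("total sequence size must be below 8MB")
    return sequences
```

Line length plays no role in the sketch because each sequence is simply concatenated across lines.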
cDNA sequences (ESTs, mRNAs): In this section you can upload a set of ESTs and/or mRNA sequences in FASTA format that might match the query DNA sequence. Sanger ESTs and pyrosequencing (454) reads are both OK. Alien cDNA, i.e. cDNA from another species, is OK, too, but the species should be so closely related that the sequence identity in coding regions is above 90 percent. The cDNA sequences are first aligned to the query sequence(s) using BLAT, and from those alignments hints are generated for AUGUSTUS. AUGUSTUS will use the information to predict a gene structure in agreement with the cDNA alignment data. It will try to obey the hints and will also tend to predict fewer exons that are not supported by cDNA. This feature is only for personal, academic, and non-profit use, as required by the BLAT license.
Note: The quality and abundance of cDNA data varies a lot. Therefore, one parameter set for the hints cannot fit all application settings well. For example, when using full-length mRNA the hints should be more 'convincing' than when using ESTs. Consider downloading AUGUSTUS and adjusting the hints parameters for your application setting.
constraints that anchor the prediction: This option allows the user to force AUGUSTUS to predict an exon, a splice site, a translation start or a translation end point at a certain position in the sequence. We call this a constraint or an anchor. The number and the order of the constraints are arbitrary.
Supported constraints are
|start||The start codon at the translation start (requires ATG)|
|stop||The stop codon at the translation stop (requires TAA, TAG or TGA)|
|ass||Acceptor splice site, the last (most 3') position of an intron (requires AG consensus)|
|dss||Donor splice site, the first (most 5') position of an intron (requires GT consensus)|
|exonpart||An interval that is coding, i.e. it is contained in an exon (may exactly be an exon). May also be a single base.|
|exon||An interval that is an exon (an initial, internal, terminal or single exon).|
|intronpart||An interval that is contained in an intron. May be a single base.|
These constraints can either be uploaded in a file or entered or pasted into the input area. The constraints must be in GFF format, as in the following example:
HS04636 me exonpart 500 506 0 - . source=M
The columns of this format have the following meaning (from left to right).
- The name of the sequence as specified in the FASTA format sequence input file. It is possible to input several sequences at the same time. Then each can have its own constraints.
- A name for the origin of the constraint. Use 'anchor' to enable display of constraints in Gbrowse together with predictions.
- Constraint type. One of start, stop, exonpart, exon, dss, ass, intronpart.
- Begin position. Start counting with 1 at the first position of the input sequence. For dss and ass the begin and end positions are equal and must be the intron position right next to the splice boundary. For start and stop the begin and end positions specify the first and last base of the start or stop codon, respectively, i.e. end = begin + 2.
- End position. Must be at least as large as the begin position. See also begin position.
- Score. A number that is irrelevant here.
- Strand. Must be '+' or '-'. For exon, exonpart and intronpart you can set a '.' if you want to allow both the forward and reverse strand.
- Reading frame. Can be a '.' if unknown or irrelevant. For exonpart and exon this is as defined in the GFF format. On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) until the next codon boundary comes (0, 1 or 2).
- Attribute. Must contain the string source=M. M stands for manual. Other sources such as EST or protein homology are possible, but only in the command line version of AUGUSTUS. Hints from such sources may be ignored.
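The column rules listed above can be checked mechanically before submitting constraints. The following Python sketch is one way to do that. The names `Constraint` and `parse_constraint` are hypothetical, and the sketch assumes that the columns are separated by white space and that sequence names contain no white space.

```python
from dataclasses import dataclass

CONSTRAINT_TYPES = {"start", "stop", "exonpart", "exon", "dss", "ass", "intronpart"}
# Types for which '.' (both strands allowed) is accepted in the strand column.
BOTH_STRANDS_OK = {"exon", "exonpart", "intronpart"}


@dataclass
class Constraint:
    seqname: str
    origin: str
    ctype: str
    begin: int
    end: int
    score: str
    strand: str
    frame: str
    attribute: str


def parse_constraint(line: str) -> Constraint:
    """Split one constraint line into its nine columns and check the rules above."""
    fields = line.split(None, 8)  # assumption: white-space separated columns
    if len(fields) != 9:
        raise ValueError("expected 9 columns")
    seqname, origin, ctype, begin_s, end_s, score, strand, frame, attribute = fields
    begin, end = int(begin_s), int(end_s)
    if ctype not in CONSTRAINT_TYPES:
        raise ValueError(f"unknown constraint type: {ctype}")
    if begin < 1 or end < begin:
        raise ValueError("positions are 1-based and end must not be below begin")
    if ctype in ("dss", "ass") and begin != end:
        raise ValueError("dss/ass need begin == end")
    if ctype in ("start", "stop") and end != begin + 2:
        raise ValueError("start/stop need end == begin + 2")
    if strand not in ("+", "-") and not (strand == "." and ctype in BOTH_STRANDS_OK):
        raise ValueError("invalid strand for this constraint type")
    if frame not in (".", "0", "1", "2"):
        raise ValueError("reading frame must be '.', 0, 1 or 2")
    if "source=M" not in attribute:
        raise ValueError("attribute must contain source=M")
    return Constraint(seqname, origin, ctype, begin, end, score, strand, frame, attribute)
```

The checks only cover the formal rules of the columns. Whether the sequence really has, for example, an ATG at a start constraint is not tested here.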
HS04636 anchor exonpart 500 500 0 . . source=M
enforces a gene structure on HS04636 in which the 500th base is coding. Either a forward strand or a reverse strand exon will contain that base.
HS04636 anchor exonpart 500 510 0 + 0 source=M
enforces a gene structure on HS04636 in which the 11 bases from 500 to 510 are coding on the forward strand. There is a codon boundary between positions 499 and 500.
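The reading frame column determines where the nearest codon boundary lies relative to the interval. As a small worked illustration of the definition given for the reading frame column, the hypothetical helper `first_codon_boundary` below returns the two neighbouring positions between which that boundary falls. It is a sketch of the definition only.

```python
from typing import Tuple


def first_codon_boundary(begin: int, end: int, strand: str, frame: int) -> Tuple[int, int]:
    """Return (left, right), the adjacent positions that flank the codon boundary.

    Forward strand: 'frame' bases follow position begin - 1 before the boundary.
    Reverse strand: 'frame' bases precede position end + 1 before the boundary.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    if strand == "+":
        left = begin - 1 + frame
    elif strand == "-":
        left = end - frame
    else:
        raise ValueError("strand must be '+' or '-'")
    return left, left + 1


# Forward strand interval 500..510 with frame 0: boundary between 499 and 500.
print(first_codon_boundary(500, 510, "+", 0))
```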
The constraint set
HS04636 anchor dss 1000 1000 0 + . source=M
HS04636 anchor intronpart 1000 2000 0 + . source=M
HS04636 anchor ass 2000 2000 0 + . source=M
enforces a gene structure on HS04636 with an intron on the forward strand from position 1000 to 2000.
The constraint set
HS04636 anchor start 1000 1002 0 + . source=M
HS04636 anchor exonpart 1000 1100 0 + . source=M
HS04636 anchor intronpart 1200 1200 0 + . source=M
enforces a gene structure on HS04636 where a gene starts at 1000 with an exon that is at least 101 bp long but at most 200 bp long. Position 1200 must be in a forward strand intron.
The constraint set
seqname anchor dss 1001 1001 0 + . source=M
seqname anchor intronpart 1001 1199 0 + . source=M
seqname anchor exon 1200 1300 0 + . source=M
seqname anchor intronpart 1301 1399 0 + . source=M
seqname anchor exon 1400 1500 0 + . source=M
seqname anchor intronpart 1501 1699 0 + . source=M
seqname anchor ass 1699 1699 0 + . source=M
pins down an internal part of the gene structure. The predicted gene structure must contain an exon sequence a..1000, 1200..1300, 1400..1500, 1700..b for some variable positions a and b.
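A constraint set like the last one follows a regular pattern, so it can be generated from the exon coordinates. The Python sketch below does this for the forward strand only. The function name `pin_internal_chain` and its parameters are hypothetical, and tab-separated output is an assumption based on the usual GFF convention.

```python
from typing import List, Tuple


def pin_internal_chain(
    seqname: str,
    upstream_exon_end: int,
    internal_exons: List[Tuple[int, int]],
    downstream_exon_start: int,
) -> List[str]:
    """Build forward-strand constraints that pin down a chain of internal exons.

    upstream_exon_end is the last base of the exon before the chain,
    downstream_exon_start is the first base of the exon after it.
    """

    def row(ctype: str, begin: int, end: int) -> str:
        # Assumption: columns are tab-separated, as is usual for GFF.
        return "\t".join(
            [seqname, "anchor", ctype, str(begin), str(end), "0", "+", ".", "source=M"]
        )

    intron_begin = upstream_exon_end + 1
    rows = [row("dss", intron_begin, intron_begin)]
    for exon_begin, exon_end in internal_exons:
        rows.append(row("intronpart", intron_begin, exon_begin - 1))
        rows.append(row("exon", exon_begin, exon_end))
        intron_begin = exon_end + 1
    last_intron_end = downstream_exon_start - 1
    rows.append(row("intronpart", intron_begin, last_intron_end))
    rows.append(row("ass", last_intron_end, last_intron_end))
    return rows


for line in pin_internal_chain("seqname", 1000, [(1200, 1300), (1400, 1500)], 1700):
    print(line)
```

With the arguments shown, the sketch reproduces the positions of the seven constraints in the example above.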
With the term gene structure I refer to any meaningful sequence of exons, introns, and intergenic regions. This includes the possibilities of having no genes at all and of having multiple genes. AUGUSTUS tries to predict a gene structure that
- is (biologically) consistent in the following way:
  - No exon contains an in-frame stop codon.
  - The splice sites obey the GT-AG consensus. All complete genes start with ATG and end with a stop codon.
  - Each gene ends before another gene starts.
  - The lengths of single exons and introns exceed a minimal length (species dependent).
- obeys all user constraints.
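Some of the consistency conditions listed above are easy to test on a candidate gene. The Python sketch below checks one complete forward-strand gene. It is not how AUGUSTUS works internally. The function name `check_forward_gene` is hypothetical, the exon list is assumed to be sorted and to include the stop codon in the last exon, and the species dependent minimal lengths are not checked.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_forward_gene(seq: str, exons: List[Tuple[int, int]]) -> List[str]:
    """Return a list of consistency problems for one complete forward-strand gene.

    exons holds sorted, 1-based, inclusive (begin, end) coding intervals.
    Assumption: the last interval ends with the stop codon.
    """
    seq = seq.upper()
    problems = []
    # Each intron must begin with GT and end with AG.
    for (_, prev_end), (next_begin, _) in zip(exons, exons[1:]):
        intron = seq[prev_end:next_begin - 1]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"intron {prev_end + 1}..{next_begin - 1} lacks GT-AG")
    # Concatenate the coding intervals and inspect the codons.
    cds = "".join(seq[begin - 1:end] for begin, end in exons)
    if len(cds) % 3 != 0:
        problems.append("coding length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("gene does not start with ATG")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("gene does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame stop codon")
    return problems
```

An empty result means that none of the tested conditions is violated. It does not mean that the gene structure is the one AUGUSTUS would predict.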
allowed gene structure:
predict any number of (possibly partial) genes: This option is set by default. AUGUSTUS may predict no gene at all, or one or more genes. The genes at the boundaries of the input sequence may be partial. Partial here means that not all of the exons of a gene are contained in the input sequence, but it is assumed that the sequence starts or ends in a non-coding region.
predict only complete genes: AUGUSTUS assumes that the input sequence does not start or end within a gene. Zero or more complete genes are predicted.
predict only complete genes - at least one: As the previous option, but AUGUSTUS predicts at least one gene (if possible).
predict exactly one complete gene: AUGUSTUS assumes that the sequence contains exactly one complete gene. Note: This feature does not work properly in combination with alternative transcripts.
ignore conflicts with other strand: By default AUGUSTUS assumes that no genes - even on opposite strands - overlap. Indeed, this usually is the case, but sometimes an intron contains a gene on the opposite strand. In this case, or when AUGUSTUS makes a false prediction on one strand because it falsely thinks there is a conflicting gene on the other strand, AUGUSTUS should be run with this option set. It then predicts the genes on each strand separately and independently. This may lead to more false positive predictions, though.
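For users of the downloaded command line version, the options above appear to correspond to the `--genemodel` and `--singlestrand` parameters of AUGUSTUS. The Python sketch below only assembles an argument list. The function name `augustus_args` and the mapping table are written for this illustration, so please check the parameter names and values against the documentation of your installed version before relying on them.

```python
from typing import List

# Assumed mapping from the web form options to --genemodel values.
GENEMODEL = {
    "predict any number of (possibly partial) genes": "partial",
    "predict only complete genes": "complete",
    "predict only complete genes - at least one": "atleastone",
    "predict exactly one complete gene": "exactlyone",
}


def augustus_args(
    species: str,
    fasta_path: str,
    option: str = "predict any number of (possibly partial) genes",
    ignore_other_strand: bool = False,
) -> List[str]:
    """Assemble a command line for a local AUGUSTUS run (nothing is executed)."""
    args = ["augustus", f"--species={species}", f"--genemodel={GENEMODEL[option]}"]
    if ignore_other_strand:
        # Assumed counterpart of 'ignore conflicts with other strand'.
        args.append("--singlestrand=true")
    args.append(fasta_path)
    return args


print(" ".join(augustus_args("human", "query.fa", ignore_other_strand=True)))
```

The file name `query.fa` and the species `human` are placeholders.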