>name of sequence 1
>name of sequence 2
Every letter other than a, c, g, t, A, C, G and T is interpreted as an unknown base. Digits and whitespace are ignored. The number of characters per line is not restricted. There is no restriction on the number of sequences, but the total sequence length must be below 3 megabases.
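The input rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the server's actual parser: the function name `read_fasta`, the placeholder name `unnamed`, and the choice to map non-letter symbols (other than digits and whitespace) to `n` are assumptions made for this example.

```python
def read_fasta(text, max_total=3_000_000):
    """Parse multi-FASTA text according to the input rules described above."""
    sequences = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            # Header line: everything after '>' is the sequence name.
            name = line[1:].strip()
            sequences[name] = []
            continue
        if name is None:
            # Sequence data before any header (placeholder name is an assumption).
            name = "unnamed"
            sequences[name] = []
        for ch in line:
            if ch.isdigit() or ch.isspace():
                continue  # digits and whitespace are ignored
            # Anything other than a, c, g, t (either case) becomes an unknown base.
            sequences[name].append(ch if ch in "acgtACGT" else "n")
    result = {n: "".join(chars) for n, chars in sequences.items()}
    total = sum(len(s) for s in result.values())
    if total >= max_total:
        raise ValueError(f"total length {total} is not below {max_total} bases")
    return result
```

For example, `read_fasta(">s1\nacgt 10\nNNXX\n")` yields one sequence named `s1` in which the digits and the space are dropped and the four non-ACGT letters are turned into unknown bases.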
Note: The quality and abundance of cDNA data vary a lot. Therefore, one parameter set for the hints cannot fit all application settings well. For example, when using full-length mRNA the hints should be more 'convincing' than when using ESTs. Consider downloading AUGUSTUS and adjusting the hints parameters for your application setting.
Supported constraints are
|start||The start codon at the translation start (requires ATG)|
|stop||The stop codon at the translation stop (requires TAA, TAG or TGA)|
|ass||Acceptor splice site, the last (most 3') position of an intron (requires AG consensus)|
|dss||Donor splice site, the first (most 5') position of an intron (requires GT consensus)|
|exonpart||An interval that is coding, i.e. it is contained in an exon (may exactly be an exon). May also be a single base.|
|exon||An interval that is an exon (an initial, internal, terminal or single exon).|
|intronpart||An interval that is contained in an intron. May be a single base.|
These constraints can either be uploaded as a file or typed or pasted into the input area. The constraints must be in GFF format, as in the following example:
HS04636 me exonpart 500 506 0 - . source=M
The columns of this format have the following meaning (from left to right).
- The name of the sequence as specified in the FASTA format sequence input file. It is possible to input several sequences at the same time. Then each can have its own constraints.
- A name for the origin of the constraint. Use 'anchor' to enable display of constraints in Gbrowse together with predictions.
- Constraint type. One of start, stop, exonpart, exon, dss, ass, intronpart.
- Begin position. Start counting with 1 at the first position of the input sequence. For dss and ass, the begin and end positions are equal and must be the intron position directly adjacent to the splice boundary. For start and stop, the begin and end positions specify the first and last base of the start or stop codon, respectively, i.e. end = begin + 2.
- End position. Must be at least as large as the begin position. See also begin position.
- Score. A number; its value is irrelevant here.
- Strand. Must be '+' or '-'. For exon, exonpart and intronpart you can set a '.' if you want to allow both the forward and reverse strand.
- Reading frame. Can be a '.' if unknown or irrelevant. For exonpart and exon this is as defined in the GFF format. On the forward strand it is the number of bases after (begin position - 1) until the next codon boundary comes (0, 1 or 2). On the reverse strand it is the number of bases before (end position + 1) until the next codon boundary comes (0, 1 or 2).
- Attribute. Must contain the string source=M. M stands for manual. Other sources such as EST or protein homology are possible, but only in the command line version of AUGUSTUS. Hints from those sources may be ignored.
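The column rules above lend themselves to a quick check before submitting constraints. The following is a minimal sketch that only encodes the rules listed here; the function name `check_hint` and the wording of its messages are made up for this example, and the server's own validation may differ.

```python
HINT_TYPES = {"start", "stop", "exonpart", "exon", "dss", "ass", "intronpart"}
# Types for which '.' (both strands allowed) is permitted in the strand column.
BOTH_STRAND_TYPES = {"exon", "exonpart", "intronpart"}


def check_hint(line):
    """Return a list of problems found in one constraint line (empty if none)."""
    fields = line.split(None, 8)  # split on tabs or spaces into at most 9 columns
    if len(fields) != 9:
        return [f"expected 9 columns, found {len(fields)}"]
    seqname, origin, kind, begin, end, score, strand, frame, attribute = fields
    problems = []
    if kind not in HINT_TYPES:
        problems.append(f"unknown constraint type '{kind}'")
    try:
        begin, end = int(begin), int(end)
    except ValueError:
        return problems + ["begin and end must be integers"]
    if begin < 1:
        problems.append("positions are counted from 1")
    if end < begin:
        problems.append("end must be at least as large as begin")
    if kind in {"dss", "ass"} and begin != end:
        problems.append("dss and ass need begin == end")
    if kind in {"start", "stop"} and end != begin + 2:
        problems.append("start and stop need end == begin + 2")
    if strand == ".":
        if kind not in BOTH_STRAND_TYPES:
            problems.append("'.' strand is only allowed for exon, exonpart, intronpart")
    elif strand not in ("+", "-"):
        problems.append("strand must be '+' or '-'")
    if frame not in (".", "0", "1", "2"):
        problems.append("frame must be '.', 0, 1 or 2")
    if "source=M" not in attribute:
        problems.append("attribute must contain source=M")
    return problems
```

A line such as `HS04636 me exonpart 500 506 0 - . source=M` passes with an empty list, while a start constraint covering only two bases is reported.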
HS04636 anchor exonpart 500 500 0 . . source=M
enforces a gene structure on HS04636 in which the 500th base is coding. Either a forward strand or a reverse strand exon will contain that base.
HS04636 anchor exonpart 500 510 0 + 0 source=M
enforces a gene structure on HS04636 in which the 11 bases from 500 to 510 are coding on the forward strand. There is a codon boundary between positions 499 and 500.
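The reading frame value can be worked out from the position of any codon that lies in the same uninterrupted coding stretch as the constraint. The sketch below is one way to do this arithmetic; the helper names `forward_frame` and `reverse_frame` are hypothetical, and `codon_first_base` is assumed to be the coordinate of the first base of such a codon (on the reverse strand, that is the codon's highest coordinate).

```python
def forward_frame(begin, codon_first_base):
    """Bases after (begin - 1) until the next codon boundary, forward strand."""
    return (codon_first_base - begin) % 3


def reverse_frame(end, codon_first_base):
    """Bases before (end + 1) until the next codon boundary, reverse strand."""
    return (end - codon_first_base) % 3


# The example above: a codon starts at 500, so the frame at begin = 500 is 0.
print(forward_frame(500, 500))
```

If the codon instead started at position 501, one base (position 500) would precede the next codon boundary and `forward_frame(500, 501)` would give 1.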
The constraint set
HS04636 anchor dss 1000 1000 0 + . source=M
HS04636 anchor intronpart 1000 2000 0 + . source=M
HS04636 anchor ass 2000 2000 0 + . source=M
enforces a gene structure on HS04636 with an intron on the forward strand from position 1000 to 2000.
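A constraint set of this kind can be generated rather than typed by hand. The following sketch builds the three lines for one complete intron; the function name `intron_hints` is an assumption for this example, and placing the donor site at the higher coordinate on the reverse strand is my reading of the "most 5'" definition of dss given earlier.

```python
def intron_hints(seqname, first, last, strand="+", origin="anchor"):
    """Build dss, intronpart and ass constraint lines for one complete intron.

    first and last are the 1-based coordinates of the first and last intron
    position on the input sequence (first <= last).
    """
    if strand not in ("+", "-"):
        raise ValueError("dss and ass need a strand of '+' or '-'")
    if last < first:
        raise ValueError("last must be at least as large as first")
    # The donor site is the most 5' intron position on the given strand,
    # the acceptor site the most 3' one.
    donor, acceptor = (first, last) if strand == "+" else (last, first)
    rows = [
        ("dss", donor, donor),
        ("intronpart", first, last),
        ("ass", acceptor, acceptor),
    ]
    return [
        "\t".join([seqname, origin, kind, str(b), str(e), "0", strand, ".", "source=M"])
        for kind, b, e in rows
    ]


for line in intron_hints("HS04636", 1000, 2000):
    print(line)
```

Called with `("HS04636", 1000, 2000)`, this produces the same three constraints as the set above, with tab-separated columns.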
The constraint set
HS04636 anchor start 1000 1002 0 + . source=M
HS04636 anchor exonpart 1000 1100 0 + . source=M
HS04636 anchor intronpart 1200 1200 0 + . source=M
enforces a gene structure on HS04636 where a gene starts at 1000 with an exon that is at least 101 bp long but at most 200 bp long. Position 1200 must be in a forward strand intron.
The constraint set
seqname anchor dss 1001 1001 0 + . source=M
seqname anchor intronpart 1001 1199 0 + . source=M
seqname anchor exon 1200 1300 0 + . source=M
seqname anchor intronpart 1301 1399 0 + . source=M
seqname anchor exon 1400 1500 0 + . source=M
seqname anchor intronpart 1501 1699 0 + . source=M
seqname anchor ass 1699 1699 0 + . source=M
pins down an internal part of the gene structure. The predicted gene structure must contain an exon sequence a..1000, 1200..1300, 1400..1500, 1700..b for some variable positions a and b.
By gene structure I mean any meaningful sequence of exons, introns, and intergenic regions. This includes the possibilities of having no genes at all and of having multiple genes. AUGUSTUS tries to predict a gene structure that
- is (biologically) consistent in the following way:
  - No exon contains an in-frame stop codon.
  - The splice sites obey the GT-AG consensus. All complete genes start with ATG and end with a stop codon.
  - Each gene ends before another gene starts.
  - The lengths of single exons and introns exceed a minimal length (species dependent).
- obeys all user constraints.
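Some of the consistency conditions above are easy to verify on a candidate structure. The sketch below checks a single complete forward-strand gene against the start codon, stop codon, in-frame stop and GT-AG rules. It is not how AUGUSTUS works internally; the function name `check_forward_gene` is hypothetical, the exon coordinates are assumed to cover the coding sequence including the stop codon, and the species-dependent minimal lengths and the non-overlap of genes are not checked.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_forward_gene(seq, exons):
    """Check one complete forward-strand gene for basic consistency.

    seq is the input sequence; exons is a sorted list of (begin, end) pairs,
    1-based and inclusive. Returns a list of problems (empty if none).
    """
    seq = seq.upper()
    problems = []
    # Concatenate the exons to obtain the coding sequence.
    cds = "".join(seq[b - 1:e] for b, e in exons)
    if len(cds) % 3 != 0:
        problems.append("coding length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("gene does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("gene does not end with a stop codon")
    # Every codon except the last one must not be a stop codon.
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(f"in-frame stop codon at coding offset {i}")
    # Each intron lies between two consecutive exons.
    for (_, e1), (b2, _) in zip(exons, exons[1:]):
        if seq[e1:e1 + 2] != "GT":
            problems.append(f"intron starting at {e1 + 1} does not begin with GT")
        if seq[b2 - 3:b2 - 1] != "AG":
            problems.append(f"intron ending at {b2 - 1} does not end with AG")
    return problems
```

For a toy sequence such as `"ATGAAA" + "GTCCCCAG" + "CCCTAA"` with exons `[(1, 6), (15, 20)]`, the function returns an empty list.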
predict only complete genes: AUGUSTUS assumes that the input sequence does not start or end within a gene. Zero or more complete genes are predicted.
predict only complete genes - at least one: As the previous option, but AUGUSTUS predicts at least one gene (if possible).
predict exactly one complete gene: AUGUSTUS assumes that the sequence contains exactly one complete gene. Note: This feature does not work properly in combination with alternative transcripts.
ignore conflicts with other strand: By default AUGUSTUS assumes that no genes overlap, not even genes on opposite strands. This is usually the case, but sometimes an intron contains a gene on the opposite strand. In this case, or when AUGUSTUS makes a false prediction on one strand because it falsely thinks there is a conflicting gene on the other strand, AUGUSTUS should be run with this option set. It then predicts the genes on each strand separately and independently. This may lead to more false positive predictions, though.