gff file for training, accepted features?

Post by **katharina** » Fri Nov 20, 2015 1:18 pm

Originally posted in the old forum by Viola Manning on 15.01.2013 - 21:39

Hello Augustus folks,
I would like to train Augustus to annotate multiple genomes of an organism
that is already well annotated. To train Augustus, I see that I can use a
gene structure file in gff format, which I have, but the features in column
three differ from what the training tutorial says is allowed.
For example, the training tutorial says that the features may be CDS, 5'-UTR
or 3'-UTR, but my gff3 file contains additional feature types, and the 5' and
3' UTRs are written out as five_prime_UTR and three_prime_UTR. Here is an example:

supercont1.8 FINAL_CALLGENES_1 gene 39095 40196 . +	. ID 
supercont1.8 FINAL_CALLGENES_1 mRNA 39095 40196 . +	. ID 
supercont1.8 FINAL_CALLGENES_1 start_codon 39095 39097 .	+ 0 ID 
supercont1.8 FINAL_CALLGENES_1 exon 39095 39160 . +	. ID 
supercont1.8 FINAL_CALLGENES_1 exon 39260 39379 . +	. ID 
supercont1.8 FINAL_CALLGENES_1 exon 39937 40014 . +	. ID 
supercont1.8 FINAL_CALLGENES_1 exon 40077 40196 . +	. ID 
supercont1.8 FINAL_CALLGENES_1 CDS 39095 39160 . +	0 ID 
supercont1.8 FINAL_CALLGENES_1 CDS 39260 39379 . +	0 ID 
supercont1.8 FINAL_CALLGENES_1 CDS 39937 40014 . +	0 ID 
supercont1.8 FINAL_CALLGENES_1 CDS 40077 40196 . +	0 ID 
supercont1.8 FINAL_CALLGENES_1 stop_codon 40194 40196 .	+ 0 ID

Should I modify my gff file to contain only these three feature types and
change the UTR names to the abbreviations, or will the training server also
accept the gff3 file formatted as above?

Post by **katharina** » Fri Nov 20, 2015 1:19 pm

by katharina on 16.01.2013 - 10:58
Hi Viola,
the AUGUSTUS web service will accept all features in your file, i.e. you can upload the file in its current format.
However, the web service calls scripts from the AUGUSTUS distribution. The script gff2gbSmallDNA.pl, which we use to convert a gff file and a genome file into a GenBank-format gene structure file for training AUGUSTUS, ignores UTR entries that are not spelled exactly as "5'-UTR" or "3'-UTR". Therefore, you should indeed rename your UTR features.
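The renaming can be scripted. Below is a minimal Python sketch, assuming a tab-delimited file with the feature type in column three; the function name `rename_utr_line` and the file names in the usage note are only illustrative and not part of the AUGUSTUS distribution.

```python
# Map the GFF3-style UTR names to the spellings mentioned above.
UTR_RENAMES = {
    "five_prime_UTR": "5'-UTR",
    "three_prime_UTR": "3'-UTR",
}


def rename_utr_line(line):
    """Return one GFF line with the feature column (column 3) renamed if it is a UTR.

    Comment lines, blank lines and lines with fewer than three
    tab-separated fields are returned unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return line
    fields[2] = UTR_RENAMES.get(fields[2], fields[2])
    return "\t".join(fields) + newline


def rename_utr_file(in_path, out_path):
    """Copy in_path to out_path, renaming UTR features on the way."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(rename_utr_line(line))
```

For example, `rename_utr_file("genes.gff3", "genes.renamed.gff")` would write a copy with the UTR features renamed and all other lines left untouched (both file names are placeholders).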
Note that the autoAug.pl pipeline invoked by the web service currently uses UTR features in a training gene structure file ONLY to exclude UTRs of neighboring genes from the flanking region in the GenBank file. UTR parameters are currently not trained from gene structure files. (UTR training is, however, performed on UTR training examples that are generated automatically from a cDNA file and a genome file.)
Therefore, if you need UTR parameters, I advise you to train them locally using the instructions given at http://bioinf.uni-greifswald.de/bioinf/ ... TRTraining
You probably simplified the file format example for this question. If it is an excerpt from your actual file, pay attention to the last column: all features that belong to the same gene structure must share a grouping tag that distinguishes them from the features of other genes. This is not the case if the grouping tags of ALL genes are named "ID".
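A quick way to spot this problem before uploading is to count how many features carry each value in the last column. This is only a rough Python sketch: it assumes a tab-delimited file and treats the entire ninth column as the grouping tag, which may not match exactly how the AUGUSTUS scripts parse it. The function names are made up for illustration.

```python
from collections import Counter


def group_sizes(lines):
    """Count features per grouping tag (assumed here to be the whole ninth column)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete nine-column feature line
        counts[fields[8].strip()] += 1
    return counts


def looks_ungrouped(lines):
    """Heuristic: True when every feature in the input shares one single tag.

    A file that really contains only one gene also returns True,
    so treat the result as a hint, not as proof.
    """
    return len(group_sizes(lines)) == 1
```

If `looks_ungrouped(open("genes.gff"))` returns True for a file with many genes (the file name is a placeholder), the last column most likely does not separate the genes from each other.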
Katharina
