
Problem etraining CEGMA input

Posted: Fri Nov 20, 2015 1:00 pm
by katharina
Originally posted in the old forum by ebioman on 12.02.2014 - 14:06

Hello,
I used the CEGMA output to train Augustus:
1. I took the GFF and converted it to Augustus style (I did it with awk, since I did not have your script).
2. I used your script to extract the GenBank format:
./gff2gbSmallDNA.pl cegma_annotation.corrected2.gff genome 1000 cegma_mapped_V2.gb
3. I then wanted to use the resulting GenBank file for training (it contained 270 entries).
4. After the random split, I used 200 entries for training:

Code:

     etraining --species=tmp cegma_mapped_V2.gb.train
Problem: I get many errors with the following pattern:

Code:

 One CDS exon begins before the previous CDS exon ends.474 >= 302
 GBProcessor::getGeneList(): Intron has negative length.
 Encountered error after reading 0 annotations.

 etraining: ERROR
	No genbank sequences found.
Any idea what is likely causing the problem?
Thanks

Re: Problem etraining CEGMA input

Posted: Fri Nov 20, 2015 1:00 pm
by katharina
by katharina on 13.02.2014 - 10:29
The CEGMA pipeline does not necessarily generate gene structures that AUGUSTUS regards as "valid", and any invalid structure is skipped. It seems that in your case not a single valid gene structure was produced.
I sometimes get those errors too, but usually not for the complete file, and then I simply remove the error-causing entries. In your case, I suppose you need a new training gene structure file generated from a different source.
Katharina
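
Removing the error-causing entries by hand gets tedious for larger files, so here is a minimal Python sketch of one way it might be automated. It is not part of AUGUSTUS. It assumes the GenBank file has one record per "//" terminator and that each CDS feature gives its exons as start..end pairs. It only checks the condition named in the error message above (an exon starting at or before the end of the previous one), so etraining may still reject entries for other reasons. All function names are hypothetical.

```python
import re
import sys


def split_records(text):
    """Split GenBank flat-file text into records, each ending with '//'."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def cds_locations(record):
    """Yield the location string of every CDS feature in one record."""
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        match = re.match(r"^\s+CDS\s+(\S.*)$", lines[i])
        i += 1
        if not match:
            continue
        location = match.group(1).strip()
        # Long locations continue on indented lines until a "/qualifier" line.
        while (i < len(lines) and lines[i].startswith(" " * 21)
               and not lines[i].strip().startswith("/")):
            location += lines[i].strip()
            i += 1
        yield location


def exons_are_ordered(location):
    """Return True if the exons are ascending and do not overlap."""
    exons = [(int(a), int(b))
             for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", location)]
    previous_end = 0
    for start, end in exons:
        if start > end or start <= previous_end:
            return False
        previous_end = end
    return True


def filter_records(text):
    """Return (kept, dropped) lists of GenBank records."""
    kept, dropped = [], []
    for record in split_records(text):
        if all(exons_are_ordered(loc) for loc in cds_locations(record)):
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


if __name__ == "__main__" and len(sys.argv) == 3:
    # Arguments: input GenBank file, output GenBank file.
    with open(sys.argv[1]) as handle:
        kept, dropped = filter_records(handle.read())
    with open(sys.argv[2], "w") as handle:
        handle.writelines(kept)
    print(f"kept {len(kept)} records, dropped {len(dropped)}", file=sys.stderr)
```

Saved under a name of your choice, for example filter_genbank.py, it would take the input file and an output file as its two arguments. If nearly every record is dropped, as seems to be the case in this thread, the GFF conversion step is worth checking before looking for a different source of training genes.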