Problem etraining CEGMA input

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald

Post by katharina »

Originally posted in the old forum by ebioman on 12.02.2014 - 14:06

Hello,
I used the CEGMA output to train AUGUSTUS:
1. I took the GFF and converted it to AUGUSTUS style (I did this with awk, since I did not have your script).
2. I used your script to extract GenBank format:
./gff2gbSmallDNA.pl cegma_annotation.corrected2.gff genome 1000 cegma_mapped_V2.gb
3. I then wanted to use the GenBank file for training (it contained 270 entries).
4. After the random split, I used 200 entries for training:

Code:

     etraining --species=tmp cegma_mapped_V2.gb.train
Problem: many errors with the following pattern

Code:

 One CDS exon begins before the previous CDS exon ends.474 >= 302
 GBProcessor::getGeneList(): Intron has negative length.
 Encountered error after reading 0 annotations.

 etraining: ERROR
	No genbank sequences found.
Any idea what is likely causing the problem?
Thanks
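
For readers hitting the same message, the condition it reports can be reproduced outside AUGUSTUS. The following is a minimal sketch, not AUGUSTUS code: it assumes that the two numbers in "474 >= 302" are the end of the previous CDS exon and the start of the next one, which is an interpretation of the message and is not confirmed in this thread. The function names and the example coordinates are hypothetical. It distinguishes exons that are merely listed in the wrong order (which a hand-written GFF conversion could plausibly produce) from exons that truly overlap.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, as in GFF


def find_conflicts(exons: List[Exon]) -> List[Tuple[Exon, Exon]]:
    """Return adjacent pairs where the previous exon ends at or after the next one's start."""
    return [(prev, cur) for prev, cur in zip(exons, exons[1:]) if prev[1] >= cur[0]]


def diagnose(exons: List[Exon]) -> str:
    """Classify one gene's CDS exon list as 'ok', 'order', or 'overlap'."""
    if not find_conflicts(exons):
        return "ok"
    # If sorting by start coordinate removes every conflict,
    # only the listing order was wrong.
    if not find_conflicts(sorted(exons)):
        return "order"
    return "overlap"


# Hypothetical coordinates chosen to mimic the "474 >= 302" pattern.
print(diagnose([(400, 474), (302, 380)]))  # exons listed in descending order
print(diagnose([(100, 200), (150, 300)]))  # exons that genuinely overlap
```

If every gene comes out as "order", sorting the CDS lines of each gene by start coordinate before running gff2gbSmallDNA.pl may be worth trying; if they come out as "overlap", the gene models themselves are inconsistent.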

Re: Problem etraining CEGMA input

Post by katharina »

by katharina on 13.02.2014 - 10:29
The CEGMA pipeline does not necessarily generate gene structures that AUGUSTUS regards as "valid", and if a structure is invalid, it is skipped. It seems that in your case, not a single valid gene structure was produced.
I sometimes get these errors too, but usually not for the whole file; I simply remove the entries that cause them. In your case, I suppose you need a new training gene structure file, generated from a different source.
Katharina
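
Removing the error-causing entries can be automated for the specific error shown above. The following is a minimal sketch under several assumptions: records in the GenBank file are terminated by a line containing only `//`, each record has one `CDS` feature whose location uses `start..end` pairs, and a record counts as bad when a CDS exon starts at or before the end of the previous one. The function names are hypothetical, and AUGUSTUS may reject entries for other reasons that this check does not cover.

```python
import re
from typing import List, Tuple

Interval = Tuple[int, int]


def cds_intervals(record: str) -> List[Interval]:
    """Collect (start, end) pairs from the first CDS feature of one record."""
    location = ""
    in_cds = False
    for line in record.splitlines():
        fields = line.split()
        if not in_cds:
            if len(fields) == 2 and fields[0] == "CDS":
                in_cds = True
                location = fields[1]
        elif len(fields) == 1 and (fields[0][0].isdigit() or fields[0][0] in "<>"):
            location += fields[0]  # continuation line of a wrapped location
        else:
            break  # qualifier, next feature, or ORIGIN: location is complete
    return [(int(a), int(b)) for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", location)]


def is_valid(intervals: List[Interval]) -> bool:
    """True if there is at least one exon and each exon starts after the previous one ends."""
    return bool(intervals) and all(p[1] < c[0] for p, c in zip(intervals, intervals[1:]))


def filter_records(text: str) -> Tuple[str, int]:
    """Return the text of the records that pass is_valid, and the number dropped."""
    records = [r.strip("\n") for r in re.split(r"^//[ \t]*$", text, flags=re.MULTILINE)]
    records = [r for r in records if r.strip()]
    kept = [r for r in records if is_valid(cds_intervals(r))]
    return "".join(r + "\n//\n" for r in kept), len(records) - len(kept)
```

To use it, read the training file (for example `cegma_mapped_V2.gb.train` from the first post) into a string, pass it to `filter_records`, and write the returned text to a new file before running etraining. If the dropped count equals the total number of records, as appears to be the case here, filtering will not help and a different source of training genes is needed.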