How to correct problematic loci and genes removed and filtered out by AUGUSTUS

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald

Post by katharina

Originally posted by ThankGod Ebenizer in the old forum on 31.07.2014 - 18:55

Hello Dear,

I'm trying to train AUGUSTUS for a new organism that has no existing parameter set, using the automatic training script. I have been running it like this:

Code:

$ autoAugTrain.pl --useexisting --species=Euglena_gracilis --trainingset=exonerate_gene_models.gff --genome=genome.fasta >& augustus_training.log &

And then I get this error:

Code:

GBProcessor::getGeneList(): Stop codon out of sequence bounds. Ignoring sequence. 
Encountered error after reading 44 annotations. 
GBProcessor::getGeneList(): Stop codon out of sequence bounds. Ignoring sequence. 
Encountered error after reading 63 annotations. 
GBProcessor::getGeneList(): Stop codon out of sequence bounds. Ignoring sequence. 
Encountered error after reading 67 annotations.
Followed by these:
ExonModel::processInternalExon: in-frame stop codon 
ExonModel::processInternalExon: in-frame stop codon 
ExonModel::processInternalExon: in-frame stop codon 
ExonModel::processInternalExon: in-frame stop codon 
ExonModel::processInternalExon: in-frame stop codon 
ExonModel::processInternalExon: in-frame stop codon

This prompted me to filter out all the problematic loci and genes as follows:

1. Convert the .gff file to GenBank (.gb) format:

Code:

$ gff2gbSmallDNA.pl exonerate_gene_models.gff genome.fasta 1000 genes.raw.gb

2. Run etraining on the GenBank file and collect the error messages for the problematic loci:

Code:

$ etraining --species=generic --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.err

3. Filter out the problematic genes:

Code:

$ cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
$ filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
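
Note that perl -pe prints every input line, so any line of train.err that lacks the "in sequence <locus>:" pattern ends up in badgenes.lst unchanged. A minimal sketch of a stricter variant, assuming etraining messages have the form "gene ... in sequence <locus>: <message>", keeps only the matching lines and removes duplicate locus names. The function name list_bad_loci is hypothetical and not part of AUGUSTUS.

```shell
# Hypothetical helper (not part of AUGUSTUS): print each locus name that
# appears in an "in sequence <locus>: <message>" line, once.
list_bad_loci() {
    # $1: stderr log written by etraining, e.g. train.err
    # "sed -n ... p" prints only the lines where the substitution matched.
    sed -n 's/.*in sequence \([^ ]*\): .*/\1/p' "$1" | sort -u
}
```

For example, running list_bad_loci train.err > badgenes.lst and then wc -l badgenes.lst would give the number of distinct loci with at least one message.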

When I then count the loci in both files:

Code:

$ grep -c "LOCUS" genes.raw.gb genes.gb

this is what I get:

Code:

> genes.raw.gb: 37502 
> genes.gb: 1

This means that of 37502 loci, 37501 were flagged as problematic and only 1 gene passed. In effect, nearly all of the genes were filtered out.

When I open the train.err file, this is what I see:

Code:

> gene ID=gene_24240 transcr. 1 in sequence scaffold_3_1-4377: coding length not a multiple of 3. Skipping... 
> gene ID=gene_11160 transcr. 1 in sequence scaffold_10_1-1066: Single exon gene does not begin with start codon but with cgg 
> gene ID=gene_20354 transcr. 1 in sequence scaffold_13_6205-17145: Initial exon does not begin with start codon but with tct 
> gene ID=gene_20354 transcr. 1 in sequence scaffold_13_6205-17145: in-frame stop codon 
> gene ID=gene_3619 transcr. 1 in sequence scaffold_17_364-1717: Single exon gene does not begin with start codon but with atc 
> gene ID=gene_53340 transcr. 1 in sequence scaffold_27_2113-10791: Initial exon does not begin with start codon but with gtt 
> gene ID=gene_53340 transcr. 1 in sequence scaffold_27_2113-10791: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right? 
> gene ID=gene_21924 transcr. 1 in sequence scaffold_31_1345-3899: coding length not a multiple of 3. Skipping... 
> gene ID=gene_7817 transcr. 1 in sequence scaffold_39_1-458: coding length not a multiple of 3. Skipping... 
............................................................... ............................................................... ...............................................................
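
To see which kind of message dominates in a log this long, the messages can be counted by type. This is only a sketch, assuming every relevant line has the form "gene ... in sequence <locus>: <message>" as in the excerpt above. The function name tally_train_errors is hypothetical and not part of AUGUSTUS.

```shell
# Hypothetical helper (not part of AUGUSTUS): count etraining messages by type.
tally_train_errors() {
    # $1: stderr log written by etraining, e.g. train.err
    grep 'in sequence ' "$1" |
        sed -e 's/[[:space:]]*$//' \
            -e 's/.*in sequence [^ ]*: //' \
            -e 's/ but with [a-z]*$//' |
        sort | uniq -c | sort -rn
}
```

The three sed expressions strip trailing whitespace, drop everything up to and including the locus name, and remove the observed codon ("but with cgg") so that all missing-start-codon messages fall into one group.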

Is there a way to correct these problematic loci (the genes that were filtered out)? I cannot train AUGUSTUS with just 1 gene.
I would appreciate any advice on how to address this.
Regards,
ThankGod
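
The "coding length not a multiple of 3" messages can also be checked directly on the GFF input, before conversion to GenBank. A rough sketch follows, assuming a tab-separated GFF file in which all CDS lines of one transcript share an identical ninth (attribute) column; that assumption may not hold for every exonerate output. The function name cds_length_report is hypothetical and not part of AUGUSTUS.

```shell
# Hypothetical helper (not part of AUGUSTUS): list transcripts whose summed
# CDS length is not a multiple of 3.
cds_length_report() {
    # $1: tab-separated GFF file, e.g. exonerate_gene_models.gff
    awk -F '\t' '
        # Column 3 is the feature type, columns 4 and 5 are start and end.
        tolower($3) == "cds" { len[$9] += $5 - $4 + 1 }
        END {
            for (id in len)
                if (len[id] % 3 != 0)
                    print id "\t" len[id]
        }
    ' "$1" | sort
}
```

Each output line holds the attribute column and the total CDS length of a transcript that would trigger the message. No output means every transcript has a CDS length divisible by 3.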

Re: How to correct problematic loci and genes removed and filtered out by AUGUSTUS

Post by katharina

Originally posted by katharina in the old forum on 13.08.2014 - 11:29

I have no experience in using exonerate to generate training genes. I usually use Scipio if I need to generate the training genes from proteins and a genome. (Scipio is also supported by the autoAug.pl pipeline.)
My first idea is to check whether --stopCodonExcludedFromCDS=true actually matches your data. If all but a very few genes have a problem, the stop codon might in fact be part of the CDS, in which case the setting should be false.
Katharina
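
One way to test this suggestion is to run etraining once with each setting and compare how many messages each run produces. This is only a sketch: it reuses the etraining call from the first post, and the function name compare_stop_codon_setting and the log file names train.true.err and train.false.err are hypothetical.

```shell
# Hypothetical helper: run etraining with both stop codon settings and
# report how many "in sequence" messages each run writes to stderr.
compare_stop_codon_setting() {
    # $1: GenBank training file, e.g. genes.raw.gb
    for setting in true false; do
        etraining --species=generic --stopCodonExcludedFromCDS="$setting" "$1" \
            > /dev/null 2> "train.$setting.err"
        # grep -c exits non-zero when the count is 0, so ignore its status.
        n=$(grep -c 'in sequence ' "train.$setting.err" || true)
        echo "stopCodonExcludedFromCDS=$setting: $n messages"
    done
}
```

If the count drops sharply with one of the two settings, that setting is more likely to match how the CDS features in the training file were built.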

Re: How to correct problematic loci and genes removed and filtered out by AUGUSTUS

Post by katharina

Originally posted by Mbandi in the old forum on 15.08.2014 - 10:47

Hi Katharina,
I also get these error messages while running optimize_augustus.pl. They did not occur when I ran the first training attempt. --stopCodonExcludedFromCDS is set to true in *_parameters.cfg.
Thanks,
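
The value stored in the parameter file can be printed with a short helper. This sketch assumes the usual layout in which the file sits at $AUGUSTUS_CONFIG_PATH/species/<species>/<species>_parameters.cfg; that path is an assumption and is not stated in this thread. The function name show_stop_codon_setting is hypothetical.

```shell
# Hypothetical helper: print the stopCodonExcludedFromCDS line from a
# species parameter file.
show_stop_codon_setting() {
    # $1: species name as passed to --species
    # Assumed layout: $AUGUSTUS_CONFIG_PATH/species/<name>/<name>_parameters.cfg
    cfg="${AUGUSTUS_CONFIG_PATH:?set AUGUSTUS_CONFIG_PATH first}/species/$1/$1_parameters.cfg"
    grep '^[[:space:]]*stopCodonExcludedFromCDS' "$cfg"
}
```

For example, show_stop_codon_setting Euglena_gracilis would print the line for the species from the first post, provided that species directory exists.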