Training with proteins - problems of ATG necessity

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Post by katharina »

Originally posted in the old forum by ebioman on 07.03.2014 - 13:52
Hello
I encountered a problem and wondered whether it might actually be a "feature" and, if so, whether there might be a hack.
For the training I took manually curated proteins from a closely related species. Mapping the proteins onto the genome, I found that the methionine at the beginning of the protein often does not align exactly to an ATG in the genome, even though the mapping might otherwise be useful.
I thought such genes might still hold useful information for the training, but AUGUSTUS removes every gene that does not start with an ATG.
So I wondered whether, if another ATG is present in the genome sequence a few bases upstream, I could force AUGUSTUS to use that one, and whether doing so might actually spoil the prediction.
I hope it is clear what I wanted to say.
Thanks

Re: Training with proteins - problems of ATG necessity

Post by katharina »

by katharina on 01.12.2014 - 16:25
AUGUSTUS and etraining should not have a problem with the start codons ATG, GTG and TTG. But it is correct that genes without a valid start codon (no ATG, no GTG, no TTG) cause an error message, and they should, because it is important for training that the gene structure is complete.
You could modify your training gene structures in the suggested way, but I would not recommend it. I would rather try to find the real problem (why do the genes not have start codons?). If you shorten genes systematically and incorrectly, the trained gene length distribution will be incorrect, and the predicted genes might subsequently be systematically too short (and, in addition, probably have the wrong start codon).
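To look for the real problem, it may help to tally which first codons the rejected training genes actually have. Below is a minimal Python sketch, not part of AUGUSTUS. It assumes you have already extracted the coding sequence of each training gene (spliced, and reverse-complemented for minus-strand genes) into a dictionary. The names split_by_start_codon, first_codon and cds_by_gene are hypothetical, and the example sequences are made up.

```python
from collections import Counter

# Start codons that, according to the reply above, AUGUSTUS and etraining accept.
VALID_STARTS = {"ATG", "GTG", "TTG"}


def first_codon(cds):
    """Return the first three bases of a CDS, upper-cased, with U mapped to T."""
    return cds[:3].upper().replace("U", "T")


def split_by_start_codon(cds_by_gene):
    """Partition genes into those with and without a valid start codon.

    cds_by_gene is a hypothetical dict mapping gene id -> coding sequence,
    assumed to be spliced and already on the coding strand.
    Returns (list of ids with a valid start, dict of id -> offending first codon).
    """
    ok, bad = [], {}
    for gene_id, cds in cds_by_gene.items():
        codon = first_codon(cds)
        if codon in VALID_STARTS:
            ok.append(gene_id)
        else:
            bad[gene_id] = codon
    return ok, bad


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    example = {
        "g1": "ATGGCTTAA",
        "g2": "gtgAAATGA",
        "g3": "CTGGCTTAA",  # first codon is not ATG, GTG or TTG
        "g4": "GC",         # too short to contain a full codon
    }
    ok, bad = split_by_start_codon(example)
    print("genes with a valid start codon:", ok)
    print("first codons of the other genes:", Counter(bad.values()))
```

If most of the offending genes share the same first codon, or their sequences are truncated, that would point to a systematic issue in the protein-to-genome mapping rather than to something that should be patched in the gene structures.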