how to train AUGUSTUS for my study species with its own sequence and other species' protein

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald

Post by katharina »

Originally posted in the old forum by Chih-Ming on 30.10.2012 - 03:53
Hi,
I am using AUGUSTUS to predict genes for a bird's genome. It seems possible to train AUGUSTUS for my study bird by feeding the program this bird's genome sequence and chicken protein sequences (from NCBI), but I do not understand how to do it. I am running AUGUSTUS on the server in our lab. The tutorial only shows how to train AUGUSTUS with one input, for example:
etraining --species=bug genes.gb.train
I am wondering whether I can supply one protein file (in what format?) and one sequence file (in FASTA format?) at the same time when I run the "etraining" executable. If so, what scripts should I use?
Another, less relevant question is about the flanking regions around training genes. The chicken genome (gbk) files downloaded from NCBI include coding regions and introns for each gene. Are the introns or non-coding regions what is meant by flanking regions?

by katharina on 30.10.2012 - 15:57
The short answer is that you need to use autoAug.pl, not just etraining.
Have you installed all dependencies of the autoAug.pl pipeline on your local server? They include, for example, BLAT and Scipio. If so, you can call autoAug.pl with the following parameters:

autoAug.pl --genome=genome.fa --species=yourBird --trainingset=protein.fa -v
This will invoke Scipio to create training genes from your genome and protein file. AUGUSTUS will then be trained on these genes (this step includes etraining), and finally ab initio predictions will be produced.
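Before launching the pipeline, it can help to confirm that the external tools are on the PATH. The following is a minimal shell sketch and not part of AUGUSTUS: the helper names check_deps and run_autoaug and the DRY_RUN variable are made up for illustration, and the executable names blat and scipio.pl are assumptions about how BLAT and Scipio are installed on your server. The autoAug.pl options are exactly the ones shown above.

```shell
#!/bin/sh
# check_deps TOOL...
# Reports every tool that is not found on PATH; returns 1 if any is missing.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# run_autoaug GENOME SPECIES PROTEINS
# Builds the autoAug.pl call from the thread. With DRY_RUN=1 the command
# is only printed, which is handy for checking file names first.
run_autoaug() {
    set -- autoAug.pl --genome="$1" --species="$2" --trainingset="$3" -v
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

A possible call, assuming the file names from the command above, would be: check_deps autoAug.pl blat scipio.pl && run_autoaug genome.fa yourBird protein.fa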
If you should encounter any problems with installing autoAug.pl locally, we generally recommend that you submit your files to our web service at http://bioinf.uni-greifswald.de/webaugustus
By flanking regions, I meant intergenic regions, i.e. part of the genome before the gene start and part of the genome after the gene end. autoAug.pl and our web service will create training gene files with flanking regions automatically.
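To make the idea of flanking regions concrete, here is a small shell sketch that computes the region to cut out for one training gene: the gene itself plus a fixed amount of intergenic sequence on each side, clipped to the ends of the sequence. The function name flank_coords and the example numbers are hypothetical; this is only an illustration of the arithmetic, not how autoAug.pl implements it.

```shell
#!/bin/sh
# flank_coords GENE_START GENE_END FLANK SEQ_LEN
# Prints "start end" (1-based, inclusive) of the gene plus up to FLANK bases
# of flanking DNA on each side, clipped to the sequence boundaries.
flank_coords() {
    start=$(( $1 - $3 ))
    end=$(( $2 + $3 ))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    if [ "$end" -gt "$4" ]; then
        end=$4
    fi
    echo "$start $end"
}
```

If you build a GenBank training file by hand from a GFF file, the AUGUSTUS script gff2gbSmallDNA.pl takes the maximum flanking size as one of its arguments; check its usage message on your installation for the exact argument order.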

by Chih-Ming on 31.10.2012 - 11:51
Hi Katharina,
So autoAug.pl uses the chicken proteins (and maybe my bird's genome sequence) for training, and then performs ab initio prediction on the bird's genome. Will the result really be different from the one I would get by manually training AUGUSTUS on the chicken's gene structures (from NCBI) and then running ab initio prediction on the bird's genome? The two procedures sound the same or very similar.

by katharina on 01.11.2012 - 10:55
That depends on how similar the other bird species is to the chicken.
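For comparison, the manual route described in the question could look roughly like the sketch below: train on chicken gene structures with etraining, then predict ab initio on the bird's genome. The helper names run and train_manual, the DRY_RUN variable, the species name chicken_manual, and the file names are all hypothetical. The sketch assumes the AUGUSTUS tools new_species.pl, etraining, and augustus are installed and that the chicken genes are already in GenBank format with flanking regions.

```shell
#!/bin/sh
# run CMD ARGS...
# Executes a command, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# train_manual SPECIES GENES_GB GENOME
# 1. create an empty parameter set for SPECIES
# 2. train it on the given gene structures
# 3. predict ab initio on GENOME (augustus writes predictions to stdout)
train_manual() {
    species=$1
    genes=$2
    genome=$3
    run new_species.pl --species="$species" &&
    run etraining --species="$species" "$genes" &&
    run augustus --species="$species" "$genome"
}
```

A dry run such as DRY_RUN=1 train_manual chicken_manual chicken_genes.gb bird_genome.fa prints the three commands without executing them. The difference from the autoAug.pl route is where the training genes come from: here they are chicken genes on the chicken genome, whereas autoAug.pl maps the chicken proteins onto the bird's own genome.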

by Anand Rao on 14.01.2013 - 00:20
Extending through phylogenetic neighbor
Hi Katharina,
If there is a concern about the accuracy of prediction because of the phylogenetic distance between the reference species used for training and the target species for which annotation is sought, can the AUGUSTUS training be done incrementally?
What I mean is: could we use intermediate bridge species that span the phylogenetic range between the two species, and perform the training step incrementally for each pair of species?
That is, rather than training species A using annotation from its only available closest relative, the still phylogenetically distant species Z, we would follow the circuitous route of
species Z to species Y, then
species Y to species X, then
...
species C to species B, and then finally
species B to species A.
Are there dangers in performing this sort of iterative and incremental training set creation?
Thanks,
Anand
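The proposed chain can be written down as a list of pairwise training steps. The shell sketch below only plans the steps and prints one command per pair; it does not run anything. The function names chain_steps and plan_chain and the file naming scheme (SPECIES.genome.fa, SPECIES.proteins.fa) are hypothetical, and the printed command reuses the autoAug.pl call given earlier in this thread. How the protein set of each intermediate species is produced from its predictions is left open here.

```shell
#!/bin/sh
# chain_steps Z Y X ... A
# Prints one "source target" pair per line for consecutive species.
chain_steps() {
    prev=$1
    shift
    for next in "$@"; do
        echo "$prev $next"
        prev=$next
    done
}

# plan_chain Z Y X ... A
# Prints the training command for each step: proteins of the source species
# are mapped onto the genome of the target species.
plan_chain() {
    chain_steps "$@" | while read -r src dst; do
        echo "autoAug.pl --genome=$dst.genome.fa --species=$dst --trainingset=$src.proteins.fa -v"
    done
}
```

For example, plan_chain Z Y X prints two commands, one for Z to Y and one for Y to X. Printing the plan first makes it easy to see how many intermediate gene sets would have to be checked for quality.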

by katharina on 14.01.2013 - 10:57
Hi Anand,
I personally haven't tried your approach.
If I were to try it, I would be extra careful to obtain accurate gene sets for the intermediate species. That is a lot of work: it involves integrating as much extrinsic evidence as possible, and manually inspecting gene examples in a browser in context with the extrinsic evidence. The extrinsic evidence is needed to reduce the bias towards genes that AUGUSTUS can typically predict well ab initio, which is one of the problems that I see in your suggested method.
I think that one pitfall might be the quality of the genome assemblies for the intermediate species. If one of them has a poor assembly, you might lose certain types of proteins for further mappings. In that case, you could consider combining the final training gene set with genes from a different source, e.g. from the CEGMA core proteome.
Another idea would be to use a different tool for mapping proteins to a genome. You don't necessarily have to use Scipio, which indeed is rather limited in producing good gene structures for remotely related protein/genome combinations. (I have no idea what other tool produces good training gene structures, though. You might need to program something yourself in order to convert alignment data to gene structures.)