Reusing hints files with different isolates

Posted: Thu Nov 19, 2015 7:28 pm
by katharina
Originally posted in the old forum by Jorvis on 17.01.2014 - 07:15
I'm currently working on the structural annotation of 40 genomes, many of which are different isolates of the same species. In a group of 5 isolates of the same species, for example, I have RNA-Seq data for one of them and followed the protocol to generate an intron hints file from those data. Because this hints file has molecule identifiers and coordinates specific to that one organism (and the RNA-Seq integration is a long process), is there a way to build any of the information from the hints file back into the species configuration files (via e-training or optimization steps) AFTER generating the hints file, so that other runs of augustus using the same species parameters would benefit from it?

Re: Reusing hints files with different isolates

Posted: Thu Nov 19, 2015 7:29 pm
by katharina
by Mario on 21.01.2014 - 11:15
I am not sure if I understand the question.

If a set of genome assemblies is from the same species, you only need one parameter set for all of these genomes. As a rule of thumb, it is OK to use a cross-trained AUGUSTUS (trained on A, predicting on B) when genome assemblies A and B are still alignable. That is usually the case far, far beyond the species boundary.

For the prediction part, when you have hints hints.A_1.gff for genome assembly A_1 and want an annotation for A_1, A_2, ..., A_n, I can think of these options:

1) Map hints.A_1.gff to the other n-1 genomes using pairwise alignments of A_1 with A_2, ..., A_1 with A_n. I recommend the UCSC liftOver tool. This gets you n-1 files hints.A_1.mapped.to.A_i.gff (for i=2,...,n). Then run augustus on each genome individually with its hints file.

2) Try the new comparative augustus (see AUGUSTUS-cgp.txt). For this, however, you need to make a multiple alignment of the genomes. Suggestion: the Cactus aligner by Benedict Paten. The cgp functionality is still under development and the code has not yet been tested much. In the long run I expect this to work best.

3) Predict the gene structures in A_1 with the hints. Then map the gene structures to the other n-1 genomes. We have also successfully used TransMap (by Mark Diekhans, UCSC) for this. This gives you hints in the other genomes that are then used with Augustus.

4) For completeness, the option that you want to avoid: realigning the reads to the other genomes.
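
To make option 1 more concrete, here is a minimal Python sketch that only assembles the command lines for lifting hints.A_1.gff over to each other assembly and then running augustus with the mapped hints. It is an illustration under assumptions, not a tested pipeline: the chain file names (A_1.to.A_i.over.chain), the genome file names (A_i.fa), the unmapped-output names, the species name and the extrinsic.cfg path are all hypothetical placeholders, and the chain files would first have to be built from the pairwise alignments. Note also that liftOver with -gff converts each GFF line separately, so the mapped hints should be checked afterwards.

```python
import shlex


def build_commands(species, source, targets, extrinsic_cfg="extrinsic.cfg"):
    """Return one (liftOver argv, augustus argv) pair per target assembly.

    Nothing is executed here; the caller decides how to run the commands.
    All file names below are hypothetical naming conventions.
    """
    jobs = []
    for target in targets:
        # Output name follows the hints.A_1.mapped.to.A_i.gff pattern from option 1.
        mapped = f"hints.{source}.mapped.to.{target}.gff"
        lift = [
            "liftOver", "-gff",                        # -gff: input is GFF/GTF
            f"hints.{source}.gff",                     # hints on the source assembly
            f"{source}.to.{target}.over.chain",        # assumed chain file name
            mapped,                                    # successfully mapped hints
            f"hints.{source}.unmapped.{target}.gff",   # hints that could not be mapped
        ]
        predict = [
            "augustus",
            f"--species={species}",                    # one parameter set for all isolates
            f"--hintsfile={mapped}",
            f"--extrinsicCfgFile={extrinsic_cfg}",
            f"{target}.fa",                            # assumed genome FASTA name
        ]
        jobs.append((lift, predict))
    return jobs


if __name__ == "__main__":
    for lift, predict in build_commands("my_species", "A_1", ["A_2", "A_3"]):
        print(shlex.join(lift))
        print(shlex.join(predict))
```

Printing the commands instead of running them makes it easy to inspect them or hand them to a cluster scheduler; the same species parameter set is reused for every isolate, as suggested above.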

Re: Reusing hints files with different isolates

Posted: Thu Nov 19, 2015 7:29 pm
by katharina
by Jorvis on 22.01.2014 - 00:01
Thank you Mario, those are the sorts of options I was looking for.