Reusing hints files with different isolates

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Reusing hints files with different isolates

Post by katharina »

Originally posted in the old forum by Jorvis on 17.01.2014 - 07:15
I'm currently working on structural annotation of 40 genomes, many of which are different isolates of the same species. In a group of 5 isolates of the same species, for example, I have RNA-Seq data for one of them and followed the protocol to generate an intron hints file from those data. This hints file has molecule identifiers and coordinates specific to that one organism, and the RNA-Seq integration is a long process. Is there a way to build any of the information from the hints file back into the species configuration files (via etraining or optimization steps) AFTER generating the hints file, so that other runs of augustus using the same species parameters would benefit from it?

Re: Reusing hints files with different isolates

Post by katharina »

by Mario on 21.01.2014 - 11:15
I am not sure if I understand the question.

If a set of genome assemblies is from the same species, you only need one parameter set for all of these genomes. As a rule of thumb, it is OK to use a cross-trained AUGUSTUS (trained on A but used for prediction on B) when genome assemblies A and B are still alignable. That is usually the case far, far beyond the species boundary.

For the prediction part, when you have a hints file hints.A_1.gff for genome assembly A_1 and want an annotation for A_1, A_2, ..., A_n, I can think of these options:

1) Map hints.A_1.gff to the other n-1 genomes using pairwise alignments of A_1 with A_2, ..., A_1 with A_n. I recommend the UCSC liftOver tool. This gets you n-1 files hints.A_1.mapped.to.A_i.gff (for i=2,...,n). Then run augustus on each genome individually with its hints file.

2) Try the new comparative augustus (see AUGUSTUS-cgp.txt). For this, however, you need to make a multiple alignment of the genomes. Suggestion: the Cactus aligner by Benedict Paten. The cgp functionality is still under development and this code is not yet well tested. In the long run I expect this to work best.

3) Predict the gene structures in A_1 with the hints. Then map the gene structures to the other n-1 genomes. We have also successfully used TransMap (by Mark Diekhans, UCSC) for this. This gives you hints in the other genomes that are then used with Augustus.

4) For completeness, the option that you want to avoid: realigning the reads to the other genomes.
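
To make option 1 more concrete, here is a minimal Python sketch of the hint-mapping step. It is an illustration under assumptions, not a tested pipeline: it assumes you have already built a liftOver chain file from the pairwise alignment of A_1 with A_i, that `liftOver` and `augustus` are on your PATH, and that your hints are tab-separated GFF lines. All file names, the `hintN` labels, and the helper function names are hypothetical. The hints are converted to BED because GFF coordinates are 1-based and inclusive while BED coordinates are 0-based and half-open.

```python
import subprocess


def gff_hint_to_bed(gff_line, idx):
    """Convert one GFF hint line (1-based, inclusive) to BED6 (0-based, half-open)."""
    f = gff_line.rstrip("\n").split("\t")
    # BED6 needs "+" or "-"; an unstranded "." hint is lifted as "+" and restored later.
    strand = f[6] if f[6] in ("+", "-") else "+"
    return "\t".join([f[0], str(int(f[3]) - 1), f[4], f"hint{idx}", "0", strand])


def bed_to_gff_hint(bed_line, original_gff_line):
    """Write lifted BED coordinates back into the columns of the original hint."""
    b = bed_line.rstrip("\n").split("\t")
    f = original_gff_line.rstrip("\n").split("\t")
    f[0], f[3], f[4] = b[0], str(int(b[1]) + 1), b[2]
    if f[6] in ("+", "-"):
        f[6] = b[5]  # the strand may flip if the aligned block is inverted
    return "\t".join(f)


def liftover_cmd(bed_in, chain, bed_out, unmapped):
    """Argument list for: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", bed_in, chain, bed_out, unmapped]


def augustus_cmd(species, hints_gff, extrinsic_cfg, genome_fa):
    """Argument list for one augustus run that uses a hints file."""
    return ["augustus", f"--species={species}", f"--hintsfile={hints_gff}",
            f"--extrinsicCfgFile={extrinsic_cfg}", genome_fa]


def map_hints(hints_gff, chain, out_gff, prefix="hints.tmp"):
    """Lift all hints from A_1 to A_i; hints that do not map are left out."""
    with open(hints_gff) as fh:
        hints = [l for l in fh if l.strip() and not l.startswith("#")]
    bed_in, bed_out, unmapped = (f"{prefix}.{s}.bed" for s in ("in", "out", "unmapped"))
    with open(bed_in, "w") as out:
        for i, line in enumerate(hints):
            out.write(gff_hint_to_bed(line, i) + "\n")
    subprocess.run(liftover_cmd(bed_in, chain, bed_out, unmapped), check=True)
    with open(bed_out) as fh, open(out_gff, "w") as out:
        for line in fh:
            idx = int(line.split("\t")[3][len("hint"):])  # recover the original hint
            out.write(bed_to_gff_hint(line, hints[idx]) + "\n")
```

You would call `map_hints("hints.A_1.gff", "A_1.to.A_2.chain", "hints.A_1.mapped.to.A_2.gff")` once per target genome and then run the command from `augustus_cmd` on that genome. Check the unmapped file afterwards: liftOver drops intervals it cannot place, so some hints may be lost near rearrangements or gaps in the alignment.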

Re: Reusing hints files with different isolates

Post by katharina »

by Jorvis on 22.01.2014 - 00:01
Thank you Mario, those are the sorts of options I was looking for.