Gene prediction on resequenced genomes. Finding back previous information.

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Gene prediction on resequenced genomes. Finding back previous information.

Post by katharina »

Originally posted in the old forum by EJ Blom on 18.02.2013 - 16:23
Dear augustus-users,
We are currently resequencing genomes for which a good reference genome and annotation are available. Since a good gene prediction has already been performed for the reference genome (a training set is already available for Augustus), redoing this from scratch for all resequenced lines seems inefficient. Ideally, we would like to run a gene prediction that takes the information from the reference gene prediction into account. With this approach, one would also obtain information about gene instances that have lost exons (due to a mutation in an intron/exon site).
I thought of the following approach:
- map the predicted proteins of the reference genome to the resequenced genome using exonerate (described in this tutorial: http://bioinf.uni-greifswald.de/bioinf/ ... teProteins)
- map the cDNA sequences of the reference genome to the resequenced genome (described in this tutorial: http://bioinf.uni-greifswald.de/bioinf/ ... porateESTs)
I would then use the results from both approaches as hints in my gene prediction.
Would this "force" Augustus to recover the gene instances that I expect, as well as additional genes and genes that differ due to mutations at the exon/intron boundaries?
Any thoughts/ideas/suggestions?
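For readers following the plan above: AUGUSTUS reads hints as tab-separated GFF lines whose last column carries attributes such as `src`, `grp` and `pri`. The sketch below only illustrates how protein-derived and cDNA-derived hints could sit side by side in one hints file. The coordinates, the `example` source column, the group name `cdna_0` and the helper `hint_line` are hypothetical; the `src` values `P` and `E` are assumed labels and would have to match the sources declared in the extrinsic configuration file used for the run. The sequence and protein names are borrowed from the BLAST output quoted later in this thread.

```python
def hint_line(seqname, feature, start, end, strand, src, grp, pri=4):
    """Format one hint as a 9-column, tab-separated GFF line."""
    attributes = f"grp={grp};pri={pri};src={src}"
    # Columns: seqname, source, feature, start, end, score, strand, frame, attributes
    cols = [seqname, "example", feature, str(start), str(end), "0", strand, ".", attributes]
    return "\t".join(cols)

# Hypothetical coordinates, for illustration only.
protein_hints = [hint_line("SL2.40ch00", "CDSpart", 1200, 1350, "+", "P", "Sly0")]
cdna_hints = [hint_line("SL2.40ch00", "intron", 1351, 1480, "+", "E", "cdna_0")]

# Hints from both mapping steps are simply collected into one list (one file).
merged = protein_hints + cdna_hints
print("\n".join(merged))
```

In practice the hint lines would come from the conversion scripts described in the linked tutorials rather than being written by hand.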

Re: Gene prediction on resequenced genomes. Finding back previous information.

Post by katharina »

by katharina on 19.02.2013 - 11:28
Your approach sounds reasonable.
If the genomes are from the same species, or very closely related, you could consider using Scipio instead of exonerate to generate hints (see http://bioinf.uni-greifswald.de/bioinf/ ... putAsHints).
Transmap might also be an interesting option for mapping the existing annotation onto novel assemblies (see e.g. http://augustus.gobics.de/binaries/scri ... p2hints.pl). This has the advantage that you could also map UTR annotations to the new assemblies.
Provided that the hints are valid (i.e. possible), AUGUSTUS will try to predict genes on the basis of those hints, yes.
Katharina

Re: Gene prediction on resequenced genomes. Finding back previous information.

Post by katharina »

by EJ Blom on 20.02.2013 - 12:12
Thanks. I am running into some problems with the script that is described on http://bioinf.uni-greifswald.de/bioinf/ ... teProteins
This line:
cat tblastn.matches | perl -e 'while(<>){split; if ($q eq $_[0]){$t .= "\t$_[1]"} else {print "$q$t\n"; $t="\t$_[1]";$q=$_[0];}} print "$q$t\n";' > tblastn.matchlists
The tblastn.matches file with the filtered BLAST hits is readable and is about 1.2 GB in size.
The tblastn.matchlists file that is created by the one-liner is about 30 MB in size but doesn't appear to contain any data.
I ran the split command on that file anyway, and it produced a single file x00 of the same size (about 30 MB). Next I started runExonerate.pl, but exonerate isn't running yet; the perl script has been running for over 10 minutes.
Any ideas what is wrong?

Re: Gene prediction on resequenced genomes. Finding back previous information.

Post by katharina »

by katharina on 20.02.2013 - 16:11
The one-liner that is used for creating the matchlist basically prints pairs of sequence names. Possibly your query and target names contain characters that split the line into more array elements than expected?
It is always good to work with FASTA headers that are short and unique and that contain no whitespace or special characters.
Katharina
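To make the "pairs of sequence names" description concrete, here is a minimal Python sketch of the grouping the one-liner appears to perform, assuming that tblastn.matches holds the query name in the first whitespace-separated column and the target name in the second. It is not a byte-for-byte port of the Perl code (for example, it does not emit a leading empty line), and the function name `build_matchlists` is made up for this illustration. `Sly1` and `SL2.40ch01` are hypothetical names; `Sly0` and `SL2.40ch00` appear in this thread.

```python
from itertools import groupby

def build_matchlists(lines):
    """Group consecutive 'query target' pairs by query name."""
    # Split each line on whitespace and keep the first two fields;
    # lines with fewer than two fields are skipped.
    split_lines = (ln.split() for ln in lines)
    pairs = (fields[:2] for fields in split_lines if len(fields) >= 2)
    rows = []
    # groupby only merges adjacent lines, so the input must be sorted by query.
    for query, group in groupby(pairs, key=lambda p: p[0]):
        targets = [target for _, target in group]
        rows.append("\t".join([query] + targets))
    return rows

toy = [
    "Sly0 SL2.40ch00",
    "Sly0 SL2.40ch01",
    "Sly1 SL2.40ch00",
]
matchlists = build_matchlists(toy)
for row in matchlists:
    print(row)
```

Each output row is one query followed by all targets it matched, separated by tabs. A header containing whitespace would shift the target out of the second column, which is the failure mode described above.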

Re: Gene prediction on resequenced genomes. Finding back previous information.

Post by katharina »

by EJ Blom on 20.02.2013 - 17:37
Hi Katharina,
I think there is nothing wrong with my ids (my life also consists of having to deal with strange ids ;))
First lines of the BLAST output:
Query= Sly0
Length=467
Subject= SL2.40ch00
Looks OK, doesn't it?

Re: Gene prediction on resequenced genomes. Finding back previous information.

Post by katharina »

by katharina on 21.02.2013 - 10:30
I agree, the names look fine. If I were you, I'd build a small toy dataset and try to find out exactly why the Perl one-liner that we put on our wiki pages does not work for your particular BLAST output.
Katharina
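As a starting point for such a toy check, the sketch below counts how many lines of an output contain any visible characters. A file that consists of one very long, whitespace-only line would fit both reported symptoms: about 30 MB with no apparent data, and a single x00 chunk if the file was split by lines. One possible cause, not confirmed anywhere in this thread: since Perl 5.12, `split` in void context no longer fills `@_`, so `$_[0]` and `$_[1]` in the one-liner would stay empty; assigning the result explicitly (for instance `@f = split;` and then using `$f[0]` and `$f[1]`) would avoid relying on the old behavior. The function name `summarize` and the sample strings are made up for this illustration.

```python
def summarize(text):
    """Return (line_count, lines_with_visible_content) for a block of text."""
    lines = text.splitlines()
    visible = sum(1 for ln in lines if ln.strip())
    return len(lines), visible

# Hypothetical symptom: one long line that consists only of tab characters.
suspicious = "\t" * 1000 + "\n"
# What a healthy matchlist row would look like.
healthy = "Sly0\tSL2.40ch00\n"

print(summarize(suspicious))
print(summarize(healthy))
```

If the real tblastn.matchlists shows one line and zero visible lines, the problem lies in how the one-liner reads its fields and not in the sequence names.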

Re: Gene prediction on resequenced genomes. Finding back previous information.

Post by katharina »

by EJ Blom on 25.02.2013 - 12:16
I went for the easy solution: running the exonerate analysis on the complete genome using the cluster.