advice to run RNAseq with Augustus

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

katharina

Post by katharina »

Originally posted in the old forum by Ken on 21.06.2012 - 06:22
Hello,
I'm about to try out RNAseq with Augustus as detailed on http://bioinf.uni-greifswald.de/bioinf/ ... s.Augustus
I initially thought about using the raw RNAseq reads, but I also have the assembled transcriptome.
I want to try the GSNAP approach, but it wasn't clear to me how to incorporate the RNAseq reads from my samples (I have 9). If I combine all the reads, I end up with 30 GB of PE reads plus 10 GB of orphan reads, and I'm not sure the software can cope (on top of the fact that I only have 32 GB of RAM).
- Should I run each RNAseq sample separately, and if so, is it better to 'combine' the results at some stage to generate 'better' hints than the individual runs give? How would I do this?
- If I do run all reads in one go, how long would this take with 8 cores and 32 GB RAM (assuming that is enough)? The reads are around 83 nt long.
Thanks in advance,
Ken

Re: advice to run RNAseq with Augustus

Post by katharina »

by katharina on 25.06.2012 - 11:40
Hi Ken,
there are a couple of points to be considered:
a) Creating a training gene set for training AUGUSTUS for a novel species. In this case, assembled transcripts might be helpful (although that does not always work well).
b) Predicting genes with AUGUSTUS using extrinsic evidence, e.g. RNASeq data. I assume in the following that you are asking for advice on this point.
In general, you lose information when you use assembled transcripts instead of raw reads to support gene structure predictions. One reason is that assembled transcripts are not always correct. Another is that AUGUSTUS uses the coverage information (last column of the gff hints file), which is not available from assembled transcripts. Last but not least, many reads may not be incorporated into the assembled transcripts but may nevertheless support certain gene structures (e.g. lowly expressed genes). I therefore recommend that you use the raw reads, at least to create intron hints.
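To illustrate what "coverage information" means here, the following is only a rough Python sketch and not part of the AUGUSTUS pipeline. It assumes that spliced read alignments have already been reduced to (sequence, start, end) intron tuples, counts how often each intron is seen, and writes that count as mult= into the last column of a GFF-style hint line. The function name introns_to_hints, the source label b2h and the pri=4;src=E values are assumptions chosen for the example; check them against the hints files your own run produces.

```python
from collections import Counter


def introns_to_hints(read_introns, source="b2h", priority=4, src="E"):
    """Turn per-read intron observations into GFF-style intron hint lines.

    read_introns: iterable of (seqname, start, end) tuples, one per spliced read.
    The number of reads supporting an intron is stored as mult= in column 9.
    All default values are illustrative assumptions.
    """
    counts = Counter(read_introns)
    lines = []
    for (seqname, start, end), mult in sorted(counts.items()):
        attributes = f"mult={mult};pri={priority};src={src}"
        lines.append("\t".join([
            seqname, source, "intron", str(start), str(end),
            "0", ".", ".", attributes,
        ]))
    return lines


# Made-up observations: three reads support the first intron, one the second.
observed = [("chr1", 100, 200)] * 3 + [("chr1", 500, 900)]
hints = introns_to_hints(observed)
for line in hints:
    print(line)
```

An assembled transcript would report each of these introns once, so the difference between three supporting reads and one would be lost.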
You could additionally integrate assembled transcripts as described at http://bioinf.uni-greifswald.de/bioinf/ ... porateESTs .
Concerning memory requirements: you'll need to try it yourself, but I think you'll be fine with 32 GB RAM. I don't have numbers at hand, but I run our pipeline for similar datasets on a 32 GB RAM machine without problems.
Concerning run time, you might want to run everything up to and including step 6 in the GSNAP wiki separately for each of the 9 samples. You can run those 9 sample tasks in parallel. Then pool all the hints that were created in step 6, run AUGUSTUS, and proceed through step 7 on the pooled data. From step 8 on, run the 9 samples in parallel again, and finally pool again at step 10. You might need an additional step of sorting the BAM files before intron hint creation.
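The pooling of per-sample hints could look roughly like the Python sketch below. It is an illustration under assumptions, not the wiki's procedure: it assumes 9-column, tab-separated GFF hint lines with an optional mult= attribute in the last column, and it merges hints that are identical across samples by summing their multiplicities. The function names pool_hints and parse_attributes and the example lines are made up for this sketch. The AUGUSTUS distribution ships its own script for joining multiple hints (join_mult_hints.pl), which is what you would normally use.

```python
from collections import defaultdict


def parse_attributes(column):
    """Split an attribute column such as 'mult=3;pri=4;src=E' into a dict."""
    attrs = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def pool_hints(per_sample_hints):
    """Merge identical hints from several samples, summing their mult values.

    per_sample_hints: iterable of iterables of hint lines (one per sample).
    A hint without mult= is counted once.
    """
    pooled = defaultdict(int)
    rest = {}  # score and remaining attributes, taken from the first occurrence
    for lines in per_sample_hints:
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9:
                continue  # skip malformed lines in this sketch
            attrs = parse_attributes(cols[8])
            mult = int(attrs.pop("mult", 1))
            key = (cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[6], cols[7])
            pooled[key] += mult
            rest.setdefault(key, (cols[5], attrs))
    out = []
    # Sort by sequence name, then start, end and feature type.
    for key in sorted(pooled, key=lambda k: (k[0], k[3], k[4], k[2])):
        seqname, source, feature, start, end, strand, frame = key
        score, attrs = rest[key]
        attr_text = ";".join([f"mult={pooled[key]}"] +
                             [f"{k}={v}" for k, v in attrs.items()])
        out.append("\t".join([seqname, source, feature, str(start), str(end),
                              score, strand, frame, attr_text]))
    return out


# Made-up hint lines from two samples.
sample_a = ["chr1\tb2h\tintron\t100\t200\t0\t+\t.\tmult=3;pri=4;src=E\n"]
sample_b = ["chr1\tb2h\tintron\t100\t200\t0\t+\t.\tmult=2;pri=4;src=E\n",
            "chr1\tb2h\tintron\t300\t400\t0\t-\t.\tpri=4;src=E\n"]
pooled_hints = pool_hints([sample_a, sample_b])
for line in pooled_hints:
    print(line)
```

Summing the multiplicities means an intron seen in several samples carries more weight in the pooled hints file than it would in any single-sample run.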
I cannot give you an accurate time estimate because I usually split data sets and run different steps on different machines. I think that it will take several days, though.