run time with extrinsic data

Posted: Thu Nov 19, 2015 5:27 pm
by katharina
Originally posted in the old forum by Marcin on 19.08.2013 - 16:47

I am trying to use Augustus with RNA-Seq data. I normalized/cleaned my RNA-Seq reads with the Trinity In silico Read Normalization protocol and prepared the evidence file according to http://bioinf.uni-greifswald.de/augustu ... naseq.html
With these extrinsic data, Augustus slowed down to the point that it would take months to complete (on 24 cores). Is this normal, or did I mess up the RNA-Seq evidence data? My RNA-Seq hints gff3 file is 390 MB and has 6.8 million lines. The genome is eukaryotic, about 300 Mbases in 7,000 scaffolds.
I would appreciate any input.

Re: run time with extrinsic data

Posted: Thu Nov 19, 2015 5:27 pm
by katharina
by katharina on 21.08.2013 - 15:39
I recommend that you split the genome AND the hints file, so that each hints file contains only the hints for its target sequence. You can find some instructions at http://bioinf.uni-greifswald.de/bioinf/ ... rallelPred (splitting the hints file is not described there, but it should be possible with a simple grep command).
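For readers who prefer a script over grep, here is a minimal sketch of the hints-splitting step. It assumes the hints file is tab-separated GFF with the sequence name in the first column and that it fits in memory. The function names and the `<sequence>.hints.gff` output naming are made up for illustration and are not part of AUGUSTUS. Matching the first column exactly also avoids a pitfall of a plain substring grep, where a pattern like scaffold1 would also pick up scaffold10.

```python
from collections import defaultdict
from pathlib import Path


def split_hints_by_sequence(lines):
    """Group GFF hint lines by the sequence name in column 1."""
    per_seq = defaultdict(list)
    for line in lines:
        # Skip blank lines and GFF comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        seq_name = line.split("\t", 1)[0]
        per_seq[seq_name].append(line if line.endswith("\n") else line + "\n")
    return dict(per_seq)


def write_hint_files(groups, out_dir):
    """Write one hints file per sequence (hypothetical naming scheme)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for seq_name, hint_lines in groups.items():
        # Assumes sequence names are safe to use as file names.
        (out / f"{seq_name}.hints.gff").write_text("".join(hint_lines))
```

A possible use would be to open the hints file, pass the file object to `split_hints_by_sequence`, and hand the result to `write_hint_files`. Each AUGUSTUS job would then be given the hints file that matches its scaffold.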
It is a commonly observed phenomenon that AUGUSTUS gets very slow when a huge number of hints is used.
Katharina

Re: run time with extrinsic data

Posted: Thu Nov 19, 2015 5:28 pm
by katharina
by marcin on 21.08.2013 - 23:27
Thanks a lot, Katharina. I am splitting my jobs among 24 cores on SGE. My first time estimate was too pessimistic. After a longer sample run, I estimate that Augustus will take approx. 1200 processor hours. Still a lot, but manageable.
I am also thinking about limiting the number of hints by:
1: Further normalization of the RNA-Seq reads with 'normalize_by_kmer_coverage.pl' from Trinity. Currently I use --max_cov = 30 (as used for transcriptome assembly), but maybe lowering it to something like 10 would speed up Augustus without ill effects.
2: Using a splice junction detector such as TrueSight and using its output as intron hints instead of raw RNA-Seq reads.
Would these approaches make sense?
Marcin

Re: run time with extrinsic data

Posted: Thu Nov 19, 2015 5:28 pm
by katharina
by Katharina on 22.08.2013 - 11:05
We have never done this, so I have no idea.
If you split the genome into even smaller chunks, AUGUSTUS will run faster. I split the human genome into ~1500 chunks to finish an AUGUSTUS run with many RNA-Seq hints on a cluster with 176 CPUs within one day.
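For a genome that comes as thousands of scaffolds, one way to form such chunks is to group whole scaffolds into jobs of roughly equal total length. The sketch below is only an illustration of that idea, not the procedure used for the human genome run above. It uses a simple greedy heuristic (longest scaffold first, always into the currently smallest chunk), and the function name and the example lengths are made up.

```python
import heapq


def pack_scaffolds(lengths, n_chunks):
    """Greedily assign scaffolds to n_chunks groups of similar total length.

    lengths: dict mapping scaffold name -> length in bases.
    Returns a list of n_chunks lists of scaffold names.
    """
    # Min-heap of (total bases so far, chunk index).
    heap = [(0, i) for i in range(n_chunks)]
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]
    # Longest scaffolds first; ties broken by name for reproducibility.
    for name, length in sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        chunks[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return chunks


# Made-up example lengths.
example = {"s1": 50, "s2": 30, "s3": 20, "s4": 10}
print(pack_scaffolds(example, 2))
```

Each resulting group could then be written to its own FASTA file, together with the matching hints, and submitted as a separate job. Using more chunks than cores helps keep all cores busy, since no single job dominates the run time.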