run time with extrinsic data

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald

run time with extrinsic data

Post by katharina »

Originally posted in the old forum by Marcin on 19.08.2013 - 16:47

I am trying to use Augustus with RNA-Seq data. I normalized/cleaned my RNA-Seq reads with the Trinity in silico read normalization protocol and prepared the evidence file according to http://bioinf.uni-greifswald.de/augustu ... naseq.html
With these extrinsic data, Augustus slowed down to the point that it would take months to complete (on 24 cores). Is this normal, or did I mess up the RNA-Seq evidence data? My RNA-Seq hints gff3 file is 390 MB and has 6.8 million lines. The genome is eukaryotic, 300 Mbases in 7,000 scaffolds.
I would appreciate any input.

Re: run time with extrinsic data

Post by katharina »

by katharina on 21.08.2013 - 15:39
I recommend that you split the genome AND the hints file (so that each hints file contains only the hints for its target sequence). You will find some instructions at http://bioinf.uni-greifswald.de/bioinf/ ... rallelPred (splitting the hints file is not described there, but should be possible with a simple grep command).
It is a commonly observed phenomenon that AUGUSTUS gets very slow when a huge number of hints is used.
Katharina
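
Editor's sketch of the hints-splitting step suggested above, written in Python rather than with grep. It assumes the hints file is tab-separated GFF with the sequence name in the first column, and that sequence names are safe to use in file names. The function name `split_hints_by_sequence` and the `.hints.gff` suffix are made up for illustration; they are not part of AUGUSTUS.

```python
import os
from collections import defaultdict


def split_hints_by_sequence(hints_path, out_dir):
    """Write one hints file per sequence name (GFF column 1).

    Hypothetical helper, not part of AUGUSTUS. Lines are collected in
    memory first so that we never hold thousands of open file handles
    (one per scaffold) at the same time.
    """
    os.makedirs(out_dir, exist_ok=True)
    buckets = defaultdict(list)
    with open(hints_path) as fh:
        for line in fh:
            # Skip blank lines and GFF comment/directive lines.
            if not line.strip() or line.startswith("#"):
                continue
            seq_name = line.split("\t", 1)[0]
            buckets[seq_name].append(line)

    paths = {}
    for seq_name, lines in buckets.items():
        # Assumption: seq_name contains no path separators.
        path = os.path.join(out_dir, f"{seq_name}.hints.gff")
        with open(path, "w") as out:
            out.writelines(lines)
        paths[seq_name] = path
    return paths
```

Each per-scaffold job can then be given only the hints file that belongs to its own scaffold instead of the full 6.8 million line file.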

Re: run time with extrinsic data

Post by katharina »

by marcin on 21.08.2013 - 23:27
Thanks a lot, Katharina. I am splitting my jobs across 24 cores under SGE. My first time estimate was too pessimistic. After a longer sample run, I estimate that Augustus will take approx. 1200 processor hours. Still a lot, but manageable.
I am also thinking about limiting the number of hints by:
1: Further normalizing the RNA-Seq reads with 'normalize_by_kmer_coverage.pl' from Trinity. Currently I use --max_cov = 30 (as used for transcriptome assembly), but maybe lowering it to something like 10 would speed up Augustus without ill effects.
2: Using a splice junction detector, e.g. TrueSight, and using its output as intron hints instead of raw RNA-Seq reads.
Would these approaches make sense?
Marcin

Re: run time with extrinsic data

Post by katharina »

by Katharina on 22.08.2013 - 11:05
We have never done this, so I have no idea.
If you split the genome into even smaller chunks, AUGUSTUS will run faster. I split the human genome into ~1500 chunks to finish an AUGUSTUS run with many RNA-Seq hints on a cluster with 176 CPUs within one day.
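
Editor's sketch of the chunking idea above: computing overlapping coordinate windows for one scaffold, so that each window can become its own job. The function name `chunk_intervals` is hypothetical, and the chunk size and overlap are assumptions to tune, not values from this thread. For scale, ~1500 chunks of a roughly 3 Gbase genome comes to about 2 Mbases per chunk. How the windows are passed to AUGUSTUS is left open here; as far as I know its --predictionStart and --predictionEnd options serve this purpose, but check the documentation for your version.

```python
def chunk_intervals(seq_len, chunk_size, overlap):
    """Return 1-based inclusive (start, end) windows covering a sequence.

    Hypothetical helper. Consecutive windows share `overlap` bases so that
    a gene lying across a chunk border is fully contained in at least one
    window, provided the gene is shorter than the overlap.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if seq_len <= 0:
        return []
    intervals = []
    start = 1
    while True:
        end = min(start + chunk_size - 1, seq_len)
        intervals.append((start, end))
        if end == seq_len:
            break
        # Step back by `overlap` bases for the next window.
        start = end - overlap + 1
    return intervals


if __name__ == "__main__":
    # Example: a 5 Mbase scaffold, 2 Mbase chunks, 500 kbase overlap.
    for start, end in chunk_intervals(5_000_000, 2_000_000, 500_000):
        print(start, end)
```

Predictions from overlapping windows contain duplicates in the shared regions, so they have to be merged after all jobs have finished.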