Augustus /
ParallelPredRun AUGUSTUS predictions parallel The following instructions describe how to break down a gene prediction problem for a large genome into small jobs that can be run on a compute server or cluster via the Sun Grid Engine. Genome file and chr.lst preparation At first, you should split the genome file into single fasta entry files: splitMfasta.pl genome.fa --outputpath=outputPathPrefix Then you can rename the split single fasta files according to their fasta names: for f in genome.split.*; do NAME=`grep ">" $f`; mv $f ${NAME#>}.fa; done In addition, you need to run obtain the length of each fasta entry. You can do this in many different ways, but the script summarizeACGTcontent.pl genome.fa > summary.out may be helpful for this. It produces an output similar to this: 15072423 bases. chrI BASE COUNT 4835938 a 2695881 c 2692152 g 4848452 t 15279345 bases. chrII BASE COUNT 4878197 a 2769219 c 2762213 g 4869716 t 13783700 bases. chrIII BASE COUNT 4444651 a 2449148 c 2466334 g 4423567 t 17493793 bases. chrIV BASE COUNT 5711039 a 3034770 c 3017014 g 5730970 t 13794 bases. chrM BASE COUNT 4335 a 1225 c 2055 g 6179 t 20924149 bases. chrV BASE COUNT 6750394 a 3712061 c 3701398 g 6760296 t 17718866 bases. chrX BASE COUNT 5747199 a 3119708 c 3117874 g 5734085 t summary: BASE COUNT 32371753 a 17782012 c 17759040 g 32373265 t total 100286070bp gc: 0.35439669736784% You need the fasta name and the length, so e.g. grep all lines with the keyword base. In addition, you'll need to specify the path to the single fasta entry files, and the hints file for your AUGUSTUS predictions. Let's assume that the path to your single fasta entry genome files is grep "bases" summary.out | perl -pe 's/^(\d+) bases.\s+(\S+) \ BASE.*/$outputPathPrefix/$2\.fa\t$anotherPath\/hints.gff\t1\t$1/;' > chr.lst The resulting file $outputPathPrefix/chrI.fa $anotherPath/hints.gff 1 15072423 $outputPathPrefix/chrII.fa $anotherPath/hints.gff 1 15279345 $outputPathPrefix/chrIII.fa $anotherPath/hints.gff 1 13783700 $outputPathPrefix/chrIV.fa $anotherPath/hints.gff 1 17493793 $outputPathPrefix/chrM.fa $anotherPath/hints.gff 1 13794 $outputPathPrefix/chrV.fa $anotherPath/hints.gff 1 20924149 $outputPathPrefix/chrX.fa $anotherPath/hints.gff 1 17718866 You omit the second column from $outputPathPrefix/chrI.fa 1 15072423 $outputPathPrefix/chrII.fa 1 15279345 $outputPathPrefix/chrIII.fa 1 13783700 $outputPathPrefix/chrIV.fa 1 17493793 $outputPathPrefix/chrM.fa 1 13794 $outputPathPrefix/chrV.fa 1 20924149 $outputPathPrefix/chrX.fa 1 17718866 Creating the AUGUSTUS jobs Let $augDir be the directory to which you want to write your temporary gene predictions for genome chunks; and let $augCall be the AUGUSTUS command (e.g.augustus --UTR=on --print_utr=on --AUGUSTUS_CONFIG_PATH=somePath/config --exonnames=on --codingseq=on --species=yourspecies --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg --alternatives-from-evidence=true). Furthermore, it can be helful for job monitoring to assign a job script prefix, e.g. myPrefix_ createAugustusJoblist.pl --sequences=chr.lst --wrap="#" --overlap=5000 --chunksize=1252500 \ --outputdir=$augDir/ --joblist=jobs.lst --jobprefix=myPrefix_ --command "$augCall" This creates a number of AUGUSTUS job scripts for sequence segments. The segments overlap, here by 5000 bp. This overlap length should be set at least in the order of the length of a long gene for that species. Otherwise it may happen that a gene is not completely contained in any of the (overlapping) segments. The job scripts are named myPrefix_$i (where $i is the i-th job). Hints file and genome file are added to the AUGUSTUS job from chr.lst. Submitting the AUGUSTUS jobs for ((i=nJobs; i>0; i--)); do qsub -q flex -cwd myPrefix_$i; done If you have a very large number of jobs, you should consider to embed them in an array, instead (array.sh): #!/bin/bash #$ -S /bin/bash #$ -cwd #$ -m e #$ -q flex #$ -t 1-nJobs myPrefix_$SGE_TASK_ID [Of course you should adapt the name flex to your local queue name!] and submit the array with: qsub array.sh Join results After all jobs have finished, you can join the results: cat $augDir/*gff | join_aug_pred.pl > aug.gff Besides concatenating all predictions, this command resolves the redundancy of potential conflicts in the overlapping parts and renumbers the genes sequentially. The ordering is by sequence name and then by position within the sequence. If you do things like No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact: augustus-web@uni-greifswald.de |