Bioinformatics Greifswald | Augustus / ParallelPred

Run AUGUSTUS predictions parallel

The following instructions describe how to break down a gene prediction problem for a large genome into small jobs that can be run on a compute server or cluster via the Sun Grid Engine.

Genome file and chr.lst preparation

At first, you should split the genome file into single fasta entry files:

splitMfasta.pl genome.fa --outputpath=outputPathPrefix

Then you can rename the split single fasta files according to their fasta names:

for f in genome.split.*; do NAME=`grep ">" $f`; mv $f ${NAME#>}.fa; done

In addition, you need to run obtain the length of each fasta entry. You can do this in many different ways, but the script

summarizeACGTcontent.pl genome.fa > summary.out

may be helpful for this. It produces an output similar to this:

15072423 bases. chrI BASE COUNT     4835938 a   2695881 c  2692152 g   4848452 t
15279345 bases. chrII BASE COUNT     4878197 a   2769219 c  2762213 g   4869716 t
13783700 bases. chrIII BASE COUNT     4444651 a   2449148 c  2466334 g   4423567 t
17493793 bases. chrIV BASE COUNT     5711039 a   3034770 c  3017014 g   5730970 t
13794 bases.    chrM BASE COUNT     4335 a   1225 c  2055 g   6179 t
20924149 bases. chrV BASE COUNT     6750394 a   3712061 c  3701398 g   6760296 t
17718866 bases. chrX BASE COUNT     5747199 a   3119708 c  3117874 g   5734085 t
summary: BASE COUNT     32371753 a   17782012 c  17759040 g   32373265 t
total 100286070bp
gc: 0.35439669736784%

You need the fasta name and the length, so e.g. grep all lines with the keyword base. In addition, you'll need to specify the path to the single fasta entry files, and the hints file for your AUGUSTUS predictions. Let's assume that the path to your single fasta entry genome files is $outputPathPrefix and that the hints file $anotherPath/hints.gff (in this example you did not further split the hints file for the separate genome entries, but you may as well do that in addition to reduce network traffic and runtime):

grep "bases" summary.out | perl -pe 's/^(\d+) bases.\s+(\S+) \
  BASE.*/$outputPathPrefix/$2\.fa\t$anotherPath\/hints.gff\t1\t$1/;' > chr.lst

The resulting file chr.lst should look like this:

$outputPathPrefix/chrI.fa      $anotherPath/hints.gff   1 15072423
$outputPathPrefix/chrII.fa     $anotherPath/hints.gff   1       15279345
$outputPathPrefix/chrIII.fa    $anotherPath/hints.gff   1       13783700
$outputPathPrefix/chrIV.fa     $anotherPath/hints.gff   1       17493793
$outputPathPrefix/chrM.fa      $anotherPath/hints.gff   1       13794
$outputPathPrefix/chrV.fa      $anotherPath/hints.gff   1       20924149
$outputPathPrefix/chrX.fa      $anotherPath/hints.gff   1       17718866

You omit the second column from chr.lst if you do not have any hints (ab initio case):

$outputPathPrefix/chrI.fa      1 15072423
$outputPathPrefix/chrII.fa     1       15279345
$outputPathPrefix/chrIII.fa    1       13783700
$outputPathPrefix/chrIV.fa     1       17493793
$outputPathPrefix/chrM.fa      1       13794
$outputPathPrefix/chrV.fa      1       20924149
$outputPathPrefix/chrX.fa      1       17718866

Creating the AUGUSTUS jobs

Let $augDir be the directory to which you want to write your temporary gene predictions for genome chunks; and let $augCall be the AUGUSTUS command (e.g.augustus --UTR=on --print_utr=on --AUGUSTUS_CONFIG_PATH=somePath/config --exonnames=on --codingseq=on --species=yourspecies --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg --alternatives-from-evidence=true). Furthermore, it can be helful for job monitoring to assign a job script prefix, e.g. myPrefix_

createAugustusJoblist.pl --sequences=chr.lst --wrap="#" --overlap=5000 --chunksize=1252500 \
  --outputdir=$augDir/ --joblist=jobs.lst --jobprefix=myPrefix_ --command "$augCall"

This creates a number of AUGUSTUS job scripts for sequence segments. The segments overlap, here by 5000 bp. This overlap length should be set at least in the order of the length of a long gene for that species. Otherwise it may happen that a gene is not completely contained in any of the (overlapping) segments. The job scripts are named myPrefix_$i (where $i is the i-th job). Hints file and genome file are added to the AUGUSTUS job from chr.lst.

Submitting the AUGUSTUS jobs

for ((i=nJobs; i>0; i--)); do qsub -q flex -cwd myPrefix_$i; done

If you have a very large number of jobs, you should consider to embed them in an array, instead (array.sh):

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m e
#$ -q flex
#$ -t 1-nJobs
myPrefix_$SGE_TASK_ID

[Of course you should adapt the name flex to your local queue name!]

and submit the array with:

qsub array.sh

Join results

After all jobs have finished, you can join the results:

cat $augDir/*gff | join_aug_pred.pl > aug.gff

Besides concatenating all predictions, this command resolves the redundancy of potential conflicts in the overlapping parts and renumbers the genes sequentially. The ordering is by sequence name and then by position within the sequence.

If you do things like cat *gff | someCommand.pl > some.gff, and if the input files are large, such commands can cause problems, because in that case, the gff input files are concatenated in an iterative way. As soon as the some.gff output file exists, it will also be concatenated into the input! In case of doubt, use different file endings for intput and output, or store the concatenated results in a temporary file, first.

No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact: augustus-web@uni-greifswald.de