Bioinformatics Greifswald | Augustus / PacBioGMAP

Below I describe the protocol I used to identify new mouse isoforms using single molecule Pacific Bioscience (PacBio) RNA-Seq reads. For clarity these command lines are without parallel execution. However, in reality I split the mouse genome by chromosomes and ran alignments in parallel. The input were circular consensus sequences (ccs), that often constitute near-full length transcripts. The reads were not corrected using other short read data, although I suspect this may improve mapping accuracy. In this example, the input genome is softmasked, i.e. in mouse_genome.fa the nucleotides within repeats are written lower case (a,c,g,t), the other bases in upper case.

PacBio ccs reads aligned with GMAP and AUGUSTUS predictions.

1. Align the reads with GMAP (Thomas Wu)

mkdir gindex
gmap_build -D gindex/ -d mm10 mouse_genome.fa
gmap -D gindex/ -d mm10 pacbio_rnaseq.fa --min-intronlength=30 --intronlength=500000 --trimendexons=20 -f 1 -n 0 > gmap.psl

The option -n 0 to gmap lets it report only the alignment(s) with maximal score, no suboptimal alignments.

2. Make hints for AUGUSTUS from alignments

cat gmap.psl | sort -n -k 16,16 | sort -s -k 14,14 | perl -ne '@f=split; print if ($f[0]>=100)' | blat2hints.pl --source=PB --nomult --ep_cutoff=20 --in=/dev/stdin --out=gmap.pacbio.hints.gff

In this example, I threw away alignments with less than 100 matches, just because the aim was to find new isoforms with some confidence.

When you have multiple libraries, I recommend to align them separately and keep hints separately. You can concatenate these hints later with cat hints.lib1.gff hints.lib2.gff > hints.gff . These hints can be concatenated with hints from other evidence sources, e.g. from Illumina short RNA-Seq reads.

3. Predict genes genome-wide with AUGUSTUS

augustus --species=human --alternatives-from-evidence=1 --UTR=on --extrinsicCfgFile=extrinsic.M.RM.PB.cfg --hintsfile=hints.gff --softmasking=1 mouse_genome.fa  > augustus.gff

The file extrinsic.M.RM.PB.cfg contains parameters for trusting the PacBio (BP) hints and implicit hints from the lower case soft masking. If you have hints from other evidence sources, you will have to create another extrinsic.cfg file with one additional column set (3 columns) for each additional evidence source.

Remarks for parallel execution: If you split the genome into many chunks, for example to run it on a cluster, it is advisable to split the hints.gff file accordingly. It is usually OK, to split the hints by chromosome only, even if you split the chromosomes by themselves into smaller regions. After all augustus runs finished, you can make a single file with consecutive (unique) gene numbering with cat aug.*.gff | join_aug_pred.pl > augustus.gff . The script join_aug_pred.pl is in the augustus scripts folder and also cleans up conflicts if you ran augustus on overlapping 'tiles' of each chromosome.