Recent Changes - Search:





Data Privacy Protection

About PmWiki

edit SideBar

Incorporating Illumina RNAseq into AUGUSTUS with Tophat

This document describes a method for structurally annotating a genome based on deep sequencing of a transcriptome, RNA-Seq. The guide below was adapted from a description of the method we initially developed for and applied in the RNA-Seq based Genome Annotation Assessment Project (rGASP) that can be found at For rGASP, Mario used BLAT. Using Tophat, the general approach is to generate intron hints for Augustus from the RNA-Seq, which can be used together with hints from other sources if available (like from an existing gene models, ESTs, protein or genomic conservation, MS/MS). You may adjust the below procedure for your needs, for example, if you choose to first assemble RNA-Seq reads to contigs, you may just just treat them as other larger transcript fragments such as ESTs and use BLAT/ to generate the hints.

Currently, this documentation only comprises the generation of intronhints from RNAseq data with Bowtie/Tophat. Intronhints are the most important hints that can be generated from RNAseq data. According to our experience, intronhints also produce good gene predictions if not UTR model is available, that means intronhints are suitable for most genome annotation projects. Exonparthints are in many cases a valuable addition if a good UTR model exists. Instructions for the generation of exonparthints from a bam file can be found in our GSNAP tutorial at .

You may incorporate RNAseq data with Tophat/Bowtie in a single-step or in a two-step approach. We call the two-step approach 'iterative mapping'. Single-step means that you map RNAseq data against the genome, create hints from the alignments, predict genes with AUGUSTUS with these genes, and you're done. The optional second step may increase the number of reads aligned to a splice site and may therefore increase gene prediction accuracy. For the second step, you need to generate a exon-exon junction file that you provide to Bowtie as a database. After the second round of alignment, you map the alignments back to genome level, create new hints and run AUGUSTUS with the new hints, again. We recommend the two-step method!

Illustration of iterative mapping
Illustration of iterative mapping

Before using Bowtie/Tophat for predicting genes with AUGUSTUS, consider the following advantages and disadvantages:

  • Compared to BLAT or GSNAP, we observed that Bowtie/Tophat tends to align fewer RNAseq reads, which results in a lower number of intron hints and mildly reduced gene prediction sensitivity. On the other hand, intron hints created with Bowtie/Tophat are very specific.
  • Compared to GSNAP, Tophat is slower.
  • Bowtie/Tophat aligns more reads if they were quality trimmed (e.g. to 5% accepted error rate).
  • The output of Bowtie/Tophat encompasses an entire directory of files rather than one alignment file, which makes data handling a tiny bit more complicated.

If you decide to use Bowtie/Tophat, we strongly recommend that you use Bowtie2/Tophat2 because Tophat2 has an increased sensitivity for intron discovery when compared to Tophat(1).

Software requirements:

This protocol was tested with the following versions:

Please read the installation instructions of each toolkit carefully and follow their advice!

1. Genome preparation

We recommend that you repeat mask the genome prior mapping RNAseq data to the genome. AUGUSTUS should later be run on the unmasked genome, though. (You should rather include nonexonpart hints generated from repeat regions than running AUGUSTUS on the repeat masked genome.)

2. Read preparation

2.1. Paired read name convention

Since Bowtie/Tophat has the quirk to cleave off /1 and /2 that are commonly used as identifiers of read pairs, and since our filter tools need the pairedness identifier, we recommend that you modify the read names to contain -1 and -2 in the end of the read header to identify read pairs if you want to use paired data. An example of fastq entries that work with our protocol:


Since our protocol currently does not support native paired alignments, you do not have to create separate files for the read mates!

2.2. Quality trimming

In our experiments, we obeserved a decrease in gene prediction accuracy if the reads were quality trimmed prior alignment (we tested accepted error rates of 1% and 5%), although Bowtie/Tophat aligns more reads if they were quality trimmed. Therefore, we recommend to work with untrimmed fastq files.

3. Aligning reads with Bowtie/Tophat (step 1)

You first need to build an index file for your genome:

bowtie2-build genome.masked.fa genome_database

Afterwards, align the RNAseq data to the genome, e.g. using 4 processor cores:

tophat2 -p 4 genome_database rnaseq.fastq

4. Filtering raw alignments (step 1)

Tophat produced an bam file at output_directory/accepted_hits.bam. For BLAT, we originally developed the paired and uniq filter. For BLAT, this filter improves gene prediction accuracy. However, we could not find an improvement in gene prediction accuracy if we filtered Tophat alignments according to that protocol. However, for the sake of completeness, the following shows the paired and uniq option in filterBam.

samtools sort -n output_directory/accepted_hits.bam output_directory/accepted_hits.s
# filter alignments with filterBam
filterBam --uniq --paired --in output_directory/accepted_hits.s.bam --out output_directory/accepted_hits.sf.bam

The option --paired should be omitted if singleton RNAseq data was used!

Later on, we'll need a sam file header. Now, this header will be extracted and stored in a file:

samtools view -H output_directory/accepted_hits.sf.bam > header.txt

5. Creating intron hints (step 1)

samtools sort output_directory/accepted_hits.sf.bam both.ssf 
bam2hints --intronsonly --in=both.ssf.bam --out=hints.gff

6. Run Augustus (step 1)

6.1. Set hint parameters (step 1)

Adjust the file extrinsic.cfg that holds the hint parameters. Start by copying the file config/extrinsic/extrinsic.M.RM.E.W.cfg. In the example above the relevant, non-neutral lines could look like this:


exonpart    1   .992    M 1 1e+100 RM 1 1    E 1 1   W 1 1.005
intron      1   .34     M 1 1e+100 RM 1 1    E 1 1e5 W 1 1
CDSpart     1   1 0.985 M 1 1e+100 RM 1 1    E 1 1   W 1 1
UTRpart     1   1 0.985 M 1 1e+100 RM 1 1    E 1 1   W 1 1
nonexonpart 1   1       M 1 1e+100 RM 1 1.01 E 1 1   W 1 1

[Please note that exonpart, CDSpart and UTRpart hints were not produced, here. This feature might be added to the documentation, later.]

The exonpart malus of .992 means a weak penalty factor for every predicted exonic base that is not supported by any exonpart hints. The exonpart bonus for hints of source W of 1.005 mean that gene structures get this bonus factor for every exonpart hint of multiplicity 1 that is completely included in an exon. Introns that are not supported by any intron hint are penalized by .34, and introns that are supported by RNA-Seq hints are rewarded by a factor of 100,000. The 0.985 are local malus factors. The concept of a local malus was recently introduced to account for different levels of missing information, that becomes more important with with higher general coverage of RNA-Seq. Like the (normal) malus, the local malus also applies to exonic bases that are unsupported by hints. In contract to the normal malus, the local malus only applies to exons, that are well-supported at some region and not supported at another region of the same exon. The following picture illustrates a typical case where the local malus applies.

This UTR exon candidate is very unevenly supported by RNA-Seq coverage and therefore likely to be too long. For every base that is not covered by any exonpart hint a local UTRpart malus factor of 0.985 is applied in addition to the normal exonpart malus of 0.992.

Above parameters have to be seen as an example. The optimal values will depend on various parameters, including

1) Amount of RNA-Seq and percentage of genes expressed in the library. The more transcripts are expressed and covered by RNA-Seq the stronger the malus can be. In the extreme (hypothetical) case, all exonic bases are covered with RNA-Seq and unsupported predicted exonic regions can be punished hard.

2) The stringency of filtering. If, for some other reason, you have chosen to include more than one hit of a read or you allow best alignments also when they are ambiguous, then you may get a significant false positive rate of hints and you may have to reduce bonus factors significantly.

3) Alignment method.

6.2. Run Augustus (step 1)

If you don't want to do iterative alignment, you run AUGUSTUS with the following command and stop following this tutorial here:

augustus --species=yourSpecies --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.gff --allow_hinted_splicesites=atac genome.fa > aug.out

For iterative alignment, proceed below this box.

augustus --species=yourSpecies --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.gff --allow_hinted_splicesites=atac --introns=on --genemodel=complete genome.fa > aug1.out

In case you generated exonparthints (not shown here), we recommend switching the UTR flag to "on" because RNA-Seq covers UTR as well. With --UTR=off, the exonpart hints would be (mis)interpreted to be hints for coding parts of exons. If you did not create exonparthints and if you don't have UTR parameters, you can safely omitt the UTR flag. --allow_hinted_splicesites=atac allows Augustus to predict the (rare) introns that start with AT and end with AC in addition to the GT-AG and GC-AG introns that are allowed by default. --alternatives-from-evidence=true turns on the prediction of alternative splicing. For this, intron hints are in particular informative, as exonpart hints alone will (if unstranded) not yield alternative transcripts.

7. Create an exon-exon junction database (step 2)

All sections starting from here document 'iterative alignment step2' for the Tophat/Bowtie protocol.

cat aug1.out | tee aug.prelim.gff | grep -P "\tintron\t" > aug1.introns.gff

We combine the AUGUSTUS introns with introns that were previously identified by RNAseq mapping and create a unique list of introns:

# in case hints.gff also contains other hints than "intron", you need to filter for "intron", first!
cat hints.gff aug1.introns.gff | perl -ne '@array = split(/\t/, $_);print "$array[0]:$array[3]-$array[4]\n";'| sort -u > introns.lst

Now, we excise the exon-exon-junctions and write them into fasta format: --introns=introns.lst --seq=genome.masked.fa --exex=exex.fa --map=map.psl

Note that by default creates a junction library of 150 nt length, i.e. 75 nt from each adjacent exon are joined. If your RNAseq library has a read length that is substantially longer than 75 bp, you should either adapt to your read length, or treat the RNAseq data like ESTs (e.g. in case of 454 data) instead of treating it like described in this tutorial.

The resulting file exex.fa looks like this:

>exex246182 75-GTAG-75 chr3.:100000692-100002450

The following command builds a bowtie database from the exon-exon junction file:

bowtie2-build exex.fa your_species_exex1

8. Aligning reads with Bowtie (step 2)

At this point, you should work with Bowtie2, instead of calling Bowtie2 via the Tophat2 pipeline, because you don't need to do native spliced alignments since you align reads to exon-exon junction sequences, i.e. the alignments don't have to contain any gaps.

bowtie2 -x your_species_exex1 -U rnaseq.fastq -S bowtie.sam

Discard failed alignments from the output:

samtools view -S -F 4 bowtie.sam > bowtie.F.sam

We need to map the local exex-alignments to global genome level: bowtie.F.sam map.psl >

9. Join data from step 1 and step 2

Each read has been aligned twice now, to the genome and to the exon-exon sequences. From the read-to-genome alignments we only take the unspliced ones (CIGAR does not contain the letter "N"), as the spliced ones are represented in

# discard intron containing alignments from the original bam file
bamtools filter -in output_directory/accepted_hits.bam -out output_directory/accepted_hits.noN.bam -script operation_N_filter.txt
# create a bam file with header from the file
cat header.txt >
samtools view -bS -o
# join bam files
bamtools merge -in -in output_directory/accepted_hits.noN.bam -out both.bam

If bamtools merge does not work for you (there seem to be version specific problems), you can alternatively try using samtools for the same purpose (this suggestion was reported by Philipp Schiffer, thank you!):

samtools merge both.bam output_directory/accepted_hits.noN.bam


samtools sort -n both.bam both.s

10. Filtering raw alignments (step 2)

filterBam --uniq --paired --in both.s.bam --out both.sf.bam

11. Creating intron hints (step 2)

samtools sort both.sf.bam both.ssf
bam2hints --intronsonly --in=both.ssf.bam --out=hints.2.gff

The creation of exonparthints is described in the GSNAP tutorial at

12. Run Augustus (step 2)

augustus --species=yourSpecies --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.2.gff --allow_hinted_splicesites=atac genome.fa > aug2.out

Working with paired bam files

filterBam now supports paired bam files (i.e. the output files of a paired alignment instead of two singleton alignments). Example call:

filterBam --paired --uniq --in in.bam --out out.bam --pairwiseAlignments 

This is a not a simple replacement command to the above pipeline. You might need to adapt sorting steps.

No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact:

Edit - History - Print - Recent Changes - Search
Page last modified on December 02, 2014, at 10:44 AM