Incorporating Illumina RNAseq into AUGUSTUS with Tophat
This document describes a method for structurally annotating a genome based on deep sequencing of a transcriptome, RNA-Seq. The guide below was adapted from a description of the method we initially developed for and applied in the RNA-Seq based Genome Annotation Assessment Project (rGASP) that can be found at http://bioinf.uni-greifswald.de/augustus/binaries/readme.rnaseq.html. For rGASP, Mario used BLAT. Using Tophat, the general approach is to generate intron hints for Augustus from the RNA-Seq, which can be used together with hints from other sources if available (like from an existing gene models, ESTs, protein or genomic conservation, MS/MS). You may adjust the below procedure for your needs, for example, if you choose to first assemble RNA-Seq reads to contigs, you may just just treat them as other larger transcript fragments such as ESTs and use BLAT/GMAP+blat2hints.pl to generate the hints.
Currently, this documentation only comprises the generation of intronhints from RNAseq data with Bowtie/Tophat. Intronhints are the most important hints that can be generated from RNAseq data. According to our experience, intronhints also produce good gene predictions if not UTR model is available, that means intronhints are suitable for most genome annotation projects. Exonparthints are in many cases a valuable addition if a good UTR model exists. Instructions for the generation of exonparthints from a bam file can be found in our GSNAP tutorial at http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.GSNAP .
You may incorporate RNAseq data with Tophat/Bowtie in a single-step or in a two-step approach. We call the two-step approach 'iterative mapping'. Single-step means that you map RNAseq data against the genome, create hints from the alignments, predict genes with AUGUSTUS with these genes, and you're done. The optional second step may increase the number of reads aligned to a splice site and may therefore increase gene prediction accuracy. For the second step, you need to generate a exon-exon junction file that you provide to Bowtie as a database. After the second round of alignment, you map the alignments back to genome level, create new hints and run AUGUSTUS with the new hints, again. We recommend the two-step method!
Illustration of iterative mapping
Before using Bowtie/Tophat for predicting genes with AUGUSTUS, consider the following advantages and disadvantages:
If you decide to use Bowtie/Tophat, we strongly recommend that you use Bowtie2/Tophat2 because Tophat2 has an increased sensitivity for intron discovery when compared to Tophat(1).
This protocol was tested with the following versions:
Please read the installation instructions of each toolkit carefully and follow their advice!
1. Genome preparation
We recommend that you repeat mask the genome prior mapping RNAseq data to the genome. AUGUSTUS should later be run on the unmasked genome, though. (You should rather include nonexonpart hints generated from repeat regions than running AUGUSTUS on the repeat masked genome.)
2. Read preparation
2.1. Paired read name convention
Since Bowtie/Tophat has the quirk to cleave off /1 and /2 that are commonly used as identifiers of read pairs, and since our filter tools need the pairedness identifier, we recommend that you modify the read names to contain -1 and -2 in the end of the read header to identify read pairs if you want to use paired data. An example of fastq entries that work with our protocol:
@HWUSI-EAS1671_0001:5:1:1022:10290-1 NCCACTTCTACGATGCCATGGATGGGCAGATACAGGGCAGCGTGGAGCAGGCAGGCCAAGGNNNGGNNAAGGAAG +HWUSI-EAS1671_0001:5:1:1022:10290-1 ########################################################################### @HWUSI-EAS1671_0001:5:1:1022:10290-2 NGCAGCTNNNNNAGGACTNNNNNNNNNNGCTNNNNCNNGNNCCTCCTCTGGAGCAAAACATGACCGGCGTTGGGG +HWUSI-EAS1671_0001:5:1:1022:10290-2 ###########################################################################
Since our protocol currently does not support native paired alignments, you do not have to create separate files for the read mates!
2.2. Quality trimming
In our experiments, we obeserved a decrease in gene prediction accuracy if the reads were quality trimmed prior alignment (we tested accepted error rates of 1% and 5%), although Bowtie/Tophat aligns more reads if they were quality trimmed. Therefore, we recommend to work with untrimmed fastq files.
3. Aligning reads with Bowtie/Tophat (step 1)
You first need to build an index file for your genome:
bowtie2-build genome.masked.fa genome_database
Afterwards, align the RNAseq data to the genome, e.g. using 4 processor cores:
tophat2 -p 4 genome_database rnaseq.fastq
4. Filtering raw alignments (step 1)
Tophat produced an bam file at
samtools sort -n output_directory/accepted_hits.bam output_directory/accepted_hits.s # filter alignments with filterBam filterBam --uniq --paired --in output_directory/accepted_hits.s.bam --out output_directory/accepted_hits.sf.bam
Later on, we'll need a sam file header. Now, this header will be extracted and stored in a file:
samtools view -H output_directory/accepted_hits.sf.bam > header.txt
5. Creating intron hints (step 1)
samtools sort output_directory/accepted_hits.sf.bam both.ssf bam2hints --intronsonly --in=both.ssf.bam --out=hints.gff
6. Run Augustus (step 1)
6.1. Set hint parameters (step 1)
Adjust the file
[SOURCES] M RM E W exonpart 1 .992 M 1 1e+100 RM 1 1 E 1 1 W 1 1.005 intron 1 .34 M 1 1e+100 RM 1 1 E 1 1e5 W 1 1 CDSpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1 UTRpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1 nonexonpart 1 1 M 1 1e+100 RM 1 1.01 E 1 1 W 1 1
[Please note that exonpart, CDSpart and UTRpart hints were not produced, here. This feature might be added to the documentation, later.]
The exonpart malus of .992 means a weak penalty factor for every predicted exonic base that is not supported by any exonpart hints. The exonpart bonus for hints of source W of 1.005 mean that gene structures get this bonus factor for every exonpart hint of multiplicity 1 that is completely included in an exon. Introns that are not supported by any intron hint are penalized by .34, and introns that are supported by RNA-Seq hints are rewarded by a factor of 100,000. The 0.985 are local malus factors. The concept of a local malus was recently introduced to account for different levels of missing information, that becomes more important with with higher general coverage of RNA-Seq. Like the (normal) malus, the local malus also applies to exonic bases that are unsupported by hints. In contract to the normal malus, the local malus only applies to exons, that are well-supported at some region and not supported at another region of the same exon. The following picture illustrates a typical case where the local malus applies.
This UTR exon candidate is very unevenly supported by RNA-Seq coverage and therefore likely to be too long. For every base that is not covered by any exonpart hint a local UTRpart malus factor of 0.985 is applied in addition to the normal exonpart malus of 0.992.
Above parameters have to be seen as an example. The optimal values will depend on various parameters, including
1) Amount of RNA-Seq and percentage of genes expressed in the library. The more transcripts are expressed and covered by RNA-Seq the stronger the malus can be. In the extreme (hypothetical) case, all exonic bases are covered with RNA-Seq and unsupported predicted exonic regions can be punished hard.
2) The stringency of filtering. If, for some other reason, you have chosen to include more than one hit of a read or you allow best alignments also when they are ambiguous, then you may get a significant false positive rate of hints and you may have to reduce bonus factors significantly.
3) Alignment method.
6.2. Run Augustus (step 1)
If you don't want to do iterative alignment, you run AUGUSTUS with the following command and stop following this tutorial here:
augustus --species=yourSpecies --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.gff --allow_hinted_splicesites=atac genome.fa > aug.out
For iterative alignment, proceed below this box.
augustus --species=yourSpecies --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.gff --allow_hinted_splicesites=atac --introns=on --genemodel=complete genome.fa > aug1.out
In case you generated exonparthints (not shown here), we recommend switching the UTR flag to "on" because RNA-Seq covers UTR as well. With
7. Create an exon-exon junction database (step 2)
All sections starting from here document 'iterative alignment step2' for the Tophat/Bowtie protocol.
We combine the AUGUSTUS introns with introns that were previously identified by RNAseq mapping and create a unique list of introns:
# in case hints.gff also contains other hints than "intron", you need to filter for "intron", first! cat hints.gff aug1.introns.gff | perl -ne '@array = split(/\t/, $_);print "$array:$array-$array\n";'| sort -u > introns.lst
Now, we excise the exon-exon-junctions and write them into fasta format:
Note that intron2exex.pl by default creates a junction library of 150 nt length, i.e. 75 nt from each adjacent exon are joined. If your RNAseq library has a read length that is substantially longer than 75 bp, you should either adapt intron2exex.pl to your read length, or treat the RNAseq data like ESTs (e.g. in case of 454 data) instead of treating it like described in this tutorial.
The resulting file exex.fa looks like this:
>exex246182 75-GTAG-75 chr3.:100000692-100002450 GTTTGGCATCATGGGATGGTATTTTAGACTTGCCAGAACAGAACACTATTCACAAAGATTGCCTGCAGTTTATTGACCAGCTTTCAGTGCCAGAGGAGAAGGCAGCAGAATTACTTTTGGATATTGAATCTGTAATTACCTTTTATTGTA
The following command builds a bowtie database from the exon-exon junction file:
8. Aligning reads with Bowtie (step 2)
At this point, you should work with Bowtie2, instead of calling Bowtie2 via the Tophat2 pipeline, because you don't need to do native spliced alignments since you align reads to exon-exon junction sequences, i.e. the alignments don't have to contain any gaps.
bowtie2 -x your_species_exex1 -U rnaseq.fastq -S bowtie.sam
Discard failed alignments from the output:
samtools view -S -F 4 bowtie.sam > bowtie.F.sam
We need to map the local exex-alignments to global genome level:
9. Join data from step 1 and step 2
Each read has been aligned twice now, to the genome and to the exon-exon sequences. From the read-to-genome alignments we only take the unspliced ones (CIGAR does not contain the letter "N"), as the spliced ones are represented in
# discard intron containing alignments from the original bam file bamtools filter -in output_directory/accepted_hits.bam -out output_directory/accepted_hits.noN.bam -script operation_N_filter.txt # create a bam file with header from the bowtie.global.sam file cat header.txt bowtie.global.sam > bowtie.global.h.sam samtools view -bS -o bowtie.global.h.bam bowtie.global.h.sam # join bam files bamtools merge -in bowtie.global.h.bam -in output_directory/accepted_hits.noN.bam -out both.bam
If bamtools merge does not work for you (there seem to be version specific problems), you can alternatively try using samtools for the same purpose (this suggestion was reported by Philipp Schiffer, thank you!):
samtools merge both.bam bowtie.global.h.bam output_directory/accepted_hits.noN.bam
10. Filtering raw alignments (step 2)
filterBam --uniq --paired --in both.s.bam --out both.sf.bam
11. Creating intron hints (step 2)
samtools sort both.sf.bam both.ssf bam2hints --intronsonly --in=both.ssf.bam --out=hints.2.gff
The creation of exonparthints is described in the GSNAP tutorial at http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.GSNAP
12. Run Augustus (step 2)
augustus --species=yourSpecies --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --hintsfile=hints.2.gff --allow_hinted_splicesites=atac genome.fa > aug2.out
Working with paired bam files
filterBam --paired --uniq --in in.bam --out out.bam --pairwiseAlignments
This is a not a simple replacement command to the above pipeline. You might need to adapt sorting steps.
No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact: firstname.lastname@example.org