Tutorial - RNAseq reads with TopHat

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

Post Reply
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Tutorial - RNAseq reads with TopHat

Post by katharina »

Originally posted in the old forum by Daniel on 02.12.2014 - 08:52

Dear AUGUSTUS team,

I want to refine gene prediction by incorporating raw Illumina paired-end mRNA-seq reads (2x 27 bp) into AUGUSTUS gene prediction (working with a recently sequenced microalga).

I followed the tutorial published on your website:
http://bioinf.uni-greifswald.de/bioinf/ ... seq.Tophat

I have several questions:

1) reading in the archive (http://bioinf.uni-greifswald.de/bioinf/ ... Prediction), I wondered if I understand it right that I have to run TopHat2 in single-end mode, rather than (intuitively) paired-end mode, although I have paired-end data. Is that correct?
I have two FASTQ files, one for the FW reads, the second for the RV reads. Do I have to merge these two files into a single file, and then run TopHat2 in single-end mode on this combined single file?

2) The script "filterBam" requires the pairedness identifier, i.e. "/1 & /2" or "/f & /r". If these are missing, it runs for ages without yielding any output (or error message; maybe a bug?). So far I used TopHat2 only in paired-end mode, and the pairedness identifier (although I added them to the FASTQ files as described in tutorial step 1) were cleaved away (I tried "...-1 & ...-2" or ".../f & .../r", but it was cleaved away in both cases). I then added "/r" or "/r" manually to the "accepted_hits.bam" file, prior converting it to SAM. This process however is not really straightforward, because I need to map the accepted reads to the complete dataset to get the correct strand information. I wonder if this strategy is correct and if it makes sense at all? I am completely new to this field, so please overlook stupid newbie questions ;)
The alternative would be to skip the "filterBam" step, because the pairedness identifier is only required for this script, is that correct (I didn't find any prerequisite for pairedness in the documentation for "bam2hints")? I wondered if maybe any rewriting of the file occured during "filterBam" - or could I run "bam2hints" directly on the sorted (samtools sort) accepted_hits.bam file? However, you do not recommend skipping the "filterBam" step, don't you?

3) Why is the first output file in step 5 on your tutorial "samtools sort accepted_hits.sf.bam both.ssf" called "both"? This is because the same name is used in step 11 as well.

4) In step 6.1 in your tutorial, the hint parameters are to be set. Where can I find a tutorial with some advice on that and on how to adjust them correctly? Or is your example CFG file given there a general example that should be suitable for most projects?

5) The last step in step 7 and in step 8 of your tutorial, "bowtie" is used, instead of "bowtie2". Is there any argument for using bowie here; or is there any argument for not using bowtie2 instead of bowtie here? Would it make any difference at all?

6) As recommend, I work with the raw reads, i.e. untrimmed. No trimming still applies to TopHat2, and this statement was not restricted to only TopHat1, wasn't it?

I use TopHat2 version "TopHat v2.0.12", and AUGUSTUS version 3.0.3.
Thanks a lot for your effort!
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: Tutorial - RNAseq reads with TopHat

Post by katharina »

by katharina on 02.12.2014 - 10:46
The tutorial is written for singleton alignments of paired reads.
However, filterBam now supports an option --pairwiseAlignments . On a paired bam format file, call it as follows:

Code: Select all

filterBam --paired --uniq --in in.bam --out out.bam --pairwiseAlignments
Applying filterBam with the uniq filter can lead to a gain in specificity of hints, I therefore recommend it even for paired bam format. If you use the option --pairwiseAlignments, you don't need to worry about the read names.
There is no particular reason why the file is named both.ssf in step 3. The only important thing is that intron hints are created from the resulting bam file, and that those intron hints are further used to predict genes with AUGUSTUS.
We don't have a tutorial for the adaptation of parameters in step 6.1. The parameters shown there have been adapted to human genes. They might not be optimal for species with less complex genes. I have used them for D. melanogaster and C. elegans, nevertheless, and they yielded results that look ok (although there might be room for improvement). They did not work well for a fungus with mostly single exon genes. In case of doubt, I suggest you run several predictions with different parameter variations and compare them in a browser to decide for a particular parameter set.
There is no reason to work with bowtie in step 7 and 8, I just forgot to update the tutorial, there. I have updated, it now.
We still recommend to work with untrimmed reads - except if you have limited computational resources. The influence of quality trimming of reads on gene prediction is small (using untrimmed reads yields only a tiny little bit more accurate predictions).
Katharina
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: Tutorial - RNAseq reads with TopHat

Post by katharina »

by Daniel on 09.12.2014 - 10:55
hints.gff file is empty
Hi Katharina,
thank you very much for your detailed answer! That did help me a lot.
I have another question:
the resulting "hints.gff" file is empty, i.e. 0 byte. For test purposes, I mapped my reads only against a single scaffold; but the statements given below and the empty hints.gff file also appear when doing it for the whole genome.
Here is what I executed:
"""
bowtie2-build scaffold01162_masked.fasta genome_db
tophat2 -i 22 -I 2792 -r 184 --mate-std-dev 80 -p 4 -o output_dir genome_db FW.fastq,RV.fastq
this yields a 2.0 MB file, accepted_hits.bam.
samtools sort -n accepted_hits.bam accepted_hits.s
the file (magically?!?) increases in size, now taking up 2.8 MB.
/homes/djaeger/mRNA-seq_data-analysis/augustus-3.0.3/auxprogs/filterBam/bin/filterBam --uniq --paired --in accepted_hits.s.bam --out accepted_hits.sf.bam
this file is now smaller than the accepted_hits.s.bam (which I would expect because a filter is applied), but still bigger than the original accepted_hits.bam file.
samtools sort accepted_hits.sf.bam both.ssf
this file is now the smallest, with 1.5 MB.
/homes/djaeger/mRNA-seq_data-analysis/augustus-3.0.3/auxprogs/bam2hints/bam2hints --intronsonly --in=both.ssf.bam --out=hints.gff
the hints.gff file is 0 B, i.e. empty.
"""
Do you have some advice on why the hints.gff file is empty, no matter if I use only a single scaffold, or the whole genome?
Btw, the hints.2.gff file produced later is not empty, but 363 B. It contains these lines:
scaffold01162 b2h intron 778 1040 0 . . pri=4;src=E
scaffold01162 b2h intron 2335 2581 0 . . mult=24;pri=4;src=E
scaffold01162 b2h intron 4283 4556 0 . . mult=7;pri=4;src=E
scaffold01162 b2h intron 7760 8032 0 . . mult=42;pri=4;src=E
scaffold01162 b2h intron 11695 12002 0 . . mult=1503;pri=4;src=E
scaffold01162 b2h intron 12315 12572 0 . . mult=263;pri=4;src=E
Daniel
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: Tutorial - RNAseq reads with TopHat

Post by katharina »

by Daniel on 09.12.2014 - 10:58
hints.gff file is empty - output filterBam & bam2hints
The output of filterBam is:
"""
processed line 100001
Processed alignments: 117016
Summary of filtered alignments:
unmapped : 0
percent identity: 0
coverage : 0
not paired (our criterion) : 26264
quantiles of unspliced insert lengths:
q[10%]=114,q[20%]=129,q[30%]=144,q[40%]=161,q[50%]=190,q[60%]=227,q[70%]=398,q[80%]=471,q[90%]=552,
not unique : 0
Cmd line:
/homes/djaeger/mRNA-seq_data-analysis/augustus-3.0.3/auxprogs/filterBam/bin/filterBam --uniq --paired --in accepted_hits.s.bam --out accepted_hits.sf.bam
Elapsed time: 004 seconds.
"""
The output of bam2hints if:
"""
Wait a moment, calculating maximum block size that needs to be allocated... .. done
"""
Daniel
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: Tutorial - RNAseq reads with TopHat

Post by katharina »

by katharina on 11.12.2014 - 14:55
It looks like there are no spliced alignments in your bam file. Is it a long scaffold?
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: Tutorial - RNAseq reads with TopHat

Post by katharina »

by Daniel on 18.12.2014 - 10:54
The scaffold is 16,268 nucleotides long. It contains very highly mostly likely at least one gene with at least two introns. The reads are 2x 27bp.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: Tutorial - RNAseq reads with TopHat

Post by katharina »

by katharina on 18.12.2014 - 11:47
Intron hints are generated from spliced alignments. That means in your case that you need at least one of the 27 bp long reads to be aligned in a spliced way by Tophat2.
Do you get any intron hints if you convert the unfiltered bam file to intron hints?
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: Tutorial - RNAseq reads with TopHat

Post by katharina »

by Daniel on 19.12.2014 - 15:06
Hi Katharina, already thanks a lot for your answers! I will check if I obtain any intron hints with the unfiltered bam file after Christmas and New Year.
Post Reply