Tutorial - RNAseq reads with TopHat
Posted: Thu Nov 19, 2015 7:38 pm
Originally posted in the old forum by Daniel on 02.12.2014 - 08:52
Dear AUGUSTUS team,
I want to refine gene prediction by incorporating raw Illumina paired-end mRNA-seq reads (2x 27 bp) into AUGUSTUS gene prediction (working with a recently sequenced microalga).
I followed the tutorial published on your website:
http://bioinf.uni-greifswald.de/bioinf/ ... seq.Tophat
I have several questions:
1) Reading the archive (http://bioinf.uni-greifswald.de/bioinf/ ... Prediction), I wondered whether I understand correctly that I have to run TopHat2 in single-end mode, rather than (intuitively) paired-end mode, although I have paired-end data. Is that correct?
I have two FASTQ files, one for the FW reads, the second for the RV reads. Do I have to merge these two files into a single file, and then run TopHat2 in single-end mode on this combined single file?
2) The script "filterBam" requires the pairedness identifier, i.e. "/1 & /2" or "/f & /r". If these are missing, it runs for ages without yielding any output (or an error message; maybe a bug?). So far I have used TopHat2 only in paired-end mode, and the pairedness identifiers (although I added them to the FASTQ files as described in tutorial step 1) were stripped (I tried "...-1 & ...-2" or ".../f & .../r", but they were stripped in both cases). I then added "/f" or "/r" manually to the "accepted_hits.bam" file, prior to converting it to SAM. This process, however, is not really straightforward, because I need to map the accepted reads back to the complete dataset to get the correct strand information. I wonder whether this strategy is correct and whether it makes sense at all. I am completely new to this field, so please overlook stupid newbie questions.
The alternative would be to skip the "filterBam" step, since the pairedness identifier is only required for this script. Is that correct? (I didn't find any prerequisite for pairedness in the documentation for "bam2hints".) I wondered whether "filterBam" rewrites the file in any way, or whether I could run "bam2hints" directly on the sorted (samtools sort) accepted_hits.bam file. However, you do not recommend skipping the "filterBam" step, do you?
3) Why is the first output file in step 5 of your tutorial, "samtools sort accepted_hits.sf.bam both.ssf", called "both"? I ask because the same name is used in step 11 as well.
4) In step 6.1 in your tutorial, the hint parameters are to be set. Where can I find a tutorial with some advice on that and on how to adjust them correctly? Or is your example CFG file given there a general example that should be suitable for most projects?
5) In the last step of step 7 and in step 8 of your tutorial, "bowtie" is used instead of "bowtie2". Is there any argument for using bowtie here, or any argument against using bowtie2 instead? Would it make any difference at all?
6) As recommended, I work with the raw reads, i.e. untrimmed. The no-trimming recommendation also applies to TopHat2 and was not restricted to TopHat1, was it?
I use TopHat2 version "TopHat v2.0.12", and AUGUSTUS version 3.0.3.
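To make question 2 concrete, the read-name suffixing from tutorial step 1 could be sketched roughly as below. This is only an illustration under stated assumptions, not the tutorial's own code: the function name add_pair_suffix is hypothetical, every FASTQ record is assumed to span exactly four lines, and placing the suffix before the first space in the header line is a guess about what downstream tools keep.

```python
import io


def add_pair_suffix(lines, suffix):
    """Yield FASTQ lines with `suffix` appended to each read name.

    Assumption: strictly four lines per record (no wrapped sequences).
    The suffix goes before the first space of the header, because many
    tools truncate read names at the first whitespace (an assumption).
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:  # header line of a 4-line FASTQ record
            name, _, comment = line.partition(" ")
            if not name.endswith(suffix):
                name += suffix
            line = f"{name} {comment}" if comment else name
        yield line + "\n"


# Tiny in-memory example; real use would iterate over an opened FASTQ file.
example = io.StringIO("@read1 extra\nACGT\n+\nIIII\n")
print("".join(add_pair_suffix(example, "/1")), end="")
```

The forward reads would get "/1" and the reverse reads "/2" (or "/f" and "/r"). Whether TopHat2 keeps these suffixes in accepted_hits.bam is exactly what question 2 asks.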
Thanks a lot for your effort!