AUGUSTUS Forum

Posted: **Fri Nov 20, 2015 12:59 pm**

Originally posted in the old forum by mark on 26.03.2013 - 16:56

I have paired-end Illumina reads in _1.fastq and _2.fastq files, but the identifiers don't end in /1 and /2. Note also that only one line per record carries the identifier (the one starting with @); the identifier has been removed from the line starting with +, probably because it's redundant.

$ bzcat f_1.fastq.bz2 | head -n 8 
@HWI-ST486:365:C16E0ACXX:3:1101:1382:1950 1:N:0:CCGTCC 
NGAAATCATCACCGAAGAAGTCACCAAGTCTGACTTGAAACAATTGGTTGG 
+
1=DDDDDDFFDDIIDIIIIB9CFEIDEEEIIIIIIIIIIIIIIIEIEIII
@HWI-ST486:365:C16E0ACXX:3:1101:1451:1958 1:N:0:CCGTCC 
NTTGATTTTAAATCAGCCGTAGTTACATGTCTGGTCGAATCTTCGGTACAT 
+
1=DDDFFHHHHHJJIJJJJJJJJJIJJJIJIJIIJJIJJJIJHFGGGIII
$ bzcat f_2.fastq.bz2 | head -n 8 
@HWI-ST486:365:C16E0ACXX:3:1101:1382:1950 2:N:0:CCGTCC 
GGCTTCTTCAATACCTTAACCTTGCGGATGTAGACATCGTGCAATGGGTAG 
+ 
@CCFFDFFHHHHHDHGIJGIJGGIGIJDGHADG@FHGIIJIIIHIJGIFFB 
@HWI-ST486:365:C16E0ACXX:3:1101:1451:1958 2:N:0:CCGTCC 
GAACGTTTGCAGTATACCCGTGATTGCATTTGCTTGGATTTTTGTCCTGAA 
+ 
@@CFFFFFHHHHHIJIJJJIHJGIJIIGIJJJJJJJIHJIJJJJJJIIJIG

How should I reformat these reads so that I can use them to generate hints for AUGUSTUS?
http://bioinf.uni-greifswald.de/bioinf/ ... seq.Tophat
Thanks in advance, Mark

Posted: **Fri Nov 20, 2015 12:59 pm**

by katharina on 27.03.2013 - 12:25
I suggest you simply chop off everything in the read headers (the 1st, 5th, 9th, ... line of each file, which begins with an @) that occurs after the first space, and append a suffix that indicates which mate of the pair the read is.
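As a rough sketch of that header rewrite (not an official AUGUSTUS script), something like the Python below could work. The function names `rename_header`, `rename_fastq` and `convert_file` are made up for this example, and it assumes well-formed four-line FASTQ records. Headers are picked by position rather than by a leading @, because quality lines can also start with @, as in the f_2 sample above.

```python
import bz2


def rename_header(header, mate):
    """Keep the read ID up to the first whitespace and append /1 or /2."""
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line: %r" % header)
    read_id = header.split(None, 1)[0]
    return "%s/%d" % (read_id, mate)


def rename_fastq(lines, mate):
    """Yield FASTQ lines (without newlines), rewriting line 1 of each record."""
    for i, line in enumerate(lines):
        line = line.rstrip("\r\n")
        if i % 4 == 0:
            # Header line: drop everything after the first space, add the mate suffix.
            yield rename_header(line, mate)
        else:
            # Sequence, '+' and quality lines pass through unchanged.
            yield line


def convert_file(in_path, out_path, mate):
    """Read a bzip2-compressed FASTQ file and write a renamed plain-text copy."""
    with bz2.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for line in rename_fastq(src, mate):
            dst.write(line + "\n")
```

For the files in the question this would be called once per mate, for example `convert_file("f_1.fastq.bz2", "f_1.renamed.fastq", 1)` and `convert_file("f_2.fastq.bz2", "f_2.renamed.fastq", 2)`; the output file names are placeholders.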
Before settling on a suffix for the entire file, I suggest you build a toy dataset and check in the TopHat output whether the current version strips the pair-indicating suffix or keeps it. If it gets stripped, choose a different pair-indicating string that survives in the output, and afterwards replace it in the output with one of the suffixes accepted by our tools (/1, /2 or -1, -2).
Katharina
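The toy-dataset check and the later suffix swap could be sketched along these lines in Python. This is only an illustration: `first_records`, `restore_suffix` and the placeholder string `_mate` are hypothetical names chosen for the example, and whether any particular placeholder survives the aligner has to be verified on the toy data.

```python
from itertools import islice


def first_records(lines, n_records):
    """Return the lines of the first n_records FASTQ records (4 lines each)."""
    return list(islice(lines, 4 * n_records))


def restore_suffix(read_name, placeholder="_mate", accepted="/"):
    """Turn e.g. 'read_mate1' into 'read/1'; leave other names untouched."""
    for mate in ("1", "2"):
        ending = placeholder + mate
        if read_name.endswith(ending):
            # Swap the temporary ending for one of the accepted forms (/1, /2).
            return read_name[: -len(ending)] + accepted + mate
    return read_name
```

Passing `accepted="-"` would give the -1, -2 form instead. `first_records` takes any iterable of lines, so it can be fed an open file handle to cut out a small test set.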

how to correctly use RNA-seq data
