RNA-Seq based prediction

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

RNA-Seq based prediction

Post by katharina »

Originally posted in the old forum by Assaf on 11.03.2013 - 14:26
Hi there ,
I'm trying to follow the RNA-Seq based prediction, with tophat bam files (http://bioinf.uni-greifswald.de/bioinf/ ... seq.Tophat).
1) Is it possible/OK to skip the filterBam stage ? (you say it is only "for the sake of completeness").
2) How and if several bam files can be combined in one analysis ? the results refer to only one bam file per analysis
Thanks in advanced,
Assaf
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by katharina on 12.03.2013 - 12:36
It is definitely possible to skip the filterBam stage. The statement on the wiki page applies to tophat 1. Gene prediction results that we obtained with tophat 1 alignments were not very good, regardless of any filtering step.
Tophat 2 seems to work better, and I recommend you do not skip the filtering step if you use tophat 2, particularly if you have paired data (mate pairs, paired end): bam2hints cannot handle paired bam format, so you need to filter the singleton alignments if you want to consider pairedness as a criterion.
In principle, you could pool all data so that you get a single bam file. Since data sets are usually huge and computational resources an issue, we usually run the analysis for all input files separately and pool the hints.
Best,
Katharina
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Assaf on 12.03.2013 - 13:30
hi
Dear Katherina,
Thanks a lot,
Just would just like to make it clear to me, because it is not:
Currently, I have fastq paired-end data. For this data, I produced Tophat bam files, with the paired-end mode of Tophat.
So, if I understand correctly, this data cannot be used since it is pair-ended. Therefore I need to run tophat again, after chainging fastq tags /2 /1 to -2 -1, and now in a single-end mode.
Just wondering if I'm correct
Many thanks for you help,
All the best,
Assaf
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by katharina on 12.03.2013 - 19:04
Exactly as you say.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Sam on 13.03.2013 - 13:19
hi
I also used Tophat2 but skipped filterBam stage as after filtering I could not find the more hints.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by katharina on 13.03.2013 - 13:41
You should actually get less hints after filtering. They should be more specific after filtering.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Assaf on 13.03.2013 - 14:31
hi
Thanks katharina,
I also thought that in addition to the method we talked about, I would convert the cufflinks gff files with the spliced alignment into the longest peptide sequences, and then blat it and use the mapping results as hints (http://bioinf.uni-greifswald.de/bioinf/ ... porateESTs)
will this be a good idea ? should this be less efficient then the other method ?
Best,
Assaf
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by katharina on 13.03.2013 - 16:07
It is probably possible to do as you suggest, but you can also simply use the fasta file that is produced by cufflinks and use it in the same way as EST sequences, directly.
Katharina
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Assaf on 18.03.2013 - 17:32
Hi
Hi Katharina and all,
I eventually decided to use gsnap sam files as the input for Augustus RNA-Seq based prediction.
So I started by running gsnap with single end mode, though the data itself is paired ended, as follows:

Code: Select all

cat forward.fastqs reverse.fastqs other.fasqs| gsnap … > my.sam
Then I followed instruction. At the filterBam step I did:

Code: Select all

filterBam --uniq --paired --in "my.ss.bam" --out "my.f.bam"
but the resulting bam file includes only ~25% of the original data (the final bamfilter report is attached below). It reports that most reads are “not paired” ,though the data is based on paired end reads.Maybe you have an idea what went wrong ?
Best,
Assaf

Code: Select all

 processed line 89300001
Summary of filtered alignments:
unmapped : 0 
percent identity: 31845 
coverage : 4552323 
not paired : 54333508 
quantiles of unspliced insert lengths: 
q[10%]=138,q[20%]=164,q[30%]=190,q[40%]=232,q[50%]=354,q[60%]=792,q[70%]=1475,q[80%]=2496,q[90%]=4191, 
unique : 3701953
Cmd line: 
filterBam --uniq --paired --in gsnap_1_se_livHyp_vs_spVii50kbMasked.ss.bam --out gsnap_1_se_livHyp_vs_spVii50kbMasked.f.bam
Elapsed time: 0.000 seconds.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by katharina on 18.03.2013 - 17:51
I think that you are using short read data. There are many positions in the genome where a short read (that maybe even includes a sequencing error) may align to.
We quite frequently observe a big loss of data when applying the pairedness filter. It nevertheless has a positive effect on gene prediction accuracy.
What you could try to get a feeling about the number of expected read pairs for your data is comparing the output of filterBam to the output of a paired alignment run.
Katharina
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Assaf on 18.03.2013 - 21:16
Thanks a lot
Thats a good idea,
Another related thing on the gsnap-based prediction:
The toturial suggests using bam2hints for getting intron hints only, and sending these hints to Augustus. But how Augustus .cfg extrinsic file should look in such case ? should I only allow malus and bonus for the intron, and put 1 in the fields of all other gene parts ?
Best and thanks,
Assaf
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by katharina on 19.03.2013 - 10:36
You could use the default extrinsic.M.RM.E.W.cfg (given that you did not rename the sources in your hint file).
Katharina
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Sammy on 02.07.2013 - 10:52
accuracy
This was a nice discussion. Could someone tell me how to check the enhanced prediction accuracy after adding the hints. I can check the improvement in prediction accuracy while using optimized parameters. When I used hints I am not able to check the enhanced accuracy
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by katharina on 09.07.2013 - 15:24
How do you try to check the accuracy? Maybe post your command for checking accuracy with your optimized parameters, and for comparison your command with the hints?
Katharina
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Sammy on 09.07.2013 - 18:08
I still have problem with low accuracy at gene level. I think first I should do something to get high accuracy at gene level then should add the hints to do prediction again. I used the command --
command line:

Code: Select all

augustus --species=msad --extrinsicCfgFile=msad_new.cfg --alternatives-from-evidence=true --hintsfile=hints.rseqhints.gff --allow_hinted_splicesites=atac --introns=on --genemodel=complete msad_scaffolds.fa
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by katharina on 09.07.2013 - 22:06
In order to get automated gene prediction accuracy value computation from augustus, you must run augustus on a genbank file, not on a fasta format file (as shown in your case above).
Beware, if you prepared the hints as instructed in our wikis, they might not fit your genbank file!
Another solution is to resort to other scripts for computing gene prediction accuracy, e.g. the eval package http://www.biomedcentral.com/1471-2105/4/50
Katharina
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Sammy on 10.07.2013 - 00:43
Yes I prepared the hints as suggested by Augustus wiki. I afraid to obtain genebank file for compete genome of my organism. I will check the eval package also. Thanks
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Andrew on 17.09.2013 - 16:58
problem with intron prediction with tophat2
I'm using Tophat2 with your instructions at:
http://bioinf.uni-greifswald.de/bioinf/ ... seq.Tophat
and everything works OK until the command:

Code: Select all

cat hints.gff aug1.introns.gff | perl -ne 'split; print "@_[0]:@_[3]-@_[4] 
";' | sort -u > introns.lst
The resulting introns.lst file is empty although the hints.gff file is over 9Mb and the other file is over 26Mb. Could there be something in my Unix or Perl system that is different from the one used in the example on the webpage? If you don't know exactly what is wrong could you give me an example of the resulting introns.lst file so that I could write my own command?
Many thanks,
Andrew
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by TW on 18.09.2013 - 16:07
@Andrew
As far as I can figure it out the perl command:

Code: Select all

print "@_[0]:@_[3]-@_[4]";
prints per line
column1:column4-column5
of the input stream where the symbols ":" and "-" are to be included.
The following "sort -u" sorts the resulting stream and assures uniqueness of lines in the final output.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: RNA-Seq based prediction

Post by katharina »

by Andrew on 19.09.2013 - 11:18
Thanks TW. I ended up writing my own script producing a file with the format:
column1:column4-column5
and it was accepted in subsequent steps
Post Reply