rna-seq

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

rna-seq

Post by katharina »

Originally posted by Assaf in the old forum on 01.05.2013 - 16:40

Hi all,

I see that RNA-Seq-supported predictions in AUGUSTUS become quite accurate when using intron hints from splice-junction data (similar to what is described in http://augustus.gobics.de/binaries/readme.rnaseq.html). The wig hints did not contribute at all in my case.

Still, I see that the program sometimes tends to extend the CDS and exons beyond regions supported by evidence, leading to the concatenation of adjacent genes that are not linked by any intron evidence. I tried lower malus values, but this only made the problem worse (I expected the opposite). Do you have an idea how to solve this problem?

Best,
Assaf

Re: rna-seq

Post by katharina »

Originally posted by Mario in the old forum on 21.05.2013 - 16:55

Yes, lower intron malus values should lead to fewer predicted unsupported introns.
I am giving you an excerpt of an extrinsic file below that I recently used.

        ass        1   1 0.05  M    1  1e+100  RM  1     1    E 1    1    W 1    1
        dss        1   1 0.01  M    1  1e+100  RM  1     1    E 1    1    W 1    1
   exonpart        1      .99  M    1  1e+100  RM  1     1    E 1    1    W 1    1.003
       exon        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
 intronpart        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
     intron        1       .2  M    1  1e+100  RM  1     1    E 1   50    W 1    1
    CDSpart        1    1 .99  M    1  1e+100  RM  1     1    E 1    1    W 1    1
        CDS        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
    UTRpart        1   1 .985  M    1  1e+100  RM  1     1    E 1    1    W 1    1
        UTR        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
     irpart        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
nonexonpart        1        1  M    1  1e+100  RM  1     1.15 E 1    1    W 1    1
The intron malus (here 0.2) should remove some of the unsupported introns. Reducing it further, e.g. to 1e-10, should remove more introns without any hint support. This obviously has limits, as true introns then also vanish from the predictions.
More importantly, the local splice site mali (the numbers in the third column of the ass and dss rows, here 0.05 and 0.01) help more specifically in this respect.
They only apply to candidate splice sites that do not have any hints supporting them (e.g. from intron hints) while the exon that they flank does have hint support. Therefore, this malus does not apply to unexpressed genes when using RNA-seq-based hints.
Assuming that you have a UTR model for your species, I suspect that the exonpart hints from the wiggle file could help as well, in those cases where there are hints in a false positive intron that actually contains two UTRs. You may want to try increasing the exonpart bonus (here 1.003). When varying parameters, do not be shy about increasing or decreasing them a lot.
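
To make the columns of the excerpt easier to read while experimenting with these values, here is a minimal Python sketch that splits one row into its parts. It assumes the layout seen in the excerpt: feature name, bonus, malus, an optional local malus, then one (label, class count, factor) triple per hint source, with exactly one grade class per source. The function name `parse_row` and the dictionary keys are hypothetical and not part of AUGUSTUS.

```python
# Source labels as they appear in the excerpt; other files may use different ones.
SOURCE_LABELS = {"M", "RM", "E", "W"}


def parse_row(line):
    """Split one extrinsic-file row into feature, bonus, malus and per-source factors."""
    tokens = line.split()
    feature = tokens[0]
    # Everything between the feature name and the first source label is numeric:
    # bonus, malus and (only on some rows) a local malus.
    first_source = next(i for i, tok in enumerate(tokens) if tok in SOURCE_LABELS)
    head = [float(tok) for tok in tokens[1:first_source]]
    row = {
        "feature": feature,
        "bonus": head[0],
        "malus": head[1],
        "local_malus": head[2] if len(head) > 2 else None,
        "sources": {},
    }
    # Assumption: one grade class per source, so each source is a triple
    # (label, number of classes, factor), as in the excerpt.
    rest = tokens[first_source:]
    for i in range(0, len(rest), 3):
        label, n_classes, factor = rest[i:i + 3]
        if int(n_classes) != 1:
            raise ValueError("this sketch only handles one grade class per source")
        row["sources"][label] = float(factor)
    return row


intron = parse_row("intron  1  .2  M 1 1e+100  RM 1 1  E 1 50  W 1 1")
print(intron["malus"], intron["sources"]["E"])  # 0.2 50.0
```

With rows parsed this way, it is straightforward to write out several copies of the file with, say, different intron malus or exonpart bonus values and compare the resulting predictions.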

Re: rna-seq

Post by katharina »

Originally posted by Jammie in the old forum on 11.10.2013 - 04:44

Hi all,
When I ran etraining with this command
==
./etraining --species=Flower sequence.gb
==
I got the error "Segmentation fault".
sequence.gb comes from GenBank and is 72M in size.
When I use a smaller file, sequence_small.gb (3.2M),
I do not have this problem.
If I want to use the first .gb file (sequence.gb),
how can I fix this error?
Thanks a lot
Jammie

Re: rna-seq

Post by katharina »

Originally posted by katharina in the old forum on 13.10.2013 - 18:14

You could try to split the larger GenBank file into smaller files in order to find the location of the problem (it's likely caused by an incorrectly formatted entry). After identifying it, remove the error-causing entry from the original file and run etraining again.
Katharina
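
One way to do the splitting suggested above is sketched below in Python. It relies only on the fact that each entry in a GenBank flat file ends with a line containing just `//`. The function names, the `part` prefix and the default of 500 entries per file are illustrative assumptions, not part of AUGUSTUS.

```python
def iter_records(lines):
    """Yield GenBank entries; each entry ends with a line containing only '//'."""
    record = []
    for line in lines:
        record.append(line)
        if line.strip() == "//":
            yield "".join(record)
            record = []
    if any(l.strip() for l in record):
        # Leftover text without a closing '//' is itself a candidate for the bad entry.
        yield "".join(record)


def write_batch(batch, prefix, index):
    """Write one group of entries to e.g. part_0001.gb and return the file name."""
    out_path = f"{prefix}_{index:04d}.gb"
    with open(out_path, "w") as out:
        out.writelines(batch)
    return out_path


def split_genbank(path, records_per_file=500, prefix="part"):
    """Split a GenBank file into consecutive pieces of records_per_file entries."""
    out_paths = []
    batch = []
    with open(path) as handle:
        for record in iter_records(handle):
            batch.append(record)
            if len(batch) == records_per_file:
                out_paths.append(write_batch(batch, prefix, len(out_paths) + 1))
                batch = []
    if batch:
        out_paths.append(write_batch(batch, prefix, len(out_paths) + 1))
    return out_paths
```

Calling `split_genbank("sequence.gb")` and then running etraining on each piece in turn (for example `./etraining --species=Flower part_0001.gb`) should show which piece triggers the segmentation fault. That piece can be split again with a smaller `records_per_file` until the offending entry is isolated. These runs are only meant for locating the problem; the final training run should use the cleaned original file.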