Reducing average length of gene structure predictions



Post by katharina

Originally posted in the old forum by Federico Martinelli on 20.03.2014 - 05:53

I am trying to use AUGUSTUS to predict gene structures (both complete and partial) in intergenic regions of Arabidopsis thaliana.
Since this is a well-annotated species, I don't expect very long predictions, because genes of that size should already have been discovered.
I am using two approaches to gene prediction:

a. external hints generated by another user with other software and then fed into AUGUSTUS
b. a protein profile, i.e. augustus --ppx

In either case, my predicted genes are 10K-30K nt long.
In contrast, protein-coding genes in TAIR (The Arabidopsis Information Resource) are typically around 1K-3K nt long.
So my intergenic gene structure predictions are artificially long and need to be shortened.
Which parameters should I adjust so that my predicted gene structures come down to biologically meaningful lengths? And which dataset can I use to benchmark my parameter optimization attempts?
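
For anyone wanting to quantify the length problem before and after changing parameters, here is a minimal sketch. It assumes the predictions are in tab-separated GFF with one `gene` feature line per predicted gene; the function names, the 3000 nt threshold, and the example lines are made up for illustration and are not part of AUGUSTUS.

```python
import statistics


def gene_lengths(gff_lines):
    """Return the genomic span (end - start + 1) of every 'gene' feature."""
    lengths = []
    for line in gff_lines:
        # Skip blank lines and GFF comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # GFF columns: seqname, source, feature, start, end, ...
        if len(cols) < 5 or cols[2] != "gene":
            continue
        start, end = int(cols[3]), int(cols[4])
        lengths.append(end - start + 1)
    return lengths


def summarize(lengths, max_expected=3000):
    """Count genes, report the median span, and count spans above a threshold."""
    return {
        "n": len(lengths),
        "median": statistics.median(lengths) if lengths else None,
        "n_too_long": sum(1 for x in lengths if x > max_expected),
    }


# Hypothetical example lines, not real predictions.
example = [
    "# comment line\n",
    "chr1\tAUGUSTUS\tgene\t1001\t3000\t0.9\t+\t.\tg1\n",
    "chr1\tAUGUSTUS\tCDS\t1001\t1500\t0.9\t+\t0\ttranscript_id \"g1.t1\"\n",
    "chr1\tAUGUSTUS\tgene\t5001\t30000\t0.5\t-\t.\tg2\n",
]
print(summarize(gene_lengths(example)))
```

Running the same summary on the reference annotation for the same regions would give a baseline to compare against.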

I see from other posts that such issues are not unique to me, so I am hoping there are existing fixes for my problem. I have questions related to those posts:

1. http://bioinf.uni-greifswald.de/bioinf/ ... einFusions
http://bioinf.uni-greifswald.de/bioinf/ ... rOfIntrons
Can the malus value be reduced for extrinsic hints that were NOT generated from RNA-Seq data?
My external hints were generated with Selenoprofiles2 by an upstream user.

2. http://bioinf.uni-greifswald.de/bioinf/ ... ortIntrons
Just as the minimum intron size can be fixed, can a maximum intron size also be set? Mario's reply seems to indicate that it can be specified in the SPECIES_intron_probs.pbl file. Is that correct?
If max-length=100 and the sum of the probabilities for lengths 1-100 is not zero, will that lead to any strange predictions, even though Mario says AUGUSTUS will not complain?

3. http://bioinf.uni-greifswald.de/bioinf/ ... VsPpxModes
I am not sure whether using external hints vs. protein-profile-based searching makes a difference; it does not seem to in my case.

4. http://bioinf.uni-greifswald.de/bioinf/ ... rinsticCfg
Is there a link explaining which lines to alter in extrinsic.cfg, and how, for hints that were not generated from RNA-Seq data?
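
To make the concern in question 2 concrete: when a length distribution is cut off at a maximum length, the probabilities that remain no longer sum to 1 unless they are rescaled. The sketch below uses a made-up distribution held in a plain dict. It does not read or write the actual SPECIES_intron_probs.pbl format, and whether AUGUSTUS needs or performs such a rescaling is not established in this thread.

```python
def truncate_and_renormalize(probs, max_length):
    """Drop lengths above max_length and rescale the rest to sum to 1."""
    kept = {length: p for length, p in probs.items() if length <= max_length}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no probability mass at or below max_length")
    return {length: p / total for length, p in kept.items()}


# Hypothetical toy distribution: intron length -> probability.
toy = {50: 0.2, 80: 0.3, 100: 0.1, 500: 0.4}
truncated = truncate_and_renormalize(toy, max_length=100)
print(truncated)
print(sum(truncated.values()))
```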