Reducing average length of gene structure predictions

Post by katharina

Originally posted in the old forum by Federico Martinelli on 20.03.2014 - 05:53

I am trying to use AUGUSTUS to predict gene structures (both full and partial) in intergenic regions of Arabidopsis thaliana.
Since this is a well-annotated species, I don't expect very long predictions, because long genes should already have been discovered.
I am using two approaches to gene prediction:

a. external hints generated by another user with other software, then fed into AUGUSTUS
b. a protein profile, i.e. augustus --ppx

In either case, my predicted genes are 10K-30K nt long.
In contrast, protein-coding genes according to TAIR (The Arabidopsis Information Resource) are typically around 1K-3K nt long.
So my intergenic gene structure predictions are artifactually long and need to be made shorter.
So which parameters should I play around with so that my predicted gene structures are reduced to biologically meaningful lengths? And which dataset can I use to benchmark my parameter optimization attempts?
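
One way to track whether a parameter change helps is to summarize predicted gene lengths after each run. The following is only a minimal sketch: it assumes tab-separated GFF output in which each predicted gene has a line with feature type `gene`, and the function names, the toy records, and the 3000 nt cutoff (taken from the 1K-3K range above) are all illustrative assumptions, not part of AUGUSTUS.

```python
import statistics


def gene_lengths(gff_lines):
    """Return the length in nt of every 'gene' feature in GFF-style lines."""
    lengths = []
    for line in gff_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns 4 and 5 (1-based) hold start and end; column 3 is the feature type.
        if len(fields) < 5 or fields[2] != "gene":
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)
    return lengths


def summarize(lengths, max_expected=3000):
    """Count genes, report the median length, and count genes above max_expected."""
    if not lengths:
        return {"n": 0, "median": None, "n_too_long": 0}
    return {
        "n": len(lengths),
        "median": statistics.median(lengths),
        "n_too_long": sum(1 for x in lengths if x > max_expected),
    }


# Made-up records for illustration only (not real predictions).
example = [
    "# toy GFF-style output",
    "Chr1\tAUGUSTUS\tgene\t1001\t3000\t0.9\t+\t.\tg1",
    "Chr1\tAUGUSTUS\ttranscript\t1001\t3000\t0.9\t+\t.\tg1.t1",
    "Chr1\tAUGUSTUS\tgene\t50001\t75000\t0.4\t-\t.\tg2",
]

if __name__ == "__main__":
    print(summarize(gene_lengths(example)))
```

Running the same summary on each parameter setting gives a simple number to compare, although it says nothing about whether the shorter predictions are correct.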

I see from other posts that such issues are not unique to me, so I am hoping there are existing fixes for my problem. I have questions related to those posts:

1. ... einFusions ... rOfIntrons
Can the malus value also be reduced for extrinsic hints that were NOT generated from RNA-Seq data?
My external hints were generated with Selenoprofiles2 by an upstream user.

2. ... ortIntrons
Just as a minimum intron size can be set, can a maximum intron size also be set? Mario's reply seems to indicate that it can be specified in the SPECIES_intron_probs.pbl file, correct?
If max-length=100 and the sum of the probabilities of lengths 1-100 is not zero, will it lead to any strange predictions, even though Mario says AUGUSTUS will not complain?
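
To make the arithmetic behind this question concrete, here is a minimal sketch of what happens to a length distribution when it is cut off at a maximum length. It does not read the actual SPECIES_intron_probs.pbl file and makes no claim about how AUGUSTUS treats the remaining probabilities; the function name and the toy values are hypothetical.

```python
def truncate_length_probs(probs, max_length):
    """Keep only lengths in 1..max_length.

    probs maps a length to its probability. Returns the probability mass
    that survives the cutoff and a copy renormalized to sum to 1.
    """
    kept = {length: p for length, p in probs.items() if 1 <= length <= max_length}
    mass = sum(kept.values())
    if mass <= 0:
        raise ValueError("no probability mass left below max_length")
    renormalized = {length: p / mass for length, p in kept.items()}
    return mass, renormalized


# Toy distribution, made up for illustration.
toy_probs = {50: 0.2, 80: 0.3, 150: 0.5}

if __name__ == "__main__":
    mass, renormalized = truncate_length_probs(toy_probs, max_length=100)
    print(mass, renormalized)
```

If the surviving mass is well below 1, the truncated values no longer form a proper distribution unless they are rescaled, which is the situation the question is about.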

3. ... VsPpxModes
I am not sure whether using external hints vs. protein-profile-based searching should make a difference. It does not seem to in my case.

4. ... rinsticCfg
Is there a link explaining which lines to alter in extrinsic.cfg, and how, for hints that were not generated from RNA-Seq data?
