Transcript sensitivity/specificity question

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald

Post by katharina »

Originally posted in the old forum by MHoeppner on 30.04.2014 - 08:16
Hi,
I have been struggling to train an AUGUSTUS model that has acceptable sensitivity/specificity on the transcript level. I followed the tutorial on building a new model, but feel that I must be missing something...
My approach:
Generated an evidence-based annotation with MAKER 2.31 for Populus trichocarpa using curated reference proteins for the species from UniProt and available EST data from EMBL
Selected models that had an AED score of <= 0.2 and a predicted protein product of >= 70 AA (basically in good agreement with all data)
Manually selected models that were in good agreement with the evidence alignments, had a proper start/stop and at least 2kb distance to neighboring genes (since I wanted to include 2kb flanks when creating the GenBank file)
No UTRs were annotated/used
Protein products of all models were reciprocally blasted to check for potential clusters/homologs (none were found)
For now, I selected 400 models, converted them to GenBank format using the supplied script (gff2gbSmallDNA.pl) and split them into 300 models for training and 100 for testing. This set is obviously a bit on the small side, but I would hope that it gives a first impression.
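The filtering and splitting steps listed above can be sketched in Python. This is only an illustration under stated assumptions, not the script that was actually used: it assumes the MAKER GFF3 stores the score in an `_AED` attribute on `mRNA` features, it approximates protein length as summed CDS length divided by 3 (which counts the stop codon if it lies inside the CDS), and the function names `select_models` and `split_train_test` are made up for this example.

```python
import random
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def select_models(gff_lines, max_aed=0.2, min_aa=70):
    """Return sorted mRNA IDs with AED <= max_aed and roughly >= min_aa codons.

    Assumption: the AED score is stored as '_AED' on mRNA features.
    """
    aed = {}
    cds_bp = defaultdict(int)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "mRNA" and "_AED" in attrs and "ID" in attrs:
            aed[attrs["ID"]] = float(attrs["_AED"])
        elif cols[2] == "CDS":
            # GFF3 coordinates are 1-based and inclusive.
            length = int(cols[4]) - int(cols[3]) + 1
            for parent in attrs.get("Parent", "").split(","):
                cds_bp[parent] += length
    # Codon count is used as a stand-in for protein length.
    return sorted(
        mrna_id
        for mrna_id, score in aed.items()
        if score <= max_aed and cds_bp[mrna_id] // 3 >= min_aa
    )


def split_train_test(ids, n_test=100, seed=1):
    """Shuffle reproducibly and hold out n_test models for testing."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]
```

The manual checks (agreement with evidence, proper start/stop, 2kb distance to neighbours) are not covered here; the sketch only reproduces the two numeric thresholds and the 300/100 split.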
The results against the test set are quoted below. This is basically how all my attempts at training augustus have ended. Good nucleotide level values, decent exon-level values and basically no luck for the transcript level.
My question: What can I do to improve that? I can see from some of the included models that much higher values are possible; it just isn't clear to me what was done to achieve that quality.

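---------------------------------------------\
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.947 |       0.875 |
---------------------------------------------/

----------------------------------------------------------------------------------------------------------\
           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
           | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
----------------------------------------------------------------------------------------------------------|
           |        |        |      |                380 |                296 |             |             |
exon level |   1088 |   1004 |  708 | ------------------ | ------------------ |       0.705 |       0.651 |
           |   1088 |   1004 |      |  206 |   15 |  159 |  210 |   23 |   63 |             |             |
----------------------------------------------------------------------------------------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level |   121 |   100 |    1 |  120 |   99 |        0.01 |     0.00826 |
----------------------------------------------------------------------------/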
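
For reference, the two rate columns follow directly from the counts in the table: sensitivity is TP / #anno, and what the evaluation calls specificity is TP / #pred. The short Python sketch below reproduces the reported values. The last helper is a rough back-of-the-envelope estimate that is not part of the AUGUSTUS output: it assumes every exon is predicted correctly independently of the others, which real predictions do not satisfy, and only illustrates why the transcript-level rate drops quickly once a gene has many exons.

```python
def sensitivity(tp, n_annotated):
    """Fraction of annotated features that were predicted exactly."""
    return tp / n_annotated


def specificity(tp, n_predicted):
    """Fraction of predicted features that match an annotation exactly."""
    return tp / n_predicted


def naive_exact_gene_rate(exon_sensitivity, exons_per_gene):
    """Chance that all exons of a gene are correct, assuming independence.

    The independence assumption is a simplification made for this sketch.
    """
    return exon_sensitivity ** exons_per_gene


if __name__ == "__main__":
    # Counts taken from the evaluation table above.
    print("exon level:", sensitivity(708, 1004), specificity(708, 1088))
    print("gene level:", sensitivity(1, 100), specificity(1, 121))
    # 1004 annotated exons over 100 annotated genes.
    print("naive gene rate:", naive_exact_gene_rate(708 / 1004, 1004 / 100))
```

Under that naive assumption a few percent of genes would still be expected to match exactly, so a rate of 1 in 100 is at least compatible with a systematic error that touches nearly every gene rather than with random exon errors alone.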

Re: Transcript sensitivity/specificity question

Post by katharina »

by katharina on 30.04.2014 - 10:03
Did you actually look at the gene structures in a browser? I recommend you do that, in context with the training genes (i.e. display both predictions and training genes). Then check what's wrong with the predictions.
From the numbers alone, I assume that the stop codon excluded option is not set the same way for the predictions as for the reference (= training examples).
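
One way to test this hypothesis without a browser is to compare where the coding sequence ends in each prediction and in the matching training gene. The Python sketch below is a minimal illustration under assumptions: predictions and reference genes are taken to be already paired by a shared key (for example the name of the GenBank record, since each record holds one gene), CDS intervals are given as 1-based inclusive `(start, end)` tuples, and the function names are hypothetical. If nearly all pairs show an offset of exactly +3 or -3, the stop codon is included in the CDS on one side and excluded on the other. The relevant setting is, as far as I know, `stopCodonExcludedFromCDS` in the species parameter file.

```python
from collections import Counter


def cds_three_prime_end(strand, cds_intervals):
    """Coordinate where the coding sequence ends in transcript direction."""
    if strand == "+":
        return max(end for _, end in cds_intervals)
    return min(start for start, _ in cds_intervals)


def stop_offsets(reference, predicted):
    """Tally how far each predicted CDS end lies downstream of the reference.

    Both arguments map a pairing key to (strand, [(start, end), ...]).
    Positive offsets mean the prediction extends further downstream.
    """
    offsets = Counter()
    for key, (strand, ref_cds) in reference.items():
        if key not in predicted:
            continue
        pred_strand, pred_cds = predicted[key]
        if pred_strand != strand:
            continue
        delta = cds_three_prime_end(strand, pred_cds) - cds_three_prime_end(strand, ref_cds)
        # On the minus strand, downstream means smaller coordinates.
        offsets[delta if strand == "+" else -delta] += 1
    return offsets
```

A tally dominated by 0 would instead point away from the stop codon setting and towards genuine differences in the gene structures.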