by suzuki on 29.07.2012 - 10:05
Following the manual "Training AUGUSTUS" (http://bioinf.uni-greifswald.de/augustu ... ining.html),
I trained AUGUSTUS on Paramecium tetraurelia (ptetraurelia_annotation_v1.gb) on Linux.
Code:
randomSplit.pl ptetraurelia_annotation_v1.gb 100
grep -c LOCUS ptetraurelia_annotation_v1.gb*
# ptetraurelia_annotation_v1.gb:697
# ptetraurelia_annotation_v1.gb.test:100
# ptetraurelia_annotation_v1.gb.train:597
new_species.pl --species=paramecium
Following the section 'SPECIAL CASE: ORGANISM WITH DIFFERENT GENETIC CODE' of the manual,
I edited the parameter file ($AUGUSTUS_CONFIG_PATH/species/paramecium/paramecium_parameters.cfg) so that only TGA is treated as a stop codon.
Code:
translation_table 6
/Constant/amberprob 0 # Prob(stop codon = tag), if 0 tag is assumed to code for amino acid
/Constant/ochreprob 0 # Prob(stop codon = taa), if 0 taa is assumed to code for amino acid
/Constant/opalprob 1 # Prob(stop codon = tga), if 0 tga is assumed to code for amino acid
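To illustrate what these settings mean in practice, here is a toy sketch. It is not part of AUGUSTUS; the function name first_stop and the example sequence are invented for illustration. It reports the first in-frame stop codon of a sequence under a given set of stop codons, so the standard code (TAA, TAG, TGA) can be compared with table 6, where only TGA terminates.

```shell
# first_stop SEQ "STOP1 STOP2 ...": print the 1-based index of the first
# in-frame stop codon in SEQ (reading frame 0), or 0 if there is none.
first_stop() {
    awk -v seq="$1" -v stops="$2" 'BEGIN {
        seq = toupper(seq)
        n = split(stops, s, " ")
        for (i = 1; i <= n; i++) is_stop[s[i]] = 1
        for (pos = 1; pos + 2 <= length(seq); pos += 3) {
            codon = substr(seq, pos, 3)
            if (codon in is_stop) { idx = (pos + 2) / 3; print idx; exit }
        }
        print 0
    }'
}

toy=ATGCAATAAGGTTAGTGA            # ATG CAA TAA GGT TAG TGA (made-up sequence)
first_stop "$toy" "TAA TAG TGA"   # standard code: stops at codon 3
first_stop "$toy" "TGA"           # table 6: stops at codon 6
```

With the standard stop set the toy ORF ends at the third codon; with only TGA as a stop it runs to the sixth, which is why the stop codon probabilities have to match the organism's genetic code.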
Then I ran an initial training and tested the result on the held-out test set.
Code:
etraining --species=paramecium ptetraurelia_annotation_v1.gb.train
ls -ort $AUGUSTUS_CONFIG_PATH/species/paramecium/
augustus --species=paramecium ptetraurelia_annotation_v1.gb.test | tee firsttest.out # takes 17m
grep -A 22 Evaluation firsttest.out
Here is the accuracy report at the end of firsttest.out.
Code:
******* Evaluation of gene prediction *******
---------------------------------------------
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.803 |       0.964 |
---------------------------------------------/
----------------------------------------------------------------------------------------------------------
           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
           | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
----------------------------------------------------------------------------------------------------------|
           |        |        |      |              11091 |              18586 |             |             |
exon level |  14148 |  21643 | 3057 | ------------------ | ------------------ |       0.141 |       0.216 |
           |  14148 |  21643 |      | 7222 | 3434 |  435 | 7621 | 6814 | 4151 |             |             |
----------------------------------------------------------------------------------------------------------/
----------------------------------------------------------------------------
transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level |  5531 |  6666 |  739 | 4792 | 5927 |       0.111 |       0.134 |
----------------------------------------------------------------------------/
At the gene level, sensitivity and specificity are 0.111 and 0.134, respectively (of the 6666 annotated genes, 739 were predicted exactly).
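The values in the report are consistent with sensitivity = TP / #anno and specificity = TP / #pred, so they can be re-derived from the counts. A minimal sketch (the helper name accuracy is made up here, it is not an AUGUSTUS tool):

```shell
# accuracy TP PRED ANNO: print "sensitivity specificity" to three decimals,
# taking sensitivity = TP / ANNO and specificity = TP / PRED.
accuracy() {
    awk -v tp="$1" -v pred="$2" -v anno="$3" \
        'BEGIN { printf "%.3f %.3f\n", tp / anno, tp / pred }'
}

accuracy 739 5531 6666      # gene level: 0.111 0.134
accuracy 3057 14148 21643   # exon level: 0.141 0.216
```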
When the tetrahymena parameters were used on the same test set,
augustus --species=tetrahymena ptetraurelia_annotation_v1.gb.test
the gene-level sensitivity and specificity were 0.0101 and 0.0311, respectively (of the 6666 annotated genes, 67 were predicted exactly).
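To compare several runs without reading each full report, the gene-level row could be pulled out of the saved output. This is only a sketch that assumes the report layout of firsttest.out; gene_level is a hypothetical helper, and the tetrahymena run would first have to be saved to a file of its own, which the command above does not do.

```shell
# gene_level FILE: print "sensitivity specificity" from the "gene level"
# row of an AUGUSTUS evaluation report saved in FILE.
gene_level() {
    awk -F'|' '/^gene level/ {
        gsub(/ /, "", $7)   # column 7: sensitivity
        gsub(/ /, "", $8)   # column 8: specificity
        print $7, $8
    }' "$1"
}
# usage:
#   gene_level firsttest.out
```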
When the self-trained paramecium parameters were used to predict ORFs in mRNAs of Paramecium tetraurelia (Pte.seq.uniq),
Code:
augustus --species=paramecium Pte.seq.uniq --extrinsicCfgFile=extrinsic.cfg --hintsfile=hints.gff > augustus.abinitio.gff
AUGUSTUS predicted 5146 CDS in 5102 of the 5230 mRNAs. Of the 5146 CDS, 3668 contain a start_codon and 4747 contain a stop_codon.
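Counts like these can be taken from the GFF output by tallying the feature column. The following is a sketch of one way to do it, not necessarily how the numbers above were obtained; count_feature is a hypothetical helper, and it counts feature lines, so a CDS split over several exons would be counted more than once.

```shell
# count_feature FILE FEATURE: number of non-comment GFF lines whose
# third (feature) column equals FEATURE.
count_feature() {
    awk -F'\t' -v f="$2" '!/^#/ && $3 == f { n++ } END { print n + 0 }' "$1"
}
# usage:
#   count_feature augustus.abinitio.gff start_codon
#   count_feature augustus.abinitio.gff stop_codon
```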
Thus, on the genomic test set the self-trained paramecium parameters gave clearly better results than the tetrahymena parameters (gene-level sensitivity 0.111 versus 0.0101).