Hi Katharina,
Sorry to bother you again.
I have run the gene prediction with the Scipio-based script that I mentioned at the beginning. Unfortunately, the results look quite strange:
1. The number of genes predicted with the Scipio method (<5000) is much lower than the number of proteins I provided for training (~20000). Moreover, this protein file was generated from transcriptome data.
2. The genes predicted with Scipio are very long: usually a scaffold has only one or two genes, and some scaffolds have no predicted gene at all.
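For reference, this is roughly how such numbers can be counted. It is only a minimal sketch, assuming the AUGUSTUS gff3 output is tab-separated with the feature type in column 3; the function name `summarize_genes` is made up for illustration.

```python
from collections import defaultdict


def summarize_genes(gff_lines):
    """Return {scaffold: [gene_length, ...]} for every 'gene' feature.

    Assumes tab-separated GFF3 lines; comment lines and other
    feature types (transcript, intron, CDS, ...) are skipped.
    """
    lengths = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        start, end = int(cols[3]), int(cols[4])
        lengths[cols[0]].append(end - start + 1)
    return dict(lengths)


def report(summary):
    """Print gene count and longest gene per scaffold, plus the total."""
    total = 0
    for scaffold, gene_lengths in sorted(summary.items()):
        total += len(gene_lengths)
        print(scaffold, len(gene_lengths), max(gene_lengths), sep="\t")
    print("total genes:", total)
```

It can be used as `report(summarize_genes(open("As_hints.gff3")))`. Scaffolds without any predicted gene do not appear in the summary, so they would have to be found by comparing against the scaffold names in the FASTA file.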
I do not know why the results are strange, but I have some thoughts:
1. There may have been problems in training: A. the training set has only 200 genes. B. while the training was running, many errors like the following were logged:
bucket Error: In sequence scaffold2_1457616-1485791: One CDS exon does not begin properly after the previous CDS exon.10193 >= 10194
GBProcessor::getGeneList(): Intron has non-positive length.
Encountered error after reading 14 annotations.
Error: In sequence scaffold33_762765-845370: One CDS exon does not begin properly after the previous CDS exon.79716 >= 79717
GBProcessor::getGeneList(): Intron has non-positive length.
Encountered error after reading 235 annotations.
Error: In sequence scaffold33_762765-845370: One CDS exon does not begin properly after the previous CDS exon.79716 >= 79717
GBProcessor::getGeneList(): Intron has non-positive length.
2. I wonder whether a single long gene predicted with the Scipio method could actually be several genes concatenated together.
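Regarding the training errors in point 1: if the two numbers in the message are the end of one CDS exon and the start of the next (10193 and 10194), the intron between them would have length 10194 - 10193 - 1 = 0, which would match "Intron has non-positive length". That is my interpretation, not something the log states. A minimal sketch for finding such cases before training, assuming the training genes are available as tab-separated GFF/GTF with `CDS` features and a `transcript_id` attribute in column 9 (the attribute pattern and function name are assumptions):

```python
import re
from collections import defaultdict

# Assumed attribute layout: transcript_id "t1"  or  transcript_id=t1
ID_PATTERN = re.compile(r'transcript_id[ =]"?([^";]+)')


def find_nonpositive_introns(gff_lines):
    """Return (seq, transcript, prev_end, next_start, intron_len) tuples
    for consecutive CDS exons that overlap or directly abut."""
    exons = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        match = ID_PATTERN.search(cols[8])
        if match is None:
            continue  # no transcript ID found, cannot group this exon
        exons[(cols[0], match.group(1))].append((int(cols[3]), int(cols[4])))

    problems = []
    for (seq, transcript), spans in exons.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            intron_len = next_start - prev_end - 1
            if intron_len <= 0:
                problems.append((seq, transcript, prev_end, next_start, intron_len))
    return problems
```

Transcripts reported by this check could be removed from the training set, or inspected by hand, before converting to GenBank format.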
Here is the AUGUSTUS command I ran:
augustus --species=As --alternatives-from-evidence=true --hintsfile=As.scipio.hints --extrinsicCfgFile=/work/student/yafei-mao/augustus-3.2.1/config/species/As/extrinsic.MP.cfg --protein=on --introns=on --cds=on --codingseq=on --gff3=on As_151101_pla-v1.2.4.sspace.gaploser.dupremove.ov2k.fa >As_hints.gff3
Here is the evaluation output for the training set:
******* Evaluation of gene prediction *******
---------------------------------------------\
| sensitivity | specificity |
---------------------------------------------|
nucleotide level | 0.885 | 0.809 |
---------------------------------------------/
----------------------------------------------------------------------------------------------------------\
| #pred | #anno | | FP = false pos. | FN = false neg. | | |
| total/ | total/ | TP |--------------------|--------------------| sensitivity | specificity |
| unique | unique | | part | ovlp | wrng | part | ovlp | wrng | | |
----------------------------------------------------------------------------------------------------------|
| | | | 643 | 556 | | |
exon level | 2012 | 1925 | 1369 | ------------------ | ------------------ | 0.711 | 0.68 |
| 2012 | 1925 | | 288 | 19 | 336 | 294 | 26 | 236 | | |
----------------------------------------------------------------------------------------------------------/
----------------------------------------------------------------------------\
transcript | #pred | #anno | TP | FP | FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level | 185 | 198 | 0 | 185 | 198 | 0 | 0 |
----------------------------------------------------------------------------/
------------------------------------------------------------------------\
UTR | total pred | CDS bnd. corr. | meanDiff | medianDiff |
------------------------------------------------------------------------|
TSS | 15 | 0 | -1 | -1 |
TTS | 181 | 0 | -1 | -1 |
------------------------------------------------------------------------|
UTR | uniq. pred | unique anno | sens. | spec. |
------------------------------------------------------------------------|
| true positive = 1 bound. exact, 1 bound. <= 20bp off |
UTR exon level | 0 | 0 | -nan | -nan |
------------------------------------------------------------------------|
UTR base level | 0 | 0 | -nan | -nan |
------------------------------------------------------------------------/
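As a sanity check on how to read this table: the exon-level values can be reproduced from the counts if sensitivity is TP / #anno and specificity is TP / #pred. This is a small sketch of that arithmetic; the function names are only for illustration.

```python
def sensitivity(true_positives, n_annotated):
    """Fraction of annotated features that were predicted exactly."""
    return true_positives / n_annotated


def specificity(true_positives, n_predicted):
    """Fraction of predicted features that match an annotation exactly."""
    return true_positives / n_predicted


# Exon level: 2012 predicted, 1925 annotated, 1369 true positives
exon_sn = sensitivity(1369, 1925)
exon_sp = specificity(1369, 2012)

# Gene level: 185 predicted, 198 annotated, 0 true positives
gene_sn = sensitivity(0, 198)
gene_sp = specificity(0, 185)

print(round(exon_sn, 3), round(exon_sp, 3), gene_sn, gene_sp)
```

So although about 71% of the annotated exons are predicted exactly, not a single one of the 198 training genes is predicted correctly as a whole.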
Here is one of the predicted genes from the gff3 file:
# Deleted 116 groups because some hint was not satisfiable.
# Predicted genes for sequence number 1 on both strands
# start gene g1
scaffold9 AUGUSTUS gene 1 418186 0.01 - . ID=g1
scaffold9 AUGUSTUS transcript 1 418186 0.01 - . ID=g1.t1;Parent=g1
scaffold9 AUGUSTUS intron 1 34111 0.07 - . Parent=g1.t1
...
...
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 0
# CDS exons: 0/81
# CDS introns: 0/81
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 2
# P: 2 (transcript_id_300547,transcript_id_300906)
# end gene g1
Here is part of my hints file:
...
scaffold852 Scipio CDS 36271 36882 0.787 + 0 source=P; grp=transcript_id_42537
scaffold185 Scipio CDS 757141 757984 0.336 + 0 source=P; grp=transcript_id_42679
scaffold179 Scipio CDS 36299 37129 0.850 + 0 source=P; grp=transcript_id_42761
scaffold1621 Scipio CDS 9503 9871 0.847 - 0 source=P; grp=transcript_id_3322167
scaffold1621 Scipio CDS 8336 9112 0.847 - 0 source=P; grp=transcript_id_3322167
scaffold60 Scipio CDS 1674214 1674258 0.614 + 0 source=P; grp=transcript_id_42954
scaffold60 Scipio CDS 1674453 1674474 0.614 + 0 source=P; grp=transcript_id_42954
scaffold60 Scipio CDS 1674508 1674527 0.614 + 2 source=P; grp=transcript_id_42954
...
In addition, I tried gene prediction without a hints file, and the number of predicted genes was below 5000 as well.
Finally, in my case I have the genomes of six related species and one outgroup species (which is also related to the others), but transcriptome data for only two of the species. So far I have used the transcriptome data of one species for gene prediction. From the resulting gff3 file I extracted the protein sequences, and I used this protein file for gene prediction in the other five genomes with the Scipio method. But the results, as described above, are quite strange.
For my case, can you recommend a method for gene prediction?
katharina wrote: Yes, you found the right thing. But that is not a script to convert Scipio training genes to hints, which is what you want to do. I recommend that you script that yourself. It should be very fast and easy, faster than running exonerate in any case.
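A minimal sketch of what such a conversion script might look like. It does not parse Scipio output itself, because that layout is not shown in this thread; it assumes the CDS coordinates have already been extracted into tuples of (seqname, start, end, strand, frame, group). The function names are hypothetical, and the output mimics the column layout of the hints file excerpt above.

```python
def to_hint_line(seqname, start, end, strand, frame, group,
                 score=".", feature="CDS", source="P"):
    """Format one CDS record as a tab-separated hint line.

    The attribute column follows the layout in the hints file above
    (source=P plus a grp= tag that ties exons of one protein together).
    """
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand {strand!r}")
    columns = [
        seqname, "Scipio", feature, str(start), str(end),
        str(score), strand, str(frame), f"source={source};grp={group}",
    ]
    return "\t".join(columns)


def records_to_hints(records):
    """Convert (seqname, start, end, strand, frame, group) tuples to
    hint lines, sorted by sequence name and start position."""
    ordered = sorted(records, key=lambda rec: (rec[0], rec[1]))
    return [to_hint_line(*rec) for rec in ordered]


if __name__ == "__main__":
    # Example records taken from the hints file excerpt above
    example = [
        ("scaffold1621", 9503, 9871, "-", 0, "transcript_id_3322167"),
        ("scaffold1621", 8336, 9112, "-", 0, "transcript_id_3322167"),
    ]
    print("\n".join(records_to_hints(example)))
```

The part that reads the Scipio training genes into these tuples still has to be written for whatever format the training genes are stored in.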