One question about the using Scipio to do gene prediction

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

Moderator: bioinf

Post Reply
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

One question about the using Scipio to do gene prediction

Post by Yafei »

Hi guys,

I am using the related specie's protein file to do gene prediction. According to the introduction (http://bioinf.uni-greifswald.de/augustu ... cipio.html), I built the pipeline like below:
perl scipio.1.4.1.pl --blat_output=As.prot.vs.genome.psl As.fa At.gff.aa > As.scipio.yaml
cat As.scipio.yaml | perl yaml2gff.1.4.pl>As.scipio.scipiogff
cat As.scipio.yaml | perl yaml2log.1.4.pl >As.scipio.log
perl scipiogff2gff.pl --in=As.scipio.scipiogff --out As.scipio.gff
perl gff2gbSmallDNA.pl As.scipio.gff As.fa 1500 As.genes.raw.gb
etraining --species=generic --stopCodonExcludedFromCDS=true As.genes.raw.gb 2> train.err
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
perl filterGenes.pl badgenes.lst As.genes.raw.gb > As.genes.gb
grep -c "LOCUS" As.genes.raw.gb As.genes.gb
perl randomSplit.pl As.genes.gb 200
perl new_species.pl --species=As
etraining --species=As As.genes.gb.train
ls -ort $AUGUSTUS_CONFIG_PATHspecies/As/
augustus --species=As.genes.gb.test | tee As.firsttest.out
grep -A 22 Evaluation As.firsttest.out
perl optimize_augustus.pl --species=As As.genes.gb.train
etraining --species=As As.genes.gb.train


Finally, after running the above commands, I am thinking about which commands, looking like below, is better for using?

augustus --species=As As.fa > As.gff3
augustus --species=As --hintsfile=As.scipio.gff As.fa >As.gff3


Thanks a lot,

Best,
Yafei
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: One question about the using Scipio to do gene prediction

Post by katharina »

The second command will not work because the gff file is not in hints format. You can easily convert it, though.

HInts format is described shortly in http://bioinf.uni-greifswald.de/augustu ... README.TXT in the section "USING HINTS".
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

Re: One question about the using Scipio to do gene prediction

Post by Yafei »

Hi Katharina,

if I convert the gff file to the right format, do you think the second one with right hints file is better than the first one for gene prediction?
katharina wrote:The second command will not work because the gff file is not in hints format. You can easily convert it, though.

HInts format is described shortly in http://bioinf.uni-greifswald.de/augustu ... README.TXT in the section "USING HINTS".
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: One question about the using Scipio to do gene prediction

Post by katharina »

If available, I add the evidence, sure.

Maybe you should run both. You need to examine results in a Browser, anyway, and then you can compare: scipio genes, ab initio predictions and predictions with scipio hints.
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

Re: One question about the using Scipio to do gene prediction

Post by Yafei »

Thanks again, BTW, are there some scripts for directly converting this scipio.gff file to hints file in the software scripts directory?
katharina wrote:If available, I add the evidence, sure.

Maybe you should run both. You need to examine results in a Browser, anyway, and then you can compare: scipio genes, ab initio predictions and predictions with scipio hints.
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

Re: One question about the using Scipio to do gene prediction

Post by Yafei »

Moreover, for the converting, I only need to add ';source=P' at the 9th column. Right?
Thanks a lot.

Yafei
Yafei wrote:Thanks again, BTW, are there some scripts for directly converting this scipio.gff file to hints file in the software scripts directory?
katharina wrote:If available, I add the evidence, sure.

Maybe you should run both. You need to examine results in a Browser, anyway, and then you can compare: scipio genes, ab initio predictions and predictions with scipio hints.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: One question about the using Scipio to do gene prediction

Post by katharina »

It depends.

source=P or src=P

is the identifier for the scoring system to be used from extrinsic.cfg file for incorporating hints from type "P" (which is typically protein, yes).

There are other possible flags to be added. Priority, e.g.:

pri=3

If you have different sources of evidence with different priority (because you believe one of them to be "more reliable"), you can give different priorities to those different sources. (I guess you don't have any, so you can skip pri flag.)

grp=someName

With grp you can group items together that belong together. This will be the case with your data. You want to have all CDS that belong to the same Scipio gene in one grp.

mult=10

Instead of group, particulary in case of RNA-Seq coverage data, you may specify the "multiplicity" that covers a particular hint (e.g. 10 reads cover a certain exonpart). This is most likely not applicable in your case.

I suggest to have the following content in the last column in your case:

source=P; grp=scipioGeneNames
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: One question about the using Scipio to do gene prediction

Post by katharina »

Yafei wrote:Thanks again, BTW, are there some scripts for directly converting this scipio.gff file to hints file in the software scripts directory?
I don't think so. We have scripts to generate hints from proteins with exonerate (instead of scipio).
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

Re: One question about the using Scipio to do gene prediction

Post by Yafei »

Hi Katharina,

Thanks your useful reply^^ Could you tell me what's name of the script you mentioned above is?

Yafei
katharina wrote:
Yafei wrote:Thanks again, BTW, are there some scripts for directly converting this scipio.gff file to hints file in the software scripts directory?
I don't think so. We have scripts to generate hints from proteins with exonerate (instead of scipio).
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

Re: One question about the using Scipio to do gene prediction

Post by Yafei »

For my case, what do you think of the two different ways (hints from scipio.gff or from proteins)? For my understanding, all of analysis are based on scipio method, so, maybe using scipio.gff to generate hints is better?
Thanks,^^,sorry for so many questions=-=

Best,
Yafei
Yafei wrote:Hi Katharina,

Thanks your useful reply^^ Could you tell me what's name of the script you mentioned above is?

Yafei
katharina wrote:
Yafei wrote:Thanks again, BTW, are there some scripts for directly converting this scipio.gff file to hints file in the software scripts directory?
I don't think so. We have scripts to generate hints from proteins with exonerate (instead of scipio).
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

Re: One question about the using Scipio to do gene prediction

Post by Yafei »

Hi Katharina,

I found the exonerate here (http://bioinf.uni-greifswald.de/bioinf/ ... teProteins)
Thanks,

Best,
Yafei
Yafei wrote:Hi Katharina,

Thanks your useful reply^^ Could you tell me what's name of the script you mentioned above is?

Yafei
katharina wrote:
Yafei wrote:Thanks again, BTW, are there some scripts for directly converting this scipio.gff file to hints file in the software scripts directory?
I don't think so. We have scripts to generate hints from proteins with exonerate (instead of scipio).
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: One question about the using Scipio to do gene prediction

Post by katharina »

Yes, you found the right thing. But that is not a script to convert scipio training genes to hints, which is what you want to do. I recommend you script that, yourself. It should be very fast and easy, faster than running exonerate, in any case.
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

Re: One question about the using Scipio to do gene prediction

Post by Yafei »

Hi Katharina,

I write a python script here:
L=[] #the files you need to convert
for i in L:
infile=open(i+'.scipio.gff','r')
outfile=open(i+'.scipio.hints','w')
for line in infile:
outfile.write(line[:line.index('tr')]+'source=P; grp='+line[line.index('tr'):line.index('id')+2]+'_'+line[line.index('"')+1:-2]+'\n')
infile.close()
outfile.close()


I hope it could be right and I would like to share^^
Thanks a lot~

Yafei
katharina wrote:Yes, you found the right thing. But that is not a script to convert scipio training genes to hints, which is what you want to do. I recommend you script that, yourself. It should be very fast and easy, faster than running exonerate, in any case.
Yafei
Posts: 11
Joined: Mon Nov 30, 2015 2:18 am

Re: One question about the using Scipio to do gene prediction

Post by Yafei »

Hi Katharina,

Sorry for bothering again=-=
I have done the gene prediction using the script with Scipio method, which I wrote at the beginning. Unfortunately, the result looks like pretty strange==:
1. Predicted gene numbers of Scipio method (<5000) is much less than the protein numbers which I provided for training (~20000). More, this protein file was generated by transcriptome methods.
2. The gene predicted by Scipio is pretty long, usually one scaffold only has one or two genes, sometimes the scaffolds do not have predicted gene.
I do not know why the result is strange, but I have some thoughts here:
1. There probably were some problems in training: A. the training set only has 200 gene. B. when the training is running,lots of error log look like below:
bucket Error: In sequence scaffold2_1457616-1485791: One CDS exon does not begin properly after the previous CDS exon.10193 >= 10194
GBProcessor::getGeneList(): Intron has non-positive length.
Encountered error after reading 14 annotations.
Error: In sequence scaffold33_762765-845370: One CDS exon does not begin properly after the previous CDS exon.79716 >= 79717
GBProcessor::getGeneList(): Intron has non-positive length.
Encountered error after reading 235 annotations.
Error: In sequence scaffold33_762765-845370: One CDS exon does not begin properly after the previous CDS exon.79716 >= 79717
GBProcessor::getGeneList(): Intron has non-positive length.

2. I'm wondering the one long gene predicted by Scipio method could concatenate several genes.

Here is my Augustus command for running:
augustus --species=As --alternatives-from-evidence=true --hintsfile=As.scipio.hints --extrinsicCfgFile=/work/student/yafei-mao/augustus-3.2.1/config/species/As/extrinsic.MP.cfg --protein=on --introns=on --cds=on --codingseq=on --gff3=on As_151101_pla-v1.2.4.sspace.gaploser.dupremove.ov2k.fa >As_hints.gff3

Here is my Evaluation information for training set:

******* Evaluation of gene prediction *******

---------------------------------------------\
| sensitivity | specificity |
---------------------------------------------|
nucleotide level | 0.885 | 0.809 |
---------------------------------------------/

----------------------------------------------------------------------------------------------------------\
| #pred | #anno | | FP = false pos. | FN = false neg. | | |
| total/ | total/ | TP |--------------------|--------------------| sensitivity | specificity |
| unique | unique | | part | ovlp | wrng | part | ovlp | wrng | | |
----------------------------------------------------------------------------------------------------------|
| | | | 643 | 556 | | |
exon level | 2012 | 1925 | 1369 | ------------------ | ------------------ | 0.711 | 0.68 |
| 2012 | 1925 | | 288 | 19 | 336 | 294 | 26 | 236 | | |
----------------------------------------------------------------------------------------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno | TP | FP | FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level | 185 | 198 | 0 | 185 | 198 | 0 | 0 |
----------------------------------------------------------------------------/

------------------------------------------------------------------------\
UTR | total pred | CDS bnd. corr. | meanDiff | medianDiff |
------------------------------------------------------------------------|
TSS | 15 | 0 | -1 | -1 |
TTS | 181 | 0 | -1 | -1 |
------------------------------------------------------------------------|
UTR | uniq. pred | unique anno | sens. | spec. |
------------------------------------------------------------------------|
| true positive = 1 bound. exact, 1 bound. <= 20bp off |
UTR exon level | 0 | 0 | -nan | -nan |
------------------------------------------------------------------------|
UTR base level | 0 | 0 | -nan | -nan |
------------------------------------------------------------------------/


Here is one of predicted gene information from gff3 file:

# Deleted 116 groups because some hint was not satisfiable.
# Predicted genes for sequence number 1 on both strands
# start gene g1
scaffold9 AUGUSTUS gene 1 418186 0.01 - . ID=g1
scaffold9 AUGUSTUS transcript 1 418186 0.01 - . ID=g1.t1;Parent=g1
scaffold9 AUGUSTUS intron 1 34111 0.07 - . Parent=g1.t1
...
...
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 0
# CDS exons: 0/81
# CDS introns: 0/81
# 5'UTR exons and introns: 0/0
# 3'UTR exons and introns: 0/0
# hint groups fully obeyed: 0
# incompatible hint groups: 2
# P: 2 (transcript_id_300547,transcript_id_300906)
# end gene g1


Here is part of my hints file :

...
scaffold852 Scipio CDS 36271 36882 0.787 + 0 source=P; grp=transcript_id_42537
scaffold185 Scipio CDS 757141 757984 0.336 + 0 source=P; grp=transcript_id_42679
scaffold179 Scipio CDS 36299 37129 0.850 + 0 source=P; grp=transcript_id_42761
scaffold1621 Scipio CDS 9503 9871 0.847 - 0 source=P; grp=transcript_id_3322167
scaffold1621 Scipio CDS 8336 9112 0.847 - 0 source=P; grp=transcript_id_3322167
scaffold60 Scipio CDS 1674214 1674258 0.614 + 0 source=P; grp=transcript_id_42954
scaffold60 Scipio CDS 1674453 1674474 0.614 + 0 source=P; grp=transcript_id_42954
scaffold60 Scipio CDS 1674508 1674527 0.614 + 2 source=P; grp=transcript_id_42954
...


More, I also tried the gene prediction without hints file and the predicted gene number are less than 5000 too=-=

Finally, in my case, I have six related species genomes and one outgroup specie genome (but it is also related with others), but I only have two species' transcriptome data. Now, I use transcriptome data of one species to do gene prediction. After I got the gff3 file, I extracted the protein file from gff3 file and I used this protein file to do other five genomes' gene prediction by Scipio method. But the results, which I described above, are pretty strange.
For my case, do you have any recommend methods to do gene prediction?
katharina wrote:Yes, you found the right thing. But that is not a script to convert scipio training genes to hints, which is what you want to do. I recommend you script that, yourself. It should be very fast and easy, faster than running exonerate, in any case.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: One question about the using Scipio to do gene prediction

Post by katharina »

1. Predicted gene numbers of Scipio method (<5000) is much less than the protein numbers which I provided for training (~20000). More, this protein file was generated by transcriptome methods.
I am not sure whether I understand. You say that you trained AUGUSTUS Scipio. Ok. And then you converted the training CDS (from Scipio to hints). Ok. But how is the protein file generated by a transcriptome method??? A protein file should not be generated by a transcriptome method, it should e.g. come from a database of reliably annotated genes.
2. The gene predicted by Scipio is pretty long, usually one scaffold only has one or two genes, sometimes the scaffolds do not have predicted gene.
You are talking about the training genes here? If yes, have you looked at them in a browser? Do they make sense? If not, you should probably not use for training. (It is normal that some scaffolds may not contain any genes.)
1. There probably were some problems in training: A. the training set only has 200 gene.
Yes, 200 genes are very few. I highly recommend you get more genes for training. Alternatively, consider using already trained parameters of another species. Or try self-training GeneMark.
2. I'm wondering the one long gene predicted by Scipio method could concatenate several genes.
Maybe. You should be able to see that from the Scipio output. You can also align the source proteins to genome and display them in an additional track, compare whether several source proteins correspond to one gene.

There is definitely something wrong with the output of your Evaluation. Transcript accuracy should not be 0! Are you sure that you set the stop codon excluded from cds flag, correctly? (You need to check in the training file whether stop is excluded or included, and then go to the parameter.cfg file of your species and set true or false, depending on the status of the training file.)

How closely related are your species? Is it really necessary to train AUGUSTUS for each species? If - for example - I had dove, chicken and sparrow, I would use chicken parameters for all three birds. If I have dog, human, horse, I will use human parameters for all mammals. It is often better to use one really well prepared parameter set, than several low quality sets.

You could use BRAKER1 to train AUGUSTUS for the two species for which you have transcriptome data. I would not transfrer AUGUSTUS predictions from one species to another using Scipio, in this case. Rather map the transcriptome data to the "foreign" species, directly, and generate hints that you can use for prediction. Do you have any other source of evidence to create training sets, except the transcriptome data of the two species? If you decide that you MUST train AUGUSTUS for each species, you could e.g. use CEGMA, or BUSCO, or a set of reliable proteins (i.e. not simply predicted by AUGUSTUS! You don't know how reliable those proteins are, yet!).
Post Reply