Comment on low gene level accuracy

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Post by katharina »

Originally posted in the old forum by Sammy on 05.07.2013 - 13:27
I trained AUGUSTUS for my species using protein data. The optimized parameters give better accuracy at the nucleotide and exon level (90.8 and 76.4, respectively), but the gene-level accuracy is only 30.35%, with SN 37.0 and SP 24.7. Could someone please comment on this low accuracy at the gene level?

by katharina on 09.07.2013 - 15:17
Your observation is quite common.
It is a lot easier to classify a single nucleotide correctly as "gene" or "not gene" than to predict an entire gene structure correctly (ALL nucleotides of a gene structure must be classified correctly for the gene to count as a true positive).
In my experience, 30% is not a very high average gene-level accuracy, though. 50% would be "good".
Katharina
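
To make the gap between the two levels concrete, here is a minimal Python sketch with made-up toy coordinates. It assumes the usual gene-finding definitions, sensitivity = TP / (TP + FN) and "specificity" = TP / (TP + FP), and it counts a predicted gene as a true positive only if all of its exons match a reference gene exactly. The function names and the data are illustrative only; this is not how AUGUSTUS itself computes its accuracy report.

```python
def accuracy(predicted, reference):
    """Return (sensitivity, specificity) for two sets of items.

    Sensitivity = TP / (TP + FN); "specificity" follows the
    gene-finding convention TP / (TP + FP).
    """
    tp = len(predicted & reference)
    sn = tp / len(reference) if reference else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp


def coding_positions(genes):
    """Collect every position covered by an exon of any gene."""
    positions = set()
    for gene in genes:
        for start, end in gene:
            positions.update(range(start, end + 1))
    return positions


# Toy data (hypothetical): each gene is a tuple of (start, end) exons,
# 1-based and inclusive.
reference = {
    ((100, 200), (300, 400)),
    ((1000, 1100), (1200, 1300)),
}
predicted = {
    ((100, 200), (300, 400)),      # exact match
    ((1000, 1100), (1200, 1290)),  # last exon ends 10 bp too early
}

nuc_sn, nuc_sp = accuracy(coding_positions(predicted), coding_positions(reference))
gene_sn, gene_sp = accuracy(predicted, reference)

print(f"nucleotide level: SN={nuc_sn:.3f} SP={nuc_sp:.3f}")
print(f"gene level:       SN={gene_sn:.3f} SP={gene_sp:.3f}")
```

In this toy case a single exon boundary that is off by 10 bp leaves the nucleotide-level values close to 1, while the gene-level values drop to 0.5, because the second gene no longer counts as correct.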

by Sammy on 09.07.2013 - 17:46
Thanks. I have already tried a different data set, but I still cannot get a gene-level accuracy above 31.2%. Please tell me what I should do to get better gene-level accuracy.

by katharina on 09.07.2013 - 22:03
Depending on your data situation, it might not be possible to achieve better accuracy in your case.
Things may change with a different assembly, more long-read gene expression data, a higher-quality transcriptome assembly ...
Katharina

by Sammy on 10.07.2013 - 01:24
Hi katharina,
Unfortunately, there is no other genome assembly for my organism, and I have only a very small EST set. Should I really give up on using different protein sets for training? We recently produced a Cufflinks assembly of the transcriptome of my organism. In that case, should I use PASA to create the training set? Is there a tutorial on using PASA to create training gene structures? How many assembled transcripts should I use to create the training gene structures? Are there specific criteria for choosing transcripts for training? Should I proceed with autoAug.pl? Sorry for asking so many questions. I have devoted so much time to AUGUSTUS and to generating hints, and I really want to use it for my work.
Thanks!
Sammy

by katharina on 10.07.2013 - 14:17
I don't really know which protein sets you have used so far.
In any case, I recommend that you try the following:

* use proteins of a related species and run Scipio with those proteins on your target genome
* run CEGMA with the core proteome
* run PASA (or similar) with ESTs/assembled RNA-Seq data

Then combine the full-length genes from all approaches into one non-redundant set and train AUGUSTUS on it.
Use as much data as possible (because your problem seems to be that the data is not sufficient), but take care that your final training gene set is non-redundant. Only use "perfect" gene structures for training (i.e. not those that are incomplete or "uncertain").
PASA has very good documentation: http://pasa.sourceforge.net/
autoAug.pl is not always fully compatible with the newest PASA versions. But I suspect that you will need to do a lot of "manual" work without autoAug.pl anyway.
In some cases, where the data situation allowed it, I selected training genes that had n-fold RNA-Seq coverage for training AUGUSTUS. Doing that leads to a bias towards highly expressed genes, though.
Katharina
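
As a rough illustration of the "combine in a non-redundant way" step, the following Python sketch merges several gene sets and keeps only one gene per locus. It is a sketch under assumptions: the `Gene` record, the `merge_nonoverlapping` function, and the coordinates are hypothetical, and redundancy is defined here purely as positional overlap on the same sequence and strand. It does not address sequence-level redundancy (near-identical proteins at different loci), which would need a separate similarity filter.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    seqid: str
    strand: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    source: str  # which approach produced the gene


def merge_nonoverlapping(*gene_sets):
    """Merge gene sets, keeping the first gene seen at each locus.

    Earlier sets take priority: a gene is dropped if it overlaps an
    already accepted gene on the same sequence and strand.
    """
    accepted = []
    for genes in gene_sets:
        for gene in sorted(genes, key=lambda g: (g.seqid, g.start)):
            clash = any(
                kept.seqid == gene.seqid
                and kept.strand == gene.strand
                and kept.start <= gene.end
                and gene.start <= kept.end
                for kept in accepted
            )
            if not clash:
                accepted.append(gene)
    return accepted


# Toy input (made-up coordinates), listed in order of priority.
scipio = [Gene("chr1", "+", 100, 900, "scipio"), Gene("chr2", "-", 50, 400, "scipio")]
cegma = [Gene("chr1", "+", 850, 1500, "cegma"), Gene("chr3", "+", 10, 700, "cegma")]
pasa = [Gene("chr2", "-", 500, 1200, "pasa"), Gene("chr2", "+", 60, 300, "pasa")]

merged = merge_nonoverlapping(scipio, cegma, pasa)
for gene in merged:
    print(gene)
```

Here the CEGMA gene on chr1 is dropped because it overlaps a Scipio gene on the same strand, while the PASA gene on the opposite strand of chr2 is kept. The pairwise comparison is quadratic in the number of genes, which should be acceptable for training sets of a few thousand genes.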

by Sammy on 18.08.2013 - 21:24
Hi Katharina,
As you suggested, I have now created training sets with CEGMA and PASA. Could you please tell me how I can select the full-length gene structures from these sets?
Thanks.
Sammy

by katharina on 21.08.2013 - 15:35
Hi Sammy, for the CEGMA genes, please follow the tutorial at http://bioinf.uni-greifswald.de/bioinf/ ... MATraining . If you used the autoAug.pl pipeline to run PASA, the full-length gene structures should already be extracted from the PASA results. Otherwise, please have a look at the commands in http://bioinf.uni-greifswald.de/augustu ... autoAug.pl
For joining both gene sets, please read http://bioinf.uni-greifswald.de/bioinf/ ... iningGenes

by Sammy on 22.08.2013 - 18:32
Thanks! I ran PASA separately, not through autoAug.pl. I think the set itself is fine, but I am not sure whether my training set contains full-length gene structures. Do you have any suggestions for how to identify the full-length gene structures in my PASA set?

by katharina on 25.08.2013 - 18:09
The commands for finding complete gene structures are listed in the autoAug.pl script that I linked before. Here is the source code, again:

Code:

# find complete genes in candidate training file
if (!uptodate((["trainingSetCandidates.gff"] or ["trainingSetCandidates.gff3"]), ["trainingSetComplete.gff"])){
    (...)
    # old PASA versions (at least before January 2011, probably older) produce different output files than new PASA versions:
    if(-e "../pasa/trainingSetCandidates.fasta"){
        $cmdString = 'grep complete ../pasa/trainingSetCandidates.fasta | perl -pe \'s/>(\S+).*/$1\$/\'';
    }else{
        $cmdString = 'grep complete ../pasa/trainingSetCandidates.cds | perl -pe \'s/>(\S+).*/$1\$/\'';
    }
    print "3 $cmdString 1> pasa.complete.lst\n" if ($verbose>=3);
    system("$cmdString 1> pasa.complete.lst")==0 or die("\nfailed to execute $! \n");
    if (! -e "pasa.complete.lst" || -z "pasa.complete.lst"){
        die ("PASA has not constructed any complete training gene. Training aborted because of insufficient data.\n");
    }
    # old PASA versions (at least before January 2011, probably older) produce different output files than new PASA versions:
    if(-e "../pasa/trainingSetCandidates.gff"){
        $cmdString="grep -f pasa.complete.lst ../pasa/trainingSetCandidates.gff >trainingSetComplete.temp.gff";
    }else{
        $cmdString="grep -f pasa.complete.lst ../pasa/trainingSetCandidates.gff3 >trainingSetComplete.temp.gff";
    }
    print "2 Running \"$cmdString\" ..." if ($verbose>=2);
    system("$cmdString")==0 or die("\nfailed to execute $! \n");
    print " Finished!\n" if ($verbose>=2);
Katharina
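
For readers who prefer to see the two grep steps spelled out, here is a small Python sketch of the same idea: collect the IDs of all records whose FASTA header contains the word "complete", then keep only the GFF lines that end with one of those IDs. The function names, the header layout, and the GFF lines below are made-up toy data, not actual PASA output, so please check your own files before relying on this.

```python
def complete_ids(fasta_lines):
    """Return the IDs of FASTA records whose header contains 'complete'.

    Like `grep complete`, this is a plain substring match on the header;
    the ID is the first whitespace-delimited token after '>'.
    """
    ids = set()
    for line in fasta_lines:
        if line.startswith(">") and "complete" in line:
            ids.add(line[1:].split()[0])
    return ids


def filter_gff(gff_lines, ids):
    """Keep GFF lines that end with one of the given IDs.

    This mirrors the end-of-line anchor ('$') that the Perl one-liner
    appends to every ID before the list is passed to `grep -f`.
    """
    suffixes = tuple(ids)
    return [line for line in gff_lines if line.rstrip().endswith(suffixes)]


# Toy input (hypothetical header and attribute layout).
fasta = [
    ">asmbl_1 type:complete len:300",
    "ATGGCCTAA",
    ">asmbl_2 type:5prime_partial len:120",
    "GCCGCCTAA",
    ">asmbl_3 type:complete len:450",
    "ATGAAATAG",
]
gff = [
    "chr1\ttoy\tCDS\t100\t400\t.\t+\t0\tasmbl_1",
    "chr1\ttoy\tCDS\t900\t1020\t.\t-\t0\tasmbl_2",
    "chr2\ttoy\tCDS\t50\t500\t.\t+\t0\tasmbl_3",
]

ids = complete_ids(fasta)
kept = filter_gff(gff, ids)
print(sorted(ids))
print(len(kept), "of", len(gff), "GFF lines kept")
```

As with the grep version, an empty ID list means that no complete gene structures were found, and in that case nothing is kept.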