Originally posted by Julie in the old forum on 15.06.2012 - 04:43
Hi all,
I am using AUGUSTUS with different pipelines to annotate a recently assembled plant genome, and now I would like to compare the resulting annotations.
Is there a simple way to calculate some statistics about the annotation? I would like to know how many features (CDS, exon, complete or partial gene) were annotated in the GFF file after running AUGUSTUS.
Best
measuring accuracy
Re: measuring accuracy
Originally posted by Mario in the old forum on 15.06.2012 - 11:08
The following commands count the number of coding exons, genes, transcripts, and genes/gene fragments that are complete at the 5'-end or 3'-end, respectively:
Code:
grep -Pc "\tCDS\t" aug.gff
grep -Pc "\tgene\t" aug.gff
grep -Pc "\ttranscript\t" aug.gff
grep -Pc "\tstart_codon\t" aug.gff
grep -Pc "\tstop_codon\t" aug.gff
A gene has a complete coding region if both start_codon and stop_codon are predicted.
Code:
grep codon aug.gff | perl -pe 's/.*transcript_id "([^"]+)".*/$1/' > codons.txt
This gets you a file with the transcript IDs, in which the complete genes are listed twice, as they have two codon lines.
Then you can do, e.g.,
Code:
perl -ne '$s{$_}++; print if ($s{$_}==2)' codons.txt | wc -l
to get the number of complete transcripts.
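The same counts can also be sketched in a single pass with awk. This is only a minimal sketch, assuming the tab-separated GTF-style AUGUSTUS output used above, where column 3 holds the feature type and the codon lines carry a transcript_id "..." attribute in column 9. The function names count_features and count_complete_transcripts are made up for illustration.

```shell
#!/bin/sh
# Sketch under assumptions: input is tab-separated with 9 columns,
# feature type in column 3, attributes in column 9.
# Hypothetical usage:
#   count_features aug.gff
#   count_complete_transcripts aug.gff

# Tally every feature type (column 3) in one pass, skipping comment lines.
count_features() {
    awk -F '\t' '
        !/^#/ && NF >= 9 { n[$3]++ }
        END { for (f in n) print f "\t" n[f] }
    ' "$1" | sort
}

# Count transcripts that have both a start_codon and a stop_codon line.
count_complete_transcripts() {
    awk -F '\t' '
        $3 == "start_codon" || $3 == "stop_codon" {
            # Extract the id from: transcript_id "ID"
            # (the prefix up to and including the opening quote is 15 characters)
            if (match($9, /transcript_id "[^"]+"/)) {
                id = substr($9, RSTART + 15, RLENGTH - 16)
                seen[id, $3] = 1
            }
        }
        END {
            for (k in seen) {
                split(k, p, SUBSEP)
                if (p[2] == "start_codon" && ((p[1], "stop_codon") in seen)) c++
            }
            print c + 0
        }
    ' "$1"
}
```

Unlike counting transcript IDs that occur twice, the second function checks for one start_codon and one stop_codon explicitly, so a transcript with a duplicated start_codon line would not be counted as complete.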
Re: measuring accuracy
Originally posted by Julie in the old forum on 18.06.2012 - 03:03
Many thanks