measuring accuracy

Post by **katharina** » Wed Nov 18, 2015 6:41 pm

Originally posted by Julie in the old forum on 15.06.2012 - 04:43

Hi all,
I am using augustus with different pipelines to annotate a recently assembled plant genome, and now I would like to compare them.
Is there a simple way to calculate some statistics about the annotation? I would like to know how many features (CDS, exon, complete or partial gene) were annotated in the GFF file after running AUGUSTUS.
Best

Post by **katharina** » Wed Nov 18, 2015 6:41 pm

Originally posted by Mario in the old forum on 15.06.2012 - 11:08

Code:

grep -Pc "\tCDS\t" aug.gff
grep -Pc "\tgene\t" aug.gff
grep -Pc "\ttranscript\t" aug.gff
grep -Pc "\tstart_codon\t" aug.gff
grep -Pc "\tstop_codon\t" aug.gff

These commands count the number of coding exons, genes, transcripts, and genes/gene fragments that are complete at the 5' end or the 3' end, respectively.
A gene has a complete coding region if both a start_codon and a stop_codon are predicted.
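The per-feature counts can also be gathered in one pass over the file. This is only a sketch: `count_features` is a made-up helper name, and it assumes a tab-separated GFF with the feature type in column 3 and comment lines starting with `#`.

```shell
# Hypothetical helper: tally every feature type (column 3) in one pass.
# Assumes tab-separated GFF; lines starting with '#' are skipped.
count_features() {
    awk -F'\t' '
        !/^#/ && NF >= 3 { n[$3]++ }                  # count each feature type
        END { for (f in n) print f "\t" n[f] }        # one "type<TAB>count" line per type
    ' "$1" | LC_ALL=C sort                            # stable, locale-independent order
}
```

Call it as `count_features aug.gff`. It prints one line per feature type found in the file, so types that were not predicted simply do not appear.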

Code:

cat aug.gff | grep codon | perl -pe 's/.*transcript_id "([^"]+)".*/$1/' > codons.txt

This gets you a file with the transcript IDs, in which the complete genes
are listed twice, as they have two codon lines (one start_codon and one stop_codon).
Then you can do, e.g.,

Code:

cat codons.txt | perl -ne '$s{$_}++; print if ($s{$_}==2)' | wc -l

to get the number of complete transcripts.
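A variant of the same idea, sketched with awk alone: instead of counting two codon lines per transcript, it checks that a transcript has at least one start_codon and at least one stop_codon line. That would matter only if a file ever carried two lines of the same codon type for one transcript. The name `count_complete` is made up, and the sketch assumes the `transcript_id "..."` attribute sits in column 9 of a tab-separated file.

```shell
# Hypothetical helper: count transcripts that have both a start_codon
# and a stop_codon line. Assumes tab-separated GFF with the attribute
# transcript_id "ID" in column 9.
count_complete() {
    awk -F'\t' '
        $3 == "start_codon" || $3 == "stop_codon" {
            if (match($9, /transcript_id "[^"]+"/)) {
                # skip the 15-character prefix and drop the closing quote
                id = substr($9, RSTART + 15, RLENGTH - 16)
                ids[id] = 1
                seen[id, $3] = 1
            }
        }
        END {
            for (id in ids)
                if ((id, "start_codon") in seen && (id, "stop_codon") in seen)
                    c++
            print c + 0                                # prints 0 if none are complete
        }
    ' "$1"
}
```

Call it as `count_complete aug.gff`; it prints a single number.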

Post by **katharina** » Wed Nov 18, 2015 6:42 pm

Originally posted by Julie in the old forum on 18.06.2012 - 03:03

Many thanks!

AUGUSTUS Forum
