Creating gene structures for training AUGUSTUS with CEGMA

CEGMA is a software tool developed by Genis Parra and Keith Bradnam for finding the gene structures of "core proteins" in genomic sequences. These gene structures can be used for training AUGUSTUS. The only step required to feed the gene structures from CEGMA into AUGUSTUS training (e.g. via the training web service at http://bioinf.uni-greifswald.de/webaugustus) is a small format change.

Software requirements:

This protocol was tested with the following versions:

CEGMA v.2.4.010312 available at http://korflab.ucdavis.edu/datasets/cegma/
cegma2gff.pl (available on request from augustus-web[at]uni-greifswald.de, should be included in the next AUGUSTUS release)

Run CEGMA with the default protein set

cegma --genome genome.fa

Change format of CEGMA gff output

The output of CEGMA looks like this:

CHROMOSOME_I    cegma   First   636237  636337  38.40   -       0       KOG0328.7
CHROMOSOME_I    cegma   Exon    636237  636337  38.40   -       .       KOG0328.7
CHROMOSOME_I    cegma   Internal        633894  634800  641.47  -       1       KOG0328.7
CHROMOSOME_I    cegma   Exon    633894  634800  641.47  -       .       KOG0328.7
CHROMOSOME_I    cegma   Internal        632564  632587  -9.15   -       0       KOG0328.7
CHROMOSOME_I    cegma   Exon    632564  632587  -9.15   -       .       KOG0328.7
CHROMOSOME_I    cegma   Terminal        631097  631288  138.28  -       0       KOG0328.7
CHROMOSOME_I    cegma   Exon    631097  631288  138.28  -       .       KOG0328.7
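
Before converting the file, it can help to check how many genes and exons CEGMA reported. The following is a minimal Python sketch, not part of CEGMA or AUGUSTUS. It assumes the file is tab-separated with nine columns, as in the example above, and that the last column holds the CEGMA gene ID; the function name exons_per_gene is made up for this illustration.

```python
from collections import defaultdict

# The example rows from above, written with explicit tab separators.
SAMPLE = (
    "CHROMOSOME_I\tcegma\tFirst\t636237\t636337\t38.40\t-\t0\tKOG0328.7\n"
    "CHROMOSOME_I\tcegma\tExon\t636237\t636337\t38.40\t-\t.\tKOG0328.7\n"
    "CHROMOSOME_I\tcegma\tInternal\t633894\t634800\t641.47\t-\t1\tKOG0328.7\n"
    "CHROMOSOME_I\tcegma\tExon\t633894\t634800\t641.47\t-\t.\tKOG0328.7\n"
    "CHROMOSOME_I\tcegma\tInternal\t632564\t632587\t-9.15\t-\t0\tKOG0328.7\n"
    "CHROMOSOME_I\tcegma\tExon\t632564\t632587\t-9.15\t-\t.\tKOG0328.7\n"
    "CHROMOSOME_I\tcegma\tTerminal\t631097\t631288\t138.28\t-\t0\tKOG0328.7\n"
    "CHROMOSOME_I\tcegma\tExon\t631097\t631288\t138.28\t-\t.\tKOG0328.7\n"
)


def exons_per_gene(lines):
    """Group the (start, end) pairs of all 'Exon' rows by the gene ID in column 9."""
    genes = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip blank or malformed rows and every feature other than Exon.
        if len(cols) != 9 or cols[2] != "Exon":
            continue
        genes[cols[8]].append((int(cols[3]), int(cols[4])))
    return dict(genes)


for gene_id, exons in exons_per_gene(SAMPLE.splitlines()).items():
    # GFF coordinates are 1-based and inclusive, hence the +1.
    total = sum(end - start + 1 for start, end in exons)
    print(f"{gene_id}: {len(exons)} exons, {total} bp")
```

For a real run, the lines of output.cegma.gff could be passed in place of SAMPLE.splitlines(), for example by iterating over an opened file handle.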

For training AUGUSTUS, keep only the Exon lines, change the feature in the third column from Exon to CDS, and replace the last column with the field transcript_id "someUniqueID". The ID must be unique for each gene and identical on all CDS lines of that gene. The result may look like this, for example:

CHROMOSOME_I    cegma   CDS    636237  636337  38.40   -       .       transcript_id "g1.KOG0328.7"
CHROMOSOME_I    cegma   CDS    633894  634800  641.47  -       .       transcript_id "g1.KOG0328.7"
CHROMOSOME_I    cegma   CDS    632564  632587  -9.15   -       .       transcript_id "g1.KOG0328.7"
CHROMOSOME_I    cegma   CDS    631097  631288  138.28  -       .       transcript_id "g1.KOG0328.7"

The script cegma2gff.pl performs these format changes:

cegma2gff.pl output.cegma.gff > augustus-training.gff

The resulting GFF file can then be submitted, for example, to the AUGUSTUS training web service.
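
If cegma2gff.pl is not at hand, the format change described in this section can be approximated in a few lines of Python. This is a sketch of the conversion as described here, not a reimplementation of cegma2gff.pl, whose exact behaviour is not shown on this page. The function name cegma_to_training_gff is hypothetical, and the "g<number>." prefix, numbered in order of first appearance, is an assumption modelled on the example ID "g1.KOG0328.7".

```python
def cegma_to_training_gff(lines):
    """Yield AUGUSTUS training GFF lines built from the 'Exon' rows of CEGMA GFF lines."""
    gene_numbers = {}  # CEGMA gene ID -> running number (assumed ID scheme)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Keep only well-formed Exon rows; all other features are dropped.
        if len(cols) != 9 or cols[2] != "Exon":
            continue
        gene_id = cols[8]
        # A gene gets the next free number the first time it is seen.
        number = gene_numbers.setdefault(gene_id, len(gene_numbers) + 1)
        cols[2] = "CDS"
        cols[8] = f'transcript_id "g{number}.{gene_id}"'
        yield "\t".join(cols)


# Two rows taken from the CEGMA example output above.
sample = [
    "CHROMOSOME_I\tcegma\tFirst\t636237\t636337\t38.40\t-\t0\tKOG0328.7",
    "CHROMOSOME_I\tcegma\tExon\t636237\t636337\t38.40\t-\t.\tKOG0328.7",
]
for converted in cegma_to_training_gff(sample):
    print(converted)
```

Only the Exon row survives, now labelled CDS and carrying the transcript_id field. To convert a whole file, the generator could be fed an opened output.cegma.gff and its results written line by line to augustus-training.gff.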

Creating hints from CEGMA output

Read the instructions about hints that come with the AUGUSTUS release carefully and adapt the priority (pri) and source (src) tags to your individual needs. A command similar to the following generates hints for AUGUSTUS from CEGMA output:

perl -ne 'chomp; @t = split(/\t/); print "$t[0]\t$t[1]\tCDS\t$t[3]\t$t[4]\t.\t$t[6]\t$t[7]\tsrc=P;pri=5;grp=$t[8]\n" if $t[2] eq "Exon";' output.cegma.gff > cegma.hints

The output (cegma.hints) needs to look similar to this:

NC_006070       cegma   CDS     734398  734403  .       -       .       src=P;pri=5;grp=KOG0002.2
NC_006070       cegma   CDS     733507  733656  .       -       .       src=P;pri=5;grp=KOG0002.2
NC_006071       cegma   CDS     2264663 2265808 .       +       .       src=P;pri=5;grp=KOG0003.19
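
The same transformation can be sketched in Python, which makes the src and pri values easy to adjust. This is an illustrative sketch only: the function name cegma_to_hints and its parameters are assumptions for this example, and the defaults simply mirror the src=P and pri=5 values used in the command above.

```python
def cegma_to_hints(lines, src="P", pri=5):
    """Yield AUGUSTUS CDS hint lines built from the 'Exon' rows of CEGMA GFF lines."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Compare the feature column exactly, so that only Exon rows are used.
        if len(cols) != 9 or cols[2] != "Exon":
            continue
        seqname, source, _feature, start, end, _score, strand, frame, gene_id = cols
        attributes = f"src={src};pri={pri};grp={gene_id}"
        # The score column is replaced by ".", as in the command above.
        yield "\t".join([seqname, source, "CDS", start, end, ".", strand, frame, attributes])


# Two rows taken from the CEGMA example output earlier on this page.
sample = [
    "CHROMOSOME_I\tcegma\tFirst\t636237\t636337\t38.40\t-\t0\tKOG0328.7",
    "CHROMOSOME_I\tcegma\tExon\t636237\t636337\t38.40\t-\t.\tKOG0328.7",
]
for hint in cegma_to_hints(sample):
    print(hint)
```

All exons of one CEGMA gene share the same grp value, so they are grouped into one hint group. Which src and pri values are appropriate depends on your extrinsic configuration, as described in the hints instructions of the AUGUSTUS release.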

No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact: augustus-web@uni-greifswald.de