Creating gene structures for training AUGUSTUS with CEGMA

CEGMA is a software tool developed by Genis Parra and Keith Bradnam for finding the gene structures of "core proteins" in genomic sequences. These gene structures can be used for training AUGUSTUS. The only thing required to feed the gene structures from CEGMA into AUGUSTUS training (e.g. via the training web service at http://bioinf.uni-greifswald.de/webaugustus) is a small format change.

Software requirements: This protocol was tested with the following versions:
Run CEGMA with the default protein set

    cegma --genome genome.fa

Change format of CEGMA gff output

The output of CEGMA looks like this:

    CHROMOSOME_I  cegma  First     636237  636337   38.40  -  0  KOG0328.7
    CHROMOSOME_I  cegma  Exon      636237  636337   38.40  -  .  KOG0328.7
    CHROMOSOME_I  cegma  Internal  633894  634800  641.47  -  1  KOG0328.7
    CHROMOSOME_I  cegma  Exon      633894  634800  641.47  -  .  KOG0328.7
    CHROMOSOME_I  cegma  Internal  632564  632587   -9.15  -  0  KOG0328.7
    CHROMOSOME_I  cegma  Exon      632564  632587   -9.15  -  .  KOG0328.7
    CHROMOSOME_I  cegma  Terminal  631097  631288  138.28  -  0  KOG0328.7
    CHROMOSOME_I  cegma  Exon      631097  631288  138.28  -  .  KOG0328.7

For training AUGUSTUS, you need to modify the third column to contain the feature CDS instead of Exon, and you need to modify the last column to contain a transcript_id field. As the example shows, only the Exon lines are carried over; the First, Internal and Terminal lines do not appear in the result:

    CHROMOSOME_I  cegma  CDS  636237  636337   38.40  -  .  transcript_id "g1.KOG0328.7"
    CHROMOSOME_I  cegma  CDS  633894  634800  641.47  -  .  transcript_id "g1.KOG0328.7"
    CHROMOSOME_I  cegma  CDS  632564  632587   -9.15  -  .  transcript_id "g1.KOG0328.7"
    CHROMOSOME_I  cegma  CDS  631097  631288  138.28  -  .  transcript_id "g1.KOG0328.7"

The script cegma2gff.pl performs this conversion:

    cegma2gff.pl output.cegma.gff > augustus-training.gff

The resulting GFF file can, for example, be submitted to the AUGUSTUS training web service.

Creating hints from CEGMA output

Read the instructions about hints that come with the AUGUSTUS release carefully and adapt the priority (pri) and the src tag to your individual needs. A command similar to the following generated hints for AUGUSTUS from CEGMA output:

    cat output.cegma.gff | grep Exon | perl -ne '@t = split(/\t/); print "$t[0]\t$t[1]\tCDS\t$t[3]\t$t[4]\t.\t$t[6]\t$t[7]\tsrc=P;pri=5;grp=$t[8]";' > cegma.hints

The output (cegma.hints) should look similar to this:

    NC_006070  cegma  CDS   734398   734403  .  -  .  src=P;pri=5;grp=KOG0002.2
    NC_006070  cegma  CDS   733507   733656  .  -  .  src=P;pri=5;grp=KOG0002.2
    NC_006071  cegma  CDS  2264663  2265808  .  +  .  src=P;pri=5;grp=KOG0003.19

No warranty for completeness or ability to run.
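The two conversions described above (Exon lines to CDS lines with a transcript_id for training, and Exon lines to CDS hints) can be sketched in Python. This is only an illustration of the format change, not the implementation of cegma2gff.pl. The function names (parse_cegma_exons, to_training_gff, to_hints) and the gene_prefix parameter are hypothetical, and the way the "g1." prefix is numbered (a running counter per KOG identifier) is an assumption, since the protocol does not state how cegma2gff.pl assigns it. Unlike the grep in the command above, the sketch matches Exon in the third column only.

```python
from typing import Iterable, Iterator, List


def parse_cegma_exons(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the nine tab-separated fields of every 'Exon' line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) == 9 and fields[2] == "Exon":
            yield fields


def to_training_gff(lines: Iterable[str], gene_prefix: str = "g") -> List[str]:
    """Rewrite Exon lines as CDS lines with a transcript_id attribute."""
    # ASSUMPTION: genes are numbered in order of first appearance of their
    # KOG identifier; the real cegma2gff.pl may number them differently.
    gene_numbers = {}
    out = []
    for seq, source, _feat, start, end, score, strand, frame, kog in parse_cegma_exons(lines):
        number = gene_numbers.setdefault(kog, len(gene_numbers) + 1)
        attribute = f'transcript_id "{gene_prefix}{number}.{kog}"'
        out.append("\t".join([seq, source, "CDS", start, end, score, strand, frame, attribute]))
    return out


def to_hints(lines: Iterable[str], src: str = "P", pri: int = 5) -> List[str]:
    """Rewrite Exon lines as CDS hints, mirroring the Perl one-liner."""
    out = []
    for seq, source, _feat, start, end, _score, strand, frame, kog in parse_cegma_exons(lines):
        attribute = f"src={src};pri={pri};grp={kog}"
        # The score column is replaced by '.', as in the hints example.
        out.append("\t".join([seq, source, "CDS", start, end, ".", strand, frame, attribute]))
    return out


# First two exons of the CEGMA example above, with real tab separators.
SAMPLE = "\n".join([
    "CHROMOSOME_I\tcegma\tFirst\t636237\t636337\t38.40\t-\t0\tKOG0328.7",
    "CHROMOSOME_I\tcegma\tExon\t636237\t636337\t38.40\t-\t.\tKOG0328.7",
    "CHROMOSOME_I\tcegma\tInternal\t633894\t634800\t641.47\t-\t1\tKOG0328.7",
    "CHROMOSOME_I\tcegma\tExon\t633894\t634800\t641.47\t-\t.\tKOG0328.7",
])

if __name__ == "__main__":
    for row in to_training_gff(SAMPLE.splitlines()):
        print(row)
    for row in to_hints(SAMPLE.splitlines()):
        print(row)
```

To apply the sketch to a real file, the lines of output.cegma.gff can be passed in directly, for example `to_training_gff(open("output.cegma.gff"))`. The src and pri defaults simply repeat the values from the command above and should be adapted in the same way.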
No responsibility for links to external web pages. Contact: augustus-web@uni-greifswald.de