Training AUGUSTUS UTR parameters

UTR parameters are of particular importance for integrating RNA-seq evidence.

If you executed autoAug.pl (offline or via http://bioinf.uni-greifswald.de/webaugustus) with a cDNA file and a genome file as input, and autoAug.pl did not issue any warnings or error messages, you do not need to repeat UTR training manually, because you already have UTR parameters for your species! If you executed autoAug.pl with a protein file and a genome file as input, no UTR training examples were generated and thus no UTR parameters were trained.

Training gene structure file format

The input file for UTR training as described in this tutorial must contain CDS, 5'-UTR, and 3'-UTR features in GFF format. It does not matter which source is given in column 2 (here manual and AUGUSTUS). Example (columns are separated by tabs in the actual file):

    scaffold1 manual 5'-UTR 2530693 2530772 . + 0 transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
    scaffold1 AUGUSTUS CDS 2530773 2530830 0.76 + 0 transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
    scaffold1 AUGUSTUS CDS 2530893 2531019 1 + 2 transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
    scaffold1 AUGUSTUS CDS 2531114 2531422 1 + 1 transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
    scaffold1 AUGUSTUS CDS 2531483 2531588 0.98 + 1 transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
    scaffold1 manual 3'-UTR 2531592 2531937 . + 0 transcript_id "au2.g1000.t1"; gene_id "au2.g1000";

The training gene structure file does not need to contain UTR examples for all genes, but UTRs must be present for a sufficiently large number of genes.

Extracting genes for which both UTR examples are present

    cat genes.gff | perl -ne 'print if s/.*\t(\S+UTR)\t.*transcript_id \"(\S+)\".*/$2\t$1/;' | sort -u | perl -ne '@f = split; print "$f[0]\n" if ($g eq $f[0]); $g = $f[0];' > bothutr.lst

Be aware that the filtering command above relies on the structure of the last column of the GFF file! This command creates a list of all genes (identified by transcript ID) for which both UTRs are present. Format:

    au2.g1000.t1
    au2.g1037.t1
    au2.g1038.t1
    ...
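The same selection can also be expressed with awk. The following is a minimal sketch, not part of the AUGUSTUS scripts: it builds a small toy GFF file and lists the transcripts that carry both a 5'-UTR and a 3'-UTR feature. The file name toy.gff, the temporary directory, and the coordinates of the second transcript are made up for illustration. Like the Perl pipeline, it assumes tab-separated columns and a transcript_id "..." entry in the last column.

```shell
#!/bin/sh
# Toy input in a temporary directory (hypothetical data for illustration).
workdir=$(mktemp -d)
gff="$workdir/toy.gff"
a1='transcript_id "au2.g1000.t1"; gene_id "au2.g1000";'
a2='transcript_id "au2.g1037.t1"; gene_id "au2.g1037";'

# au2.g1000.t1 has both UTRs; au2.g1037.t1 (made-up coordinates) has only a 5'-UTR.
{
  printf "scaffold1\tmanual\t5'-UTR\t2530693\t2530772\t.\t+\t0\t%s\n" "$a1"
  printf "scaffold1\tAUGUSTUS\tCDS\t2530773\t2530830\t0.76\t+\t0\t%s\n" "$a1"
  printf "scaffold1\tmanual\t3'-UTR\t2531592\t2531937\t.\t+\t0\t%s\n" "$a1"
  printf "scaffold1\tmanual\t5'-UTR\t2600001\t2600080\t.\t+\t0\t%s\n" "$a2"
  printf "scaffold1\tAUGUSTUS\tCDS\t2600081\t2600200\t1\t+\t0\t%s\n" "$a2"
} > "$gff"

# Record which UTR types occur per transcript, then print the transcripts
# for which both a 5-prime and a 3-prime UTR were seen.
awk -F'\t' '
  $3 ~ /UTR$/ && match($9, /transcript_id "[^"]+"/) {
    # Skip the 15 characters of the prefix and drop the closing quote.
    id = substr($9, RSTART + 15, RLENGTH - 16)
    if ($3 ~ /^5/) five[id] = 1
    if ($3 ~ /^3/) three[id] = 1
  }
  END { for (id in five) if (id in three) print id }
' "$gff" | sort > "$workdir/bothutr.lst"

cat "$workdir/bothutr.lst"   # prints: au2.g1000.t1
```

To use this on real data, the awk command would read genes.gff instead of the toy file and write to bothutr.lst in the working directory.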
Create a training file in genbank format

    gff2gbSmallDNA.pl genes.gff genome.fa 5000 bothutr.gb --good=bothutr.lst

You need to adapt the flanking region length (here 5000) to a suitable value for your target genome! The Perl script automatically ensures that coding regions of neighboring genes listed in genes.gff are excluded from the flanking region in the genbank file. In addition, genbank training entries are only created for the genes listed in bothutr.lst.

It is a good idea to hold out a test data set for measuring accuracy after training. Let's assume that you have 500 training genes with both UTRs available and you want to hold out 100 for testing:

    randomSplit.pl bothutr.gb 100

This produces a file bothutr.gb.train with 400 entries and a file bothutr.gb.test with 100 entries.

Optimizing AUGUSTUS UTR parameters

    etraining --species=yourSpecies bothutr.gb.train
    optimize_augustus.pl --species=yourSpecies --cpus=8 --rounds=3 bothutr.gb.train --UTR=on --metapars=/pathToYourAUGUSTUS/config/species/yourSpecies/yourSpecies_metapars.utr.cfg --trainOnlyUtr=1
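Before starting the training and optimization runs above, it can be worth confirming that the hold-out split contains the expected number of entries. The following is a minimal sketch that counts genbank records by their LOCUS lines. The helper names count_entries and make_toy_gb are made up for illustration, and the toy files stand in for bothutr.gb.train and bothutr.gb.test with 4 and 1 entries instead of 400 and 100.

```shell
#!/bin/sh
# Count genbank records: every record starts with a line beginning with "LOCUS".
count_entries() {
  grep -c '^LOCUS' "$1"
}

# Write a toy genbank file with N minimal records (hypothetical content;
# only the LOCUS and // lines matter for counting).
make_toy_gb() {   # usage: make_toy_gb FILE N
  i=1
  : > "$1"
  while [ "$i" -le "$2" ]; do
    printf 'LOCUS       toy_%s   100 bp  DNA\n//\n' "$i" >> "$1"
    i=$((i + 1))
  done
}

workdir=$(mktemp -d)
make_toy_gb "$workdir/bothutr.gb.train" 4
make_toy_gb "$workdir/bothutr.gb.test" 1

n_train=$(count_entries "$workdir/bothutr.gb.train")
n_test=$(count_entries "$workdir/bothutr.gb.test")
echo "train: $n_train entries, test: $n_test entries"
```

On real data, count_entries would be called on bothutr.gb.train and bothutr.gb.test directly; for the example in the text, the expected counts are 400 and 100.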
Test prediction accuracy

After training, you can test the prediction accuracy on the test data set:

    augustus --species=yourSpecies bothutr.gb.test

This produces output that contains accuracy values. Be aware that you need to adjust the yourSpecies_parameters.cfg file to predict UTRs by default! Call AUGUSTUS with the parameter