Training AUGUSTUS UTR parameters

UTR parameters are of particular importance for integrating RNA-seq evidence in the form of exonpart hints into AUGUSTUS gene predictions. A model for the coding regions (CDS) only may work, but could suffer from false positive predictions in longer UTRs. This tutorial describes how you can train UTR parameters for your target species on training data that was generated from an arbitrary source.

If you ran autoAug.pl (offline or via http://bioinf.uni-greifswald.de/webaugustus) with a cDNA file and a genome file as input, and autoAug.pl did not issue any warnings or error messages, you do not need to repeat UTR training manually, because you already have UTR parameters for your species.

If you executed autoAug.pl with a protein file and a genome file as input, UTR training examples were not generated and thus, no UTR parameters were trained.

Training gene structure file format

The input file for UTR training as described in this tutorial must contain CDS, 5'-UTR, and 3'-UTR features in GFF format. It does not matter which source is given in column 2 (here manual or AUGUSTUS), but the last column must contain the same grouping identifier for all entries of a gene (here transcript_id "au2.g1000.t1"; gene_id "au2.g1000";).

scaffold1 manual  5'-UTR  2530693 2530772 .       +       0       transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
scaffold1 AUGUSTUS        CDS     2530773 2530830 0.76    +       0       transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
scaffold1 AUGUSTUS        CDS     2530893 2531019 1       +       2       transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
scaffold1 AUGUSTUS        CDS     2531114 2531422 1       +       1       transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
scaffold1 AUGUSTUS        CDS     2531483 2531588 0.98    +       1       transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
scaffold1 manual  3'-UTR  2531592 2531937 .       +       0       transcript_id "au2.g1000.t1"; gene_id "au2.g1000";
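
To check that a file follows this layout before training, the features can be tallied per transcript. The following is a minimal sketch and not part of the AUGUSTUS scripts: the function name summarize_features and the second sample transcript (au2.g2000.t1, with made-up coordinates) are invented for illustration, and it assumes tab-separated columns with a transcript_id "..." entry in the last column, as in the example above.

```shell
# Print one line per transcript: ID, then the number of 5'-UTR, CDS and
# 3'-UTR features that carry this transcript_id in column 9.
summarize_features() {
    awk -F'\t' '
        match($9, /transcript_id "[^"]+"/) {
            # the ID starts 15 characters into the match and ends before the closing quote
            id = substr($9, RSTART + 15, RLENGTH - 16)
            seen[id] = 1
            n[id, $3]++
        }
        END {
            for (id in seen)
                printf "%s\t%d\t%d\t%d\n", id, n[id, "5\047-UTR"], n[id, "CDS"], n[id, "3\047-UTR"]
        }
    ' "$1" | sort
}

# Small sample file: one transcript with both UTRs, one (made up) with CDS only.
sample_gff=$(mktemp)
{
    printf "scaffold1\tmanual\t5'-UTR\t2530693\t2530772\t.\t+\t0\ttranscript_id \"au2.g1000.t1\"; gene_id \"au2.g1000\";\n"
    printf "scaffold1\tAUGUSTUS\tCDS\t2530773\t2530830\t0.76\t+\t0\ttranscript_id \"au2.g1000.t1\"; gene_id \"au2.g1000\";\n"
    printf "scaffold1\tAUGUSTUS\tCDS\t2530893\t2531019\t1\t+\t2\ttranscript_id \"au2.g1000.t1\"; gene_id \"au2.g1000\";\n"
    printf "scaffold1\tmanual\t3'-UTR\t2531592\t2531937\t.\t+\t0\ttranscript_id \"au2.g1000.t1\"; gene_id \"au2.g1000\";\n"
    printf "scaffold1\tAUGUSTUS\tCDS\t2600000\t2600300\t1\t+\t0\ttranscript_id \"au2.g2000.t1\"; gene_id \"au2.g2000\";\n"
} > "$sample_gff"

summarize_features "$sample_gff"
```

On the sample, au2.g1000.t1 is reported with counts 1, 2 and 1, and au2.g2000.t1 with 0, 1 and 0. A transcript with a zero in the second or fourth column lacks the corresponding UTR. For real data, call summarize_features genes.gff.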

The training gene structure file does not need to contain UTR examples for all genes, but UTRs must be present for a sufficiently large number of genes.

Extracting genes for which both UTR examples are present

cat genes.gff | perl -ne 'print if s/.*\t(\S+UTR)\t.*transcript_id \"(\S+)\".*/$2\t$1/;' | sort -u | perl -ne '@f = split; print "$f[0]\n" if ($g eq $f[0]); $g = $f[0];' > bothutr.lst

Note that the above filtering command relies on the structure of the last column in the GFF file. It creates a list of all genes for which both UTRs are present. Format:

au2.g1000.t1
au2.g1037.t1
au2.g1038.t1
...
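
Because the filter depends on the layout of the last column, it can be worth confirming that every line of the list really is a transcript ID from the annotation. The following is a rough sketch under that assumption; check_list is a made-up helper name, and the demo files are invented stand-ins for bothutr.lst and genes.gff.

```shell
# Report every entry of the list that does not occur as a transcript_id
# in the GFF file. Usage: check_list bothutr.lst genes.gff
check_list() {
    while IFS= read -r id; do
        grep -qF "transcript_id \"$id\"" "$2" || echo "not a transcript_id: $id"
    done < "$1"
}

# Demo on made-up data: the second list entry is a stray non-ID line.
demo_gff=$(mktemp)
demo_lst=$(mktemp)
printf "scaffold1\tmanual\t5'-UTR\t100\t200\t.\t+\t0\ttranscript_id \"au2.g1000.t1\"; gene_id \"au2.g1000\";\n" > "$demo_gff"
printf 'au2.g1000.t1\nscaffold1\n' > "$demo_lst"

check_list "$demo_lst" "$demo_gff"
```

The demo reports the stray entry scaffold1. An empty result for your own files means that all listed IDs were found in the annotation.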

Create a training file in genbank format

gff2gbSmallDNA.pl genes.gff genome.fa 5000 bothutr.gb --good=bothutr.lst

You need to adapt the flanking region length (here 5000) to a suitable value for your target genome. The Perl script automatically ensures that coding regions of neighboring genes listed in genes.gff are excluded from the flanking region in the genbank file. In addition, genbank training entries are only created for the genes listed in bothutr.lst.

It is a good idea to hold out a test data set for measuring accuracy after training. Let's assume that you have 500 training genes with both UTRs available and you want to hold out 100 for testing:

randomSplit.pl bothutr.gb 100

This produces a file bothutr.gb.train with 400 entries, and a file bothutr.gb.test with 100 entries.
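
The sizes of the two sets can be verified by counting entries. This is a small sketch that assumes each entry in the genbank files starts with a LOCUS line, as in standard GenBank flat files; count_entries is a hypothetical helper name and the demo file is a made-up stub.

```shell
# Count GenBank entries by counting the lines that start with LOCUS.
count_entries() {
    grep -c '^LOCUS' "$1"
}

# Demo on a made-up three-entry stub (real entries contain far more lines).
demo_gb=$(mktemp)
printf 'LOCUS       entry1\n//\nLOCUS       entry2\n//\nLOCUS       entry3\n//\n' > "$demo_gb"

count_entries "$demo_gb"
```

For the split described above, count_entries bothutr.gb.train should report 400 and count_entries bothutr.gb.test should report 100.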

Optimizing AUGUSTUS UTR parameters

etraining --species=yourSpecies bothutr.gb.train

optimize_augustus.pl --species=yourSpecies --cpus=8 --rounds=3 bothutr.gb.train --UTR=on --metapars=/pathToYourAUGUSTUS/config/species/yourSpecies/yourSpecies_metapars.utr.cfg --trainOnlyUtr=1

optimize_augustus.pl is very time consuming but can run in parallel. You should adjust the value of --cpus to something suitable for your computer. If the file /pathToYourAUGUSTUS/config/species/yourSpecies/yourSpecies_metapars.utr.cfg does not exist yet, you need to create it, e.g. by copying the contents of the corresponding file of another species.
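
Creating the missing file can be scripted. The sketch below is only an illustration: ensure_metapars is a made-up helper, otherSpecies is a placeholder for whichever existing species you copy from, and the file name follows the --metapars option of the optimize_augustus.pl command above. The demo runs in a throwaway directory with placeholder contents, not in a real AUGUSTUS configuration directory.

```shell
# Copy the UTR meta-parameter file from a template species unless the
# target species already has one.
# Usage: ensure_metapars CONFIG_DIR TARGET_SPECIES TEMPLATE_SPECIES
ensure_metapars() {
    mp_target="$1/species/$2/${2}_metapars.utr.cfg"
    mp_template="$1/species/$3/${3}_metapars.utr.cfg"
    if [ -e "$mp_target" ]; then
        echo "already present: $mp_target"
    else
        cp "$mp_template" "$mp_target" && echo "created: $mp_target"
    fi
}

# Demo in a throwaway directory with made-up contents.
demo_cfg=$(mktemp -d)
mkdir -p "$demo_cfg/species/yourSpecies" "$demo_cfg/species/otherSpecies"
echo "# placeholder meta parameters" > "$demo_cfg/species/otherSpecies/otherSpecies_metapars.utr.cfg"

ensure_metapars "$demo_cfg" yourSpecies otherSpecies
```

On a real installation the call would be ensure_metapars /pathToYourAUGUSTUS/config yourSpecies otherSpecies. An existing file is never overwritten, so repeating the call is harmless.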

Test prediction accuracy

After training, you can test the accuracy of prediction on the test data set:

augustus --species=yourSpecies bothutr.gb.test

The output ends with a summary of accuracy values (sensitivity and specificity).