Bioinformatics Greifswald | Augustus / JoiningTrainingGenes

Joining training gene structures from different sources for training AUGUSTUS

In some cases, the number of training gene structures that was created from one source is not sufficiently high for training AUGUSTUS, or you believe that certain types of genes are underrepresented in a training gene structure file and you want to add some more. The tasks to be accomplished in this case are:

joining the gff files
remove gene structures that overlap on genomic level, i.e. that are redundant
remove gene structures of highly similar proteins, i.e. redundant genes from different genomic locations

One might think that the second and the third task are identical, and depending on the individual situation, the third task might be sufficient, but let's consider the following situation: gene structures of source 1 were created from highly reliable protein sequences, and gene structures of source 2 were created from assembled RNA-Seq data and a transcript to genome alignment. Source 2 might actually lead to a different reading frame than source 1, and in this particular case, it is desirable to discard genes that overlap on genomic level from source 2.

Software requirements:

This protocol was tested with the following versions:

overlap v3.1 (available at http://big.crg.cat/bioinformatics_and_genomics/services_and_software)
augustus & scripts (available at http://bioinf.uni-greifswald.de/augustus/binaries/)
aa2nonred.pl, filterGenesIn.pl & getLinesMatching.pl (available on request from augustus-web@uni-greifswald.de)
NCBI BLAST v2.2.25 (available at http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)

Starting situation

You have two training gene structure files in gff format.

Format example for file 1:

C50592941       PASA CDS     1963    2445    .       +       0       transcript_id "asmbl_45"
scaffold1032576 PASA CDS     4180    4635    .       +       0       transcript_id "asmbl_655"
scaffold112986  PASA CDS     1106    1504    .       +       0       transcript_id "asmbl_876"

Format example for file 2:

scaffold409016  cegma   CDS     2789    3085    154.89  +       .       transcript_id "g1.KOG0003.5"
scaffold661396  cegma   CDS     20      104     25.17   +       .       transcript_id "g2.KOG0019.6"

Discard genes from file 1 that overlap on genomic level with file 2

Here, we assume that file 2 contains "more reliable" training genes. Therefore, we want to remove those genes from file 1 that overlap with genes in file 2.

First, we create a list of "bad genes":

overlap file1.gff file2.gff -st 1 -v 2> /dev/stdout | perl -ne 'if(not(m/\#/)){if(m/ov_feat2: 1/){@t = split(/"/); \
  if(not(exists($seen{$t[1]}))){print $t[1]."\n"; $seen{$t[1]} = 1;}}}' > bad.lst

bad.lst has the following format:

asmbl_6638
asmbl_7799

Then, we remove the "bad entries" from file 1:

grep -v -F -f bad.lst file1.gff > file1.f.gff

Subsequently, we concatenate the filtered file 1 and file 2:

cat file1.f.gff file2.gff > both.gff

Discard genes that are redundant in amino acid sequence

The first task is to convert the gff files and a genome file to amino acid fasta files. This can certainly also be achieved in other ways, but we perfer do it as follows:

gff2gbSmallDNA.pl both.gff genome.fa 10000 both.gb

augustus --species=generic --noprediction=true both.gb > both.aug.gff

getAnnoFasta.pl both.aug.gff

gff2gbSmallDNA.pl produces a genbank gene structure file that can e.g. be used for training augustus, or for converting CDS to amino acid sequence with augustus. You the 10000 is a parameter for the "flanking region". It should be set to a size that is suitable for your individual species! augustus is here not called for predicting genes but for producing the amino acid sequences in gff format (user defined format in comment lines). getAnnoFasta.pl finally extracts the protein sequences and writes them to fasta format. The resulting fasta file both.aug.aa has the following format:

>g149.KOG3349.1.1
MNCSNSKFVFVTVGTTEFTELVRKIISPDTCTVLRSLGFTNIILQAGSSTFSLPETPVSGVNIEAYKYKPTLKEDFEKADLVISHAGAGSCLEALDANKP
LVVVVNDTLLNNHQTELAEELQKKNNLCCCNVENLIKTLQQSMFTDLDPLPRADPTNFALFLGNVMGFN
>g130.KOG2726.5.1
MLSDPEIVNDPDKLRKYSKEQADLQKTVDVYRDYKSKREEVAEIDEMLTETDDKEEIEMLKEEASGLKAAGGDEAAIFAGDLLRMYTKYAESQNFKTEIL
KFENGAHRVQRVPETESGGRIHTSTATVAVLPEVEDVEIEVRNEDLKIDTYRSSGAGGQHVNTTDSAVRITHIPTGIIATSSEKSQIQNREKALKVLKAR
LYDMKLQEEQQKYAAQRKSAVGTGDRSERVRTYNYPQSRVTDHRIGLTLQKLDQIMEGKLDEIIDALTLSEQTEKLKELNNGEL

while both.gb has this format:

LOCUS       scaffold31347_1-1944   1944 bp  DNA
FEATURES             Location/Qualifiers
     source          1..1944
     CDS             complement(1074..1382)
                     /gene="asmbl_4233"
BASE COUNT     643 a   373 c  337 g   591 t
ORIGIN
        1 acgtatggca gctaaccacc atccacacga gtttccatct aattatcagg cacgaaaacg
       61 ttgttcatgt tttcactgta ataaaacata gttttcaatt tcgcagtccg aagtgtaata
      121 aataaaaatg aacggtacat ctttatctcc agaactcggt cgactattcc aacaaagcac
      181 cataatctac aaattcgaca gatcaccttt gtccaaaact atacggacta gatttgcatt
      241 atttaatgat tctaaatttt acgatttctg caatagataa ccgtaaaaga ttacacactt
      301 tatgaaaaat gaggccatcg aataacagcc gtaatgcgcc aagctgttgc ttacatgctc
      361 atatataatc cgctcaaaca ggaccgaggc ttttctttga tatttttaca ttatctggca
      421 aatacgaata ataaatatat tcgtgtctca ggagcttcgt aacttacaaa actctatttt
      481 gatgaaacaa tgacatttaa tgtaacctta ttggaaacca gaaaaagcat taatttacac
      541 gtgacgaatc tgtagaataa taaaacaatc tccaacaagg agataccaaa ttaaaagaga
      601 ctagatgttt aaaaaatgag gttatttgcg gagatgcacg aaatggtcca ttcaaatcat
      661 tggctatgta ttttcgttta ataaaatcag tatatcgtgg ctgaaacatg tttaatagcc
      721 acgttaaacg aaaatgttgg ttcggtatca ctgaactgag aaattaacgt atatatttta
      781 taaaggaatc ttaaaatcac aaattgtgta ccccatgaaa gcaaaaataa ggttttaaaa
      841 tatatatgta tagcacgtgg acagtccaat tgctatatca tcgcagcgaa gtgaaacttt
      901 agattattca tataatactc ttttaaatca tctaaattaa tgttcataca aaacatattc
      961 ccgaatagaa attcacgatg acaatttgga gttttaggat gtgtgaaata gcacttgtgc
     1021 ttgtataaaa acatgtcata aagagaacct tcttaaatta tccaactata atatcaacac
     1081 attgaaattg taacgttgat accaagatta ccattttcag cgaacagatt agccatgctt
     1141 gtgtcgaata caagacagtc actgttctgg attgctgact gaactccttc gtgtatggaa
     1201 cgcggagttg cttcccacgc cagtcttcgt ctgttgccgt ttaattctaa gcgataagcg
     1261 aagttctcgg cttgtttttt tgaacctatt aattgaacaa ttgcatagaa ttgttgtaaa
     1321 ttttccattt tgtcctgctt ttctaacaca agcataaaat ggtagccaaa acaactctgc
     1381 atcataaccc agtcaacagc accgggtaaa tttatatctg tggccagaaa cacaatatct
     1441 tctccttgaa gagtagtaat cgatttatga gcgtgagtta aatgcggcat aacttgttct
     1501 aatgatcctt gccacttaca agaggcacca ggacaagggc aattatatgg cttaaaatca
     1561 catgtgtctt catgttctag tttgtctgta tgaagtaatg ataataaaca tccgctactg
     1621 gtatatttac atggaaacaa tactgtattt gcaaccttct ccattgcaag gtttcgaatg
     1681 cttcccagag ggcccctgca cgtcgggcaa cattgcaatt taggcttgca attcgcacaa
     1741 acaatgtgcc cactttgaca ttgaattata ggaggtaggg cataatcaaa acatacagga
     1801 cactcaaaaa gcgccgctag atcttgaccc gaactttgat gttcgatctg ttgtattata
     1861 ggagtggcaa ttaaagggac tggcgatcct ctgatttgtg gctgcgctgc tgcacgaatc
     1921 tgccctttgt tattcatatc cccg
//

In the next step, blastp is used to identify proteins that have more than 80% sequence identity (here defined as [number of identical aa]/[length of sorter sequence]). aa2nonred.pl runs BLAST and does the filtering:

aa2nonred.pl both.aug.aa both.nr.aa

grep ">" both.nr.aa | perl -pe 's/>//' > nonred.lst

nonred.lst has the following format (identical to the format of bad.lst):

asmbl_45.1
g37.KOG0481.5.1

Now, we need to identify the loci in the genbank file both.gb that correspond to the gene names in nonred.lst:

cat both.gb | perl -ne 'print "$1\t" if (/LOCUS\s*(\S+)\s.*/); print "$1.1\n" if (/gene="(\S+)"/)' | \
  getLinesMatching.pl nonred.lst 2 | cut -f 1 > nonred.loci.lst

nonred.loci.lst has the following format:

scaffold4363_1-4556
scaffold4691_1-7604
scaffold5423_1-6740

As a last step, we now filter both.gb to keep only the nonredundant loci:

filterGenesIn.pl nonred.loci.lst both.gb > both.nr.gb

Removing genes that cause errors during training augustus

This step is not exactly related to joining to training gene sets, only, but you should always remove error causing genes from your genbank file:

etraining --species=generic both.nr.gb 2>&1 | grep "n sequence" | \
  perl -pe 's/.*n sequence (\S+):.*/$1/' | sort -u > bad.lst

filterGenes.pl bad.lst both.nr.gb > both.nr.good.gb

No warranty for completeness or ability to run. No responsibility for links to external web pages. This manual contains scripts that have not been officially released because they are still in a very early stage of development. Contact: augustus-web@uni-greifswald.de