Augustus /
JoiningTrainingGenesJoining training gene structures from different sources for training AUGUSTUS In some cases, the number of training gene structures that was created from one source is not sufficiently high for training AUGUSTUS, or you believe that certain types of genes are underrepresented in a training gene structure file and you want to add some more. The tasks to be accomplished in this case are:
One might think that the second and the third task are identical, and depending on the individual situation, the third task might be sufficient, but let's consider the following situation: gene structures of source 1 were created from highly reliable protein sequences, and gene structures of source 2 were created from assembled RNA-Seq data and a transcript to genome alignment. Source 2 might actually lead to a different reading frame than source 1, and in this particular case, it is desirable to discard genes that overlap on genomic level from source 2. Software requirements: This protocol was tested with the following versions:
Starting situation You have two training gene structure files in gff format. Format example for file 1: C50592941 PASA CDS 1963 2445 . + 0 transcript_id "asmbl_45" scaffold1032576 PASA CDS 4180 4635 . + 0 transcript_id "asmbl_655" scaffold112986 PASA CDS 1106 1504 . + 0 transcript_id "asmbl_876" Format example for file 2: scaffold409016 cegma CDS 2789 3085 154.89 + . transcript_id "g1.KOG0003.5" scaffold661396 cegma CDS 20 104 25.17 + . transcript_id "g2.KOG0019.6" Discard genes from file 1 that overlap on genomic level with file 2 Here, we assume that file 2 contains "more reliable" training genes. Therefore, we want to remove those genes from file 1 that overlap with genes in file 2. First, we create a list of "bad genes": overlap file1.gff file2.gff -st 1 -v 2> /dev/stdout | perl -ne 'if(not(m/\#/)){if(m/ov_feat2: 1/){@t = split(/"/); \ if(not(exists($seen{$t[1]}))){print $t[1]."\n"; $seen{$t[1]} = 1;}}}' > bad.lst
asmbl_6638 asmbl_7799 Then, we remove the "bad entries" from file 1: grep -v -F -f bad.lst file1.gff > file1.f.gff Subsequently, we concatenate the filtered file 1 and file 2: cat file1.f.gff file2.gff > both.gff Discard genes that are redundant in amino acid sequence The first task is to convert the gff files and a genome file to amino acid fasta files. This can certainly also be achieved in other ways, but we perfer do it as follows: gff2gbSmallDNA.pl both.gff genome.fa 10000 both.gb augustus --species=generic --noprediction=true both.gb > both.aug.gff getAnnoFasta.pl both.aug.gff
>g149.KOG3349.1.1 MNCSNSKFVFVTVGTTEFTELVRKIISPDTCTVLRSLGFTNIILQAGSSTFSLPETPVSGVNIEAYKYKPTLKEDFEKADLVISHAGAGSCLEALDANKP LVVVVNDTLLNNHQTELAEELQKKNNLCCCNVENLIKTLQQSMFTDLDPLPRADPTNFALFLGNVMGFN >g130.KOG2726.5.1 MLSDPEIVNDPDKLRKYSKEQADLQKTVDVYRDYKSKREEVAEIDEMLTETDDKEEIEMLKEEASGLKAAGGDEAAIFAGDLLRMYTKYAESQNFKTEIL KFENGAHRVQRVPETESGGRIHTSTATVAVLPEVEDVEIEVRNEDLKIDTYRSSGAGGQHVNTTDSAVRITHIPTGIIATSSEKSQIQNREKALKVLKAR LYDMKLQEEQQKYAAQRKSAVGTGDRSERVRTYNYPQSRVTDHRIGLTLQKLDQIMEGKLDEIIDALTLSEQTEKLKELNNGEL while LOCUS scaffold31347_1-1944 1944 bp DNA FEATURES Location/Qualifiers source 1..1944 CDS complement(1074..1382) /gene="asmbl_4233" BASE COUNT 643 a 373 c 337 g 591 t ORIGIN 1 acgtatggca gctaaccacc atccacacga gtttccatct aattatcagg cacgaaaacg 61 ttgttcatgt tttcactgta ataaaacata gttttcaatt tcgcagtccg aagtgtaata 121 aataaaaatg aacggtacat ctttatctcc agaactcggt cgactattcc aacaaagcac 181 cataatctac aaattcgaca gatcaccttt gtccaaaact atacggacta gatttgcatt 241 atttaatgat tctaaatttt acgatttctg caatagataa ccgtaaaaga ttacacactt 301 tatgaaaaat gaggccatcg aataacagcc gtaatgcgcc aagctgttgc ttacatgctc 361 atatataatc cgctcaaaca ggaccgaggc ttttctttga tatttttaca ttatctggca 421 aatacgaata ataaatatat tcgtgtctca ggagcttcgt aacttacaaa actctatttt 481 gatgaaacaa tgacatttaa tgtaacctta ttggaaacca gaaaaagcat taatttacac 541 gtgacgaatc tgtagaataa taaaacaatc tccaacaagg agataccaaa ttaaaagaga 601 ctagatgttt aaaaaatgag gttatttgcg gagatgcacg aaatggtcca ttcaaatcat 661 tggctatgta ttttcgttta ataaaatcag tatatcgtgg ctgaaacatg tttaatagcc 721 acgttaaacg aaaatgttgg ttcggtatca ctgaactgag aaattaacgt atatatttta 781 taaaggaatc ttaaaatcac aaattgtgta ccccatgaaa gcaaaaataa ggttttaaaa 841 tatatatgta tagcacgtgg acagtccaat tgctatatca tcgcagcgaa gtgaaacttt 901 agattattca tataatactc ttttaaatca tctaaattaa tgttcataca aaacatattc 961 ccgaatagaa attcacgatg acaatttgga gttttaggat gtgtgaaata gcacttgtgc 1021 ttgtataaaa acatgtcata aagagaacct tcttaaatta tccaactata atatcaacac 1081 attgaaattg taacgttgat accaagatta ccattttcag cgaacagatt agccatgctt 1141 gtgtcgaata caagacagtc actgttctgg attgctgact gaactccttc gtgtatggaa 1201 cgcggagttg cttcccacgc cagtcttcgt ctgttgccgt ttaattctaa gcgataagcg 1261 aagttctcgg cttgtttttt tgaacctatt aattgaacaa ttgcatagaa ttgttgtaaa 1321 ttttccattt tgtcctgctt ttctaacaca agcataaaat ggtagccaaa acaactctgc 1381 atcataaccc agtcaacagc accgggtaaa tttatatctg tggccagaaa cacaatatct 1441 tctccttgaa gagtagtaat cgatttatga gcgtgagtta aatgcggcat aacttgttct 1501 aatgatcctt gccacttaca agaggcacca ggacaagggc aattatatgg cttaaaatca 1561 catgtgtctt catgttctag tttgtctgta tgaagtaatg ataataaaca tccgctactg 1621 gtatatttac atggaaacaa tactgtattt gcaaccttct ccattgcaag gtttcgaatg 1681 cttcccagag ggcccctgca cgtcgggcaa cattgcaatt taggcttgca attcgcacaa 1741 acaatgtgcc cactttgaca ttgaattata ggaggtaggg cataatcaaa acatacagga 1801 cactcaaaaa gcgccgctag atcttgaccc gaactttgat gttcgatctg ttgtattata 1861 ggagtggcaa ttaaagggac tggcgatcct ctgatttgtg gctgcgctgc tgcacgaatc 1921 tgccctttgt tattcatatc cccg // In the next step, aa2nonred.pl both.aug.aa both.nr.aa grep ">" both.nr.aa | perl -pe 's/>//' > nonred.lst
asmbl_45.1 g37.KOG0481.5.1 Now, we need to identify the loci in the genbank file cat both.gb | perl -ne 'print "$1\t" if (/LOCUS\s*(\S+)\s.*/); print "$1.1\n" if (/gene="(\S+)"/)' | \ getLinesMatching.pl nonred.lst 2 | cut -f 1 > nonred.loci.lst
scaffold4363_1-4556 scaffold4691_1-7604 scaffold5423_1-6740 As a last step, we now filter filterGenesIn.pl nonred.loci.lst both.gb > both.nr.gb Removing genes that cause errors during training augustus This step is not exactly related to joining to training gene sets, only, but you should always remove error causing genes from your genbank file: etraining --species=generic both.nr.gb 2>&1 | grep "n sequence" | \ perl -pe 's/.*n sequence (\S+):.*/$1/' | sort -u > bad.lst filterGenes.pl bad.lst both.nr.gb > both.nr.good.gb No warranty for completeness or ability to run. No responsibility for links to external web pages. This manual contains scripts that have not been officially released because they are still in a very early stage of development. Contact: augustus-web@uni-greifswald.de |