|
Augustus /
JoiningTrainingGenesJoining training gene structures from different sources for training AUGUSTUS In some cases, the number of training gene structures that was created from one source is not sufficiently high for training AUGUSTUS, or you believe that certain types of genes are underrepresented in a training gene structure file and you want to add some more. The tasks to be accomplished in this case are:
One might think that the second and the third task are identical, and depending on the individual situation, the third task might be sufficient, but let's consider the following situation: gene structures of source 1 were created from highly reliable protein sequences, and gene structures of source 2 were created from assembled RNA-Seq data and a transcript to genome alignment. Source 2 might actually lead to a different reading frame than source 1, and in this particular case, it is desirable to discard genes that overlap on genomic level from source 2. Software requirements: This protocol was tested with the following versions:
Starting situation You have two training gene structure files in gff format. Format example for file 1: C50592941 PASA CDS 1963 2445 . + 0 transcript_id "asmbl_45" scaffold1032576 PASA CDS 4180 4635 . + 0 transcript_id "asmbl_655" scaffold112986 PASA CDS 1106 1504 . + 0 transcript_id "asmbl_876" Format example for file 2: scaffold409016 cegma CDS 2789 3085 154.89 + . transcript_id "g1.KOG0003.5" scaffold661396 cegma CDS 20 104 25.17 + . transcript_id "g2.KOG0019.6" Discard genes from file 1 that overlap on genomic level with file 2 Here, we assume that file 2 contains "more reliable" training genes. Therefore, we want to remove those genes from file 1 that overlap with genes in file 2. First, we create a list of "bad genes":
overlap file1.gff file2.gff -st 1 -v 2> /dev/stdout | perl -ne 'if(not(m/\#/)){if(m/ov_feat2: 1/){@t = split(/"/); \
if(not(exists($seen{$t[1]}))){print $t[1]."\n"; $seen{$t[1]} = 1;}}}' > bad.lst
asmbl_6638 asmbl_7799 Then, we remove the "bad entries" from file 1: grep -v -F -f bad.lst file1.gff > file1.f.gff Subsequently, we concatenate the filtered file 1 and file 2: cat file1.f.gff file2.gff > both.gff Discard genes that are redundant in amino acid sequence The first task is to convert the gff files and a genome file to amino acid fasta files. This can certainly also be achieved in other ways, but we perfer do it as follows: gff2gbSmallDNA.pl both.gff genome.fa 10000 both.gb augustus --species=generic --noprediction=true both.gb > both.aug.gff getAnnoFasta.pl both.aug.gff
>g149.KOG3349.1.1 MNCSNSKFVFVTVGTTEFTELVRKIISPDTCTVLRSLGFTNIILQAGSSTFSLPETPVSGVNIEAYKYKPTLKEDFEKADLVISHAGAGSCLEALDANKP LVVVVNDTLLNNHQTELAEELQKKNNLCCCNVENLIKTLQQSMFTDLDPLPRADPTNFALFLGNVMGFN >g130.KOG2726.5.1 MLSDPEIVNDPDKLRKYSKEQADLQKTVDVYRDYKSKREEVAEIDEMLTETDDKEEIEMLKEEASGLKAAGGDEAAIFAGDLLRMYTKYAESQNFKTEIL KFENGAHRVQRVPETESGGRIHTSTATVAVLPEVEDVEIEVRNEDLKIDTYRSSGAGGQHVNTTDSAVRITHIPTGIIATSSEKSQIQNREKALKVLKAR LYDMKLQEEQQKYAAQRKSAVGTGDRSERVRTYNYPQSRVTDHRIGLTLQKLDQIMEGKLDEIIDALTLSEQTEKLKELNNGEL while
LOCUS scaffold31347_1-1944 1944 bp DNA
FEATURES Location/Qualifiers
source 1..1944
CDS complement(1074..1382)
/gene="asmbl_4233"
BASE COUNT 643 a 373 c 337 g 591 t
ORIGIN
1 acgtatggca gctaaccacc atccacacga gtttccatct aattatcagg cacgaaaacg
61 ttgttcatgt tttcactgta ataaaacata gttttcaatt tcgcagtccg aagtgtaata
121 aataaaaatg aacggtacat ctttatctcc agaactcggt cgactattcc aacaaagcac
181 cataatctac aaattcgaca gatcaccttt gtccaaaact atacggacta gatttgcatt
241 atttaatgat tctaaatttt acgatttctg caatagataa ccgtaaaaga ttacacactt
301 tatgaaaaat gaggccatcg aataacagcc gtaatgcgcc aagctgttgc ttacatgctc
361 atatataatc cgctcaaaca ggaccgaggc ttttctttga tatttttaca ttatctggca
421 aatacgaata ataaatatat tcgtgtctca ggagcttcgt aacttacaaa actctatttt
481 gatgaaacaa tgacatttaa tgtaacctta ttggaaacca gaaaaagcat taatttacac
541 gtgacgaatc tgtagaataa taaaacaatc tccaacaagg agataccaaa ttaaaagaga
601 ctagatgttt aaaaaatgag gttatttgcg gagatgcacg aaatggtcca ttcaaatcat
661 tggctatgta ttttcgttta ataaaatcag tatatcgtgg ctgaaacatg tttaatagcc
721 acgttaaacg aaaatgttgg ttcggtatca ctgaactgag aaattaacgt atatatttta
781 taaaggaatc ttaaaatcac aaattgtgta ccccatgaaa gcaaaaataa ggttttaaaa
841 tatatatgta tagcacgtgg acagtccaat tgctatatca tcgcagcgaa gtgaaacttt
901 agattattca tataatactc ttttaaatca tctaaattaa tgttcataca aaacatattc
961 ccgaatagaa attcacgatg acaatttgga gttttaggat gtgtgaaata gcacttgtgc
1021 ttgtataaaa acatgtcata aagagaacct tcttaaatta tccaactata atatcaacac
1081 attgaaattg taacgttgat accaagatta ccattttcag cgaacagatt agccatgctt
1141 gtgtcgaata caagacagtc actgttctgg attgctgact gaactccttc gtgtatggaa
1201 cgcggagttg cttcccacgc cagtcttcgt ctgttgccgt ttaattctaa gcgataagcg
1261 aagttctcgg cttgtttttt tgaacctatt aattgaacaa ttgcatagaa ttgttgtaaa
1321 ttttccattt tgtcctgctt ttctaacaca agcataaaat ggtagccaaa acaactctgc
1381 atcataaccc agtcaacagc accgggtaaa tttatatctg tggccagaaa cacaatatct
1441 tctccttgaa gagtagtaat cgatttatga gcgtgagtta aatgcggcat aacttgttct
1501 aatgatcctt gccacttaca agaggcacca ggacaagggc aattatatgg cttaaaatca
1561 catgtgtctt catgttctag tttgtctgta tgaagtaatg ataataaaca tccgctactg
1621 gtatatttac atggaaacaa tactgtattt gcaaccttct ccattgcaag gtttcgaatg
1681 cttcccagag ggcccctgca cgtcgggcaa cattgcaatt taggcttgca attcgcacaa
1741 acaatgtgcc cactttgaca ttgaattata ggaggtaggg cataatcaaa acatacagga
1801 cactcaaaaa gcgccgctag atcttgaccc gaactttgat gttcgatctg ttgtattata
1861 ggagtggcaa ttaaagggac tggcgatcct ctgatttgtg gctgcgctgc tgcacgaatc
1921 tgccctttgt tattcatatc cccg
//
In the next step, aa2nonred.pl both.aug.aa both.nr.aa grep ">" both.nr.aa | perl -pe 's/>//' > nonred.lst
asmbl_45.1 g37.KOG0481.5.1 Now, we need to identify the loci in the genbank file cat both.gb | perl -ne 'print "$1\t" if (/LOCUS\s*(\S+)\s.*/); print "$1.1\n" if (/gene="(\S+)"/)' | \ getLinesMatching.pl nonred.lst 2 | cut -f 1 > nonred.loci.lst
scaffold4363_1-4556 scaffold4691_1-7604 scaffold5423_1-6740 As a last step, we now filter filterGenesIn.pl nonred.loci.lst both.gb > both.nr.gb Removing genes that cause errors during training augustus This step is not exactly related to joining to training gene sets, only, but you should always remove error causing genes from your genbank file: etraining --species=generic both.nr.gb 2>&1 | grep "n sequence" | \ perl -pe 's/.*n sequence (\S+):.*/$1/' | sort -u > bad.lst filterGenes.pl bad.lst both.nr.gb > both.nr.good.gb No warranty for completeness or ability to run. No responsibility for links to external web pages. This manual contains scripts that have not been officially released because they are still in a very early stage of development. Contact: augustus-web@uni-greifswald.de |