Augustus /
IncorporateProteinsCreating Hints for AUGUSTUS from Protein SequencesDo not confuse proteins with peptides! This page describes the generation of hints from full-length proteins.
Software requirements: This protocol was tested with the following versions:
Please read the installation instructions of each toolkit carefully and follow their advice! 0. Correctly format your fasta headersBefore actually starting to follow this tutorial, please check that the fasta headers of your genome and protein files are short and unique and do not contain whitespaces! Positive example: >VDBG_00005T0 MRLIPIHERVELPEDLIPADSRVEIDAKITAGYFTAGKRMTEEELSAVQGRLWEDIDH* >VDBG_00006T0 MSDPSASAITDHVGLGDVLSTLKSIQLTQASLVTAVESLSRTVPQAATGATIDARSAGPN DLDQSLDSNNVADLRASQHHVATSEGPELQAPAVPSSPEQRSGFTSRIVLTPDFTNTEPA SRIGPFPQWGDEKKIVAMDPWGHLAPWLFKDTIENENVDIRPTIAITKAHMKLPELAESV KSGRL* >VDBG_00007T0 MGKRKSSSKPQGPKKKDPLPELFPCLFCNHEDAVKPKVDKKSGVGNLSCKVCGQTFQCSI NYLSAPVDVYSEWVDAADHVSSKQKAVASGLSQGLVTRRMERPIEERDDEGIVADDDEY* Bad example that needs re-formatting of headers: >VDBG_00001T0 | VDBG_00001 | Verticillium albo-atrum VaMs.102 galactosyl transferase GMA12/MNN10 family protein (330 aa) MHGYHHYIATNQAVGDLIENEADRRPQGAWTKPAYLLSLIVAELEKPEDERLEWIFWFDA DTVVVNPSTPLEVFLPPKSDEDLTSVHLLIAANWDGLNSGAFALRVHPWSVSLLSAVLAY PIYMSGRTGKDRFRDQSAFQYLLQDDKSPLANSYTKGKEHWATVPMRWFNALPVNNAFSK NGQGWLFGKKMEGALFDNGTTEIYDDGNGGKIQPWKIMQGDMIVHFAGTTAGGTRDSWMG PWLDRVEALLPEWNNVTTQHRLRDETAKFWSETSARISSEKAIADAKMKLDAEKKAAADK AAEAKKAEEERKKAEEEKKKADEEKQPMD* >VDBG_00002T0 | VDBG_00002 | Verticillium albo-atrum VaMs.102 abhydrolase domain-containing protein (307 aa) MTRYKSRPSLLGRIIHQAMIIRQGRSFSTSTKAHLKLAYELYEPSSSRAIGHDSHPIIFL HGLFGSKKNNRSISKVLARDLGRPVFALDLRNHGESPHDRHHDYTSMASDVAGFIIDHNL DEPTIIGHSMGAKTAMALALRSPDLVRNIISVDNAPVDAVLESGFGNYVEGMKRIERAGV MRQAEADDILKNHEESLPVRQFLLANLYRPQPNKPQQFRVPLDILGRSLGHMADFPFKNP EETRFEKPALFIRGTRSKYVADDVLPLIGQFFPRFRLIDVDAGHWLISEKPEAFREAVVD FLSTSK* 1. Split protein fileNCBI-BLAST version 2.2.24 throws an error if the input query file is too large. Therefore, we split the protein file in smaller batches: splitMfasta.pl protein.fa --minsize=20000 --outputpath=split_prot_directory 2. Create a list of potential matchesAttention: this step is optional! We use BLAST only to speed up exonerate, later. If you encounter problems with executing this step, just skip it and run exonerate, directly! Using BLAST to create a list of potential queries and targets that have matches prior actually running exonerate saves a lot of time. Run BLAST (formatted as a Sun Grid Engine Script where nFiles needs to be replaced by the actual number of files residing in #!/bin/bash #$ -S /bin/bash #$ -cwd #$ -m e #$ -t 1-nFiles tblastn -query protein.split.$SGE_TASK_ID.fa -subject genome.fa > blast.$SGE_TASK_ID.out Subsequent running BLAST, you should check whether it finished without errors: # check for errors grep "Assertion failed" blast.* Continue if no such errors are reported. Otherwise, split protein file in even smaller batch files and try again. Create indices for protein and genome files: cdbfasta protein.fa cdbfasta genome.fa Produce a list of potential query-target combinations: # create list with list of genomic contigs matching each protein, discard all hits with e-value > 1*10^-5 cat blast.* | allBlastMatches_ncbi-blast.pl > tblastn.matches cat tblastn.matches | perl -e 'while(<>){split; if ($q eq $_[0]){$t .= "\t$_[1]"} else {print "$q$t\n"; $t="\t$_[1]";$q=$_[0];}} print "$q$t\n";' > tblastn.matchlists Split the matchlist into smaller files in order to run exonerate on a Sun Grid Engine in parallel mode: split -l 1000 tblastn.matchlists -d 3. Run ExonerateYou can either sequentially run exonerate in a bash loop, or wrap it in a SGE array script. The central command is (the exonerate binary must be in your path!): runExonerate.pl x00 x00 protein.fa.cidx genome.fa.cidx You'll need to run this command for all the x00, x01, ... files that were produced by the split command. Subsequently, concatenate the results: for i in 00 01 ... do cat exonerate.x$i.out >> exonerate.out done 4. Create hintsexonerate2hints.pl --in=exonerate.out --source=P --out=exonerate.hints This tutorial was created by Katharina Hoff, last update May 18th 2012. No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact: augustus-web@uni-greifswald.de |