Creating Hints for AUGUSTUS from ESTs/cDNA sequences

This protocol is also suitable for integrating long-read RNA-Seq data, e.g. long 454 transcriptome reads (>~450 nt read length).

The autoAug-Pipeline (http://bioinf.uni-greifswald.de/augustus/binaries/scripts/) and our webserver application (http://bioinf.uni-greifswald.de/webaugustus/index) use this protocol, so if you apply autoAug.pl or use our webserver, you don't have to create hints from ESTs manually!
Software requirements: This protocol was tested with the following versions:
Please read the installation instructions of each toolkit carefully and follow their advice!

0. Correctly format your fasta headers

Badly formatted fasta headers are the most frequent source of problems when working with AUGUSTUS and related scripts. Before you start this tutorial, please check that the fasta headers of your files are short and unique and do not contain whitespace!

Positive example:

    >VDBG_00005T0
    MRLIPIHERVELPEDLIPADSRVEIDAKITAGYFTAGKRMTEEELSAVQGRLWEDIDH*
    >VDBG_00006T0
    MSDPSASAITDHVGLGDVLSTLKSIQLTQASLVTAVESLSRTVPQAATGATIDARSAGPN
    DLDQSLDSNNVADLRASQHHVATSEGPELQAPAVPSSPEQRSGFTSRIVLTPDFTNTEPA
    SRIGPFPQWGDEKKIVAMDPWGHLAPWLFKDTIENENVDIRPTIAITKAHMKLPELAESV
    KSGRL*
    >VDBG_00007T0
    MGKRKSSSKPQGPKKKDPLPELFPCLFCNHEDAVKPKVDKKSGVGNLSCKVCGQTFQCSI
    NYLSAPVDVYSEWVDAADHVSSKQKAVASGLSQGLVTRRMERPIEERDDEGIVADDDEY*

Bad example that needs re-formatting of headers:

    >VDBG_00001T0 | VDBG_00001 | Verticillium albo-atrum VaMs.102 galactosyl transferase GMA12/MNN10 family protein (330 aa)
    MHGYHHYIATNQAVGDLIENEADRRPQGAWTKPAYLLSLIVAELEKPEDERLEWIFWFDA
    DTVVVNPSTPLEVFLPPKSDEDLTSVHLLIAANWDGLNSGAFALRVHPWSVSLLSAVLAY
    PIYMSGRTGKDRFRDQSAFQYLLQDDKSPLANSYTKGKEHWATVPMRWFNALPVNNAFSK
    NGQGWLFGKKMEGALFDNGTTEIYDDGNGGKIQPWKIMQGDMIVHFAGTTAGGTRDSWMG
    PWLDRVEALLPEWNNVTTQHRLRDETAKFWSETSARISSEKAIADAKMKLDAEKKAAADK
    AAEAKKAEEERKKAEEEKKKADEEKQPMD*
    >VDBG_00002T0 | VDBG_00002 | Verticillium albo-atrum VaMs.102 abhydrolase domain-containing protein (307 aa)
    MTRYKSRPSLLGRIIHQAMIIRQGRSFSTSTKAHLKLAYELYEPSSSRAIGHDSHPIIFL
    HGLFGSKKNNRSISKVLARDLGRPVFALDLRNHGESPHDRHHDYTSMASDVAGFIIDHNL
    DEPTIIGHSMGAKTAMALALRSPDLVRNIISVDNAPVDAVLESGFGNYVEGMKRIERAGV
    MRQAEADDILKNHEESLPVRQFLLANLYRPQPNKPQQFRVPLDILGRSLGHMADFPFKNP
    EETRFEKPALFIRGTRSKYVADDVLPLIGQFFPRFRLIDVDAGHWLISEKPEAFREAVVD
    FLSTSK*

1. Run BLAT

Use the repeat-masked version of your target genome (if available):

    blat -noHead -minIdentity=92 genome.masked.fa ests.fa ests.psl

2. Filter Alignments

    pslCDnaFilter -minId=0.9 -localNearBest=0.005 -ignoreNs -bestOverlap ests.psl ests.f.psl
    cat ests.f.psl | sort -n -k 16,16 | sort -s -k 14,14 > ests.fs.psl

3. Create Hints

    blat2hints.pl --in=ests.fs.psl --out=ests.E.hints --minintronlen=35 --trunkSS

This tutorial was created by Katharina Hoff, last update May 18th 2012. No warranty for completeness or ability to run. No responsibility for links to external web pages.

Contact: augustus-web@uni-greifswald.de
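
The header check described in step 0 can be done by hand, but a small script may be more convenient for large files. The following is a minimal sketch, not part of the AUGUSTUS scripts: the function names clean_fasta_headers and duplicate_fasta_headers and the demo file names are hypothetical, and it assumes that the first whitespace-delimited token of each header is already a short, usable identifier.

```shell
#!/bin/sh
# Hypothetical helper (not an AUGUSTUS script): keep only the first
# whitespace-delimited token of every FASTA header line; sequence lines
# are passed through unchanged.
clean_fasta_headers() {
    awk '/^>/ { print $1; next } { print }' "$1"
}

# Hypothetical helper: print every header that occurs more than once.
# No output means all headers are unique.
duplicate_fasta_headers() {
    grep '^>' "$1" | sort | uniq -d
}

# Demo on a tiny made-up input with one overlong header.
tmpdir=$(mktemp -d)
printf '%s\n' \
    '>VDBG_00001T0 | VDBG_00001 | Verticillium albo-atrum VaMs.102' \
    'MHGYHHYIATNQAVGDLIENEADRRPQGAWTKPAYLLSLIVAELEKPEDERLEWIFWFDA' \
    '>VDBG_00002T0' \
    'MTRYKSRPSLLGRIIHQAMIIRQGRSFSTSTKAHLKLAYELYEPSSSRAIGHDSHPIIFL' \
    > "$tmpdir/in.fa"

clean_fasta_headers "$tmpdir/in.fa" > "$tmpdir/clean.fa"
cat "$tmpdir/clean.fa"
duplicate_fasta_headers "$tmpdir/clean.fa"
```

On real data the same functions could be applied to the EST file and the genome file before running BLAT, for example clean_fasta_headers ests.fa > ests.clean.fa (the output name is only an illustration). Truncating headers can create duplicates, so it is worth running the duplicate check on the cleaned file. Headers that have a space directly after ">" would be reduced to a bare ">" and need to be fixed manually.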