
Creating Hints for AUGUSTUS from ESTs/cDNA sequences

This protocol is also suitable for integrating long-read RNA-Seq data, e.g. long 454 transcriptome reads (>~450 nt read length).

Both the autoAug pipeline (http://bioinf.uni-greifswald.de/augustus/binaries/scripts/) and our web server application (http://bioinf.uni-greifswald.de/webaugustus/index) use this protocol. If you run autoAug.pl or use our web server, you do not have to create hints from ESTs manually!

Input: EST or cDNA fasta file and genome fasta file
Output: AUGUSTUS hints

Software requirements:

This protocol was tested with the following versions:

BLAT version 34 (freely available for non-profit usage at http://hgdownload.cse.ucsc.edu/admin/exe/)
Perl version 5.10.1 (available at http://www.perl.org/get.html)
Ubuntu (any Unix system with bash will do)
Custom Perl scripts (some available at http://bioinf.uni-greifswald.de/augustus/binaries/scripts/, more at https://github.com/Gaius-Augustus/Augustus/tree/master/scripts)

Please read the installation instructions of each toolkit carefully and follow their advice!
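
As a quick pre-flight check, the following minimal sketch reports which of the commands used in this protocol are not on your PATH. The function name missing_tools is a hypothetical helper introduced here for illustration, not part of AUGUSTUS or BLAT.

```shell
# Hypothetical helper: print every command name that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands called in steps 1 to 3 below; no output means all were found.
missing_tools blat pslCDnaFilter blat2hints.pl perl
```

If a name is printed, install that tool or add its directory to PATH before continuing.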

0. Correctly format your fasta headers

Badly formatted fasta headers are the most frequent source of problems when working with AUGUSTUS and related scripts. Before starting this tutorial, please check that the fasta headers of your files are short and unique and do not contain whitespace!

Positive example:

>VDBG_00005T0
MRLIPIHERVELPEDLIPADSRVEIDAKITAGYFTAGKRMTEEELSAVQGRLWEDIDH*
>VDBG_00006T0
MSDPSASAITDHVGLGDVLSTLKSIQLTQASLVTAVESLSRTVPQAATGATIDARSAGPN
DLDQSLDSNNVADLRASQHHVATSEGPELQAPAVPSSPEQRSGFTSRIVLTPDFTNTEPA
SRIGPFPQWGDEKKIVAMDPWGHLAPWLFKDTIENENVDIRPTIAITKAHMKLPELAESV
KSGRL*
>VDBG_00007T0
MGKRKSSSKPQGPKKKDPLPELFPCLFCNHEDAVKPKVDKKSGVGNLSCKVCGQTFQCSI
NYLSAPVDVYSEWVDAADHVSSKQKAVASGLSQGLVTRRMERPIEERDDEGIVADDDEY*

Bad example that needs re-formatting of headers:

>VDBG_00001T0 | VDBG_00001 | Verticillium albo-atrum VaMs.102 galactosyl transferase GMA12/MNN10 family protein (330 aa)
MHGYHHYIATNQAVGDLIENEADRRPQGAWTKPAYLLSLIVAELEKPEDERLEWIFWFDA
DTVVVNPSTPLEVFLPPKSDEDLTSVHLLIAANWDGLNSGAFALRVHPWSVSLLSAVLAY
PIYMSGRTGKDRFRDQSAFQYLLQDDKSPLANSYTKGKEHWATVPMRWFNALPVNNAFSK
NGQGWLFGKKMEGALFDNGTTEIYDDGNGGKIQPWKIMQGDMIVHFAGTTAGGTRDSWMG
PWLDRVEALLPEWNNVTTQHRLRDETAKFWSETSARISSEKAIADAKMKLDAEKKAAADK
AAEAKKAEEERKKAEEEKKKADEEKQPMD*
>VDBG_00002T0 | VDBG_00002 | Verticillium albo-atrum VaMs.102 abhydrolase domain-containing protein (307 aa)
MTRYKSRPSLLGRIIHQAMIIRQGRSFSTSTKAHLKLAYELYEPSSSRAIGHDSHPIIFL
HGLFGSKKNNRSISKVLARDLGRPVFALDLRNHGESPHDRHHDYTSMASDVAGFIIDHNL
DEPTIIGHSMGAKTAMALALRSPDLVRNIISVDNAPVDAVLESGFGNYVEGMKRIERAGV
MRQAEADDILKNHEESLPVRQFLLANLYRPQPNKPQQFRVPLDILGRSLGHMADFPFKNP
EETRFEKPALFIRGTRSKYVADDVLPLIGQFFPRFRLIDVDAGHWLISEKPEAFREAVVD
FLSTSK*
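
One possible way to repair headers like those in the bad example is to keep only the first whitespace-delimited token of each header line and then check that the shortened names are still unique. This is a minimal sketch, not an official AUGUSTUS script. The function names and the demo file are hypothetical, and it assumes the sequence ID follows the ">" directly, without a space.

```shell
# Hypothetical helper: keep only the first whitespace-delimited token of each header line.
clean_fasta_headers() {
    awk '/^>/ { print $1; next } { print }' "$1"
}

# Hypothetical helper: print header names that occur more than once
# (no output means all headers are unique).
duplicate_fasta_headers() {
    grep '^>' "$1" | sort | uniq -d
}

# Tiny demo input, made up for illustration.
demo=$(mktemp -d)
printf '%s\n' \
    '>VDBG_00001T0 | VDBG_00001 | some description (330 aa)' \
    'ACGTACGT' \
    '>VDBG_00002T0 | VDBG_00002 | another description (307 aa)' \
    'TTGACCA' > "$demo/raw.fa"

clean_fasta_headers "$demo/raw.fa" > "$demo/clean.fa"
cat "$demo/clean.fa"
duplicate_fasta_headers "$demo/clean.fa"
```

On real data, call the two functions on your own EST and genome files instead of the demo file. If duplicate_fasta_headers prints anything, the truncated names collide and need a different renaming scheme.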

1. Run BLAT

Use the repeat-masked version of your target genome (if available):

blat -noHead -minIdentity=92 genome.masked.fa ests.fa ests.psl

2. Filter Alignments

pslCDnaFilter -minId=0.9 -localNearBest=0.005 -ignoreNs -bestOverlap ests.psl ests.f.psl
cat ests.f.psl | sort -n -k 16,16 | sort -s -k 14,14 > ests.fs.psl
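
In PSL format, column 14 is the target (genome) sequence name and column 16 is the alignment start on the target, so the two sort calls above order the alignments by sequence and then by start position. The following minimal sketch shows the same ordering with a single sort call on three made-up PSL lines. The helper names sort_psl_by_target and psl_line are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: order PSL lines by target name (column 14),
# then numerically by target start (column 16).
sort_psl_by_target() {
    sort -k 14,14 -k 16,16n "$1"
}

# Hypothetical helper: emit one made-up, single-block PSL line (21 tab-separated columns).
# Arguments: query name, target name, target start, target end
psl_line() {
    printf '100\t0\t0\t0\t0\t0\t0\t0\t+\t%s\t100\t0\t100\t%s\t5000\t%s\t%s\t1\t100,\t0,\t%s,\n' \
        "$1" "$2" "$3" "$4" "$3"
}

# Three unsorted demo alignments.
demo=$(mktemp -d)
{
    psl_line est3 chr2 50 150
    psl_line est1 chr1 1000 1100
    psl_line est2 chr1 200 300
} > "$demo/in.psl"

sort_psl_by_target "$demo/in.psl" > "$demo/sorted.psl"

# Show query name, target name and target start of the sorted alignments.
cut -f 10,14,16 "$demo/sorted.psl"
```

The numeric key matters: a plain text comparison would place start 1000 before start 200 on the same sequence.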

3. Create Hints

blat2hints.pl --in=ests.fs.psl --out=ests.E.hints --minintronlen=35 --trunkSS
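
A quick sanity check on the result is to count the hints per feature type. The sketch below assumes the hints file is tab-separated GFF with the feature type in column 3. The helper name count_hint_types is hypothetical, and the demo lines are made up for illustration rather than taken from real blat2hints.pl output.

```shell
# Hypothetical helper: count hints per feature type (GFF column 3), skipping comment lines.
count_hint_types() {
    grep -v '^#' "$1" | cut -f 3 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Made-up demo hints (assumed layout: 9 tab-separated GFF columns).
demo=$(mktemp -d)
{
    printf 'chr1\tb2h\tep\t201\t300\t0\t.\t.\tgrp=est2;pri=4;src=E\n'
    printf 'chr1\tb2h\tintron\t301\t999\t0\t.\t.\tgrp=est2;pri=4;src=E\n'
    printf 'chr1\tb2h\tep\t1000\t1100\t0\t.\t.\tgrp=est2;pri=4;src=E\n'
} > "$demo/demo.hints"

count_hint_types "$demo/demo.hints"
```

On real data, run count_hint_types ests.E.hints. An empty result, or a file without any intron hints, suggests that something went wrong in one of the earlier steps.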

This tutorial was created by Katharina Hoff, last update May 18th 2012. No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact: augustus-web@uni-greifswald.de