Bioinformatics Greifswald | Augustus / IncorporateProteins

Creating Hints for AUGUSTUS from Protein Sequences

Do not confuse proteins with peptides! This page describes the generation of hints from full-length proteins.

Input: Protein fasta file and genome fasta file
Output: AUGUSTUS hints

Software requirements:

This protocol was tested with the following versions:

Exonerate version 2.2.0 (available at http://www.ebi.ac.uk/~guy/exonerate/)
NCBI-BLAST tblastn version 2.2.24 (available at ftp://ftp.ncbi.nih.gov/blast/)
cdbfasta version 0.99 (available at http://sourceforge.net/projects/cdbfasta/)
Perl verson 5.10.1 (available at http://www.perl.org/get.html)
Ubuntu (any unix system with a bash will do)
Custom perl scripts: splitMfasta.pl, allBlastMatches_ncbi-blast.pl, runExonerate.pl, exonerate2hints.pl (some available at http://bioinf.uni-greifswald.de/augustus/binaries/scripts/, others available on request from augustus-web@uni-greifswald.de)

Please read the installation instructions of each toolkit carefully and follow their advice!

0. Correctly format your fasta headers

Before actually starting to follow this tutorial, please check that the fasta headers of your genome and protein files are short and unique and do not contain whitespaces!

Positive example:

>VDBG_00005T0
MRLIPIHERVELPEDLIPADSRVEIDAKITAGYFTAGKRMTEEELSAVQGRLWEDIDH*
>VDBG_00006T0
MSDPSASAITDHVGLGDVLSTLKSIQLTQASLVTAVESLSRTVPQAATGATIDARSAGPN
DLDQSLDSNNVADLRASQHHVATSEGPELQAPAVPSSPEQRSGFTSRIVLTPDFTNTEPA
SRIGPFPQWGDEKKIVAMDPWGHLAPWLFKDTIENENVDIRPTIAITKAHMKLPELAESV
KSGRL*
>VDBG_00007T0
MGKRKSSSKPQGPKKKDPLPELFPCLFCNHEDAVKPKVDKKSGVGNLSCKVCGQTFQCSI
NYLSAPVDVYSEWVDAADHVSSKQKAVASGLSQGLVTRRMERPIEERDDEGIVADDDEY*

Bad example that needs re-formatting of headers:

>VDBG_00001T0 | VDBG_00001 | Verticillium albo-atrum VaMs.102 galactosyl transferase GMA12/MNN10 family protein (330 aa)
MHGYHHYIATNQAVGDLIENEADRRPQGAWTKPAYLLSLIVAELEKPEDERLEWIFWFDA
DTVVVNPSTPLEVFLPPKSDEDLTSVHLLIAANWDGLNSGAFALRVHPWSVSLLSAVLAY
PIYMSGRTGKDRFRDQSAFQYLLQDDKSPLANSYTKGKEHWATVPMRWFNALPVNNAFSK
NGQGWLFGKKMEGALFDNGTTEIYDDGNGGKIQPWKIMQGDMIVHFAGTTAGGTRDSWMG
PWLDRVEALLPEWNNVTTQHRLRDETAKFWSETSARISSEKAIADAKMKLDAEKKAAADK
AAEAKKAEEERKKAEEEKKKADEEKQPMD*
>VDBG_00002T0 | VDBG_00002 | Verticillium albo-atrum VaMs.102 abhydrolase domain-containing protein (307 aa)
MTRYKSRPSLLGRIIHQAMIIRQGRSFSTSTKAHLKLAYELYEPSSSRAIGHDSHPIIFL
HGLFGSKKNNRSISKVLARDLGRPVFALDLRNHGESPHDRHHDYTSMASDVAGFIIDHNL
DEPTIIGHSMGAKTAMALALRSPDLVRNIISVDNAPVDAVLESGFGNYVEGMKRIERAGV
MRQAEADDILKNHEESLPVRQFLLANLYRPQPNKPQQFRVPLDILGRSLGHMADFPFKNP
EETRFEKPALFIRGTRSKYVADDVLPLIGQFFPRFRLIDVDAGHWLISEKPEAFREAVVD
FLSTSK*

1. Split protein file

NCBI-BLAST version 2.2.24 throws an error if the input query file is too large. Therefore, we split the protein file in smaller batches:

splitMfasta.pl protein.fa --minsize=20000 --outputpath=split_prot_directory

2. Create a list of potential matches

Attention: this step is optional! We use BLAST only to speed up exonerate, later. If you encounter problems with executing this step, just skip it and run exonerate, directly!

Using BLAST to create a list of potential queries and targets that have matches prior actually running exonerate saves a lot of time.

Run BLAST (formatted as a Sun Grid Engine Script where nFiles needs to be replaced by the actual number of files residing in split_prot_directory):

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -m e
#$ -t 1-nFiles
tblastn -query protein.split.$SGE_TASK_ID.fa -subject genome.fa > blast.$SGE_TASK_ID.out

Subsequent running BLAST, you should check whether it finished without errors:

# check for errors
grep "Assertion failed" blast.*

Continue if no such errors are reported. Otherwise, split protein file in even smaller batch files and try again.

Create indices for protein and genome files:

cdbfasta protein.fa
cdbfasta genome.fa

Produce a list of potential query-target combinations:

# create list with list of genomic contigs matching each protein, discard all hits with e-value > 1*10^-5
cat blast.* | allBlastMatches_ncbi-blast.pl > tblastn.matches
cat tblastn.matches | perl -e 'while(<>){split; if ($q eq $_[0]){$t .= "\t$_[1]"} else {print "$q$t\n"; $t="\t$_[1]";$q=$_[0];}} print "$q$t\n";' > tblastn.matchlists

Split the matchlist into smaller files in order to run exonerate on a Sun Grid Engine in parallel mode:

split -l 1000 tblastn.matchlists -d

3. Run Exonerate

You can either sequentially run exonerate in a bash loop, or wrap it in a SGE array script. The central command is (the exonerate binary must be in your path!):

runExonerate.pl x00 x00 protein.fa.cidx genome.fa.cidx

You'll need to run this command for all the x00, x01, ... files that were produced by the split command.

Subsequently, concatenate the results:

for i in 00 01 ...
do
        cat exonerate.x$i.out >> exonerate.out
done

4. Create hints

exonerate2hints.pl --in=exonerate.out --source=P --out=exonerate.hints

This tutorial was created by Katharina Hoff, last update May 18th 2012. No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact: augustus-web@uni-greifswald.de