Recent Changes - Search:



About PmWiki

edit SideBar


Creating Hints for AUGUSTUS from Protein Sequences

Do not confuse proteins with peptides! This page describes the generation of hints from full-length proteins.

  • Input: Protein fasta file and genome fasta file
  • Output: AUGUSTUS hints

Software requirements:

This protocol was tested with the following versions:

Please read the installation instructions of each toolkit carefully and follow their advice!

0. Correctly format your fasta headers

Before actually starting to follow this tutorial, please check that the fasta headers of your genome and protein files are short and unique and do not contain whitespaces!

Positive example:


Bad example that needs re-formatting of headers:

>VDBG_00001T0 | VDBG_00001 | Verticillium albo-atrum VaMs.102 galactosyl transferase GMA12/MNN10 family protein (330 aa)
>VDBG_00002T0 | VDBG_00002 | Verticillium albo-atrum VaMs.102 abhydrolase domain-containing protein (307 aa)

1. Split protein file

NCBI-BLAST version 2.2.24 throws an error if the input query file is too large. Therefore, we split the protein file in smaller batches: protein.fa --minsize=20000 --outputpath=split_prot_directory

2. Create a list of potential matches

Attention: this step is optional! We use BLAST only to speed up exonerate, later. If you encounter problems with executing this step, just skip it and run exonerate, directly!

Using BLAST to create a list of potential queries and targets that have matches prior actually running exonerate saves a lot of time.

Run BLAST (formatted as a Sun Grid Engine Script where nFiles needs to be replaced by the actual number of files residing in split_prot_directory):

#$ -S /bin/bash
#$ -cwd
#$ -m e
#$ -t 1-nFiles
tblastn -query protein.split.$SGE_TASK_ID.fa -subject genome.fa > blast.$SGE_TASK_ID.out

Subsequent running BLAST, you should check whether it finished without errors:

# check for errors
grep "Assertion failed" blast.*

Continue if no such errors are reported. Otherwise, split protein file in even smaller batch files and try again.

Create indices for protein and genome files:

cdbfasta protein.fa
cdbfasta genome.fa

Produce a list of potential query-target combinations:

# create list with list of genomic contigs matching each protein, discard all hits with e-value > 1*10^-5
cat blast.* | > tblastn.matches
cat tblastn.matches | perl -e 'while(<>){split; if ($q eq $_[0]){$t .= "\t$_[1]"} else {print "$q$t\n"; $t="\t$_[1]";$q=$_[0];}} print "$q$t\n";' > tblastn.matchlists

Split the matchlist into smaller files in order to run exonerate on a Sun Grid Engine in parallel mode:

split -l 1000 tblastn.matchlists -d

3. Run Exonerate

You can either sequentially run exonerate in a bash loop, or wrap it in a SGE array script. The central command is (the exonerate binary must be in your path!): x00 x00 protein.fa.cidx genome.fa.cidx

You'll need to run this command for all the x00, x01, ... files that were produced by the split command.

Subsequently, concatenate the results:

for i in 00 01 ...
        cat exonerate.x$i.out >> exonerate.out

4. Create hints --in=exonerate.out --source=P --out=exonerate.hints

This tutorial was created by Katharina Hoff, last update May 18th 2012. No warranty for completeness or ability to run. No responsibility for links to external web pages. Contact:

Edit - History - Print - Recent Changes - Search
Page last modified on April 23, 2014, at 03:59 PM