Navigate to Lab Session on AUGUSTUS. Using Scipio. Training AUGUSTUS. AUGUSTUS-PPX.

Show all / no details.

Predicting Genes with AUGUSTUS

This tutorial describes various typical settings for predicting genes with AUGUSTUS.

1. PREDICT GENES AB INITIO

Ab initio prediction means that no other input is used than the target genome itself. Below, you will find examples of predictions that use evidence (hints), here we use none. Predict the genes in the range 7,000,001-7,500,000 of chr2R of D. melanogaster. Use the FASTA format file chr2R.fa, which includes the whole chromosome 2R. To shorten this test run (or when running several jobs in parallel) you should specify a subrange of the input sequence using the parameters --predictionStart and --predictionEnd.

augustus --species=fly --predictionStart=7000001 --predictionEnd=7500000 chr2R.fa > augustus.abinitio.gff   # takes ~1m

In this example, I am using the fly parameters for comparability whith predictions below. Of course, the self-trained bug parameters also work. The output file augustus.abinitio.gff now contains the predicted gene structures in GFF format with additional comments (lines starting with #).

# This output was generated with AUGUSTUS (version 2.5).
...
# start gene g1
chr2R AUGUSTUS   gene        7007533 7010935 0.02  - .  g1
chr2R AUGUSTUS   transcript  7007533 7010935 0.02  - .  g1.t1
chr2R AUGUSTUS   tts         7007533 7007533 .     - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7007533 7008630 .     - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   stop_codon  7007610 7007612 .     - 0  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   intron      7008631 7008694 1     - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   intron      7008812 7008865 0.88  - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   intron      7009192 7009251 0.95  - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   CDS         7007610 7008630 1     - 1  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   CDS         7008695 7008811 0.88  - 1  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7008695 7008811 .     - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   CDS         7008866 7009191 0.99  - 0  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7008866 7009191 .     - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   CDS         7009252 7009353 0.95  - 0  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7009252 7009429 .     - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   start_codon 7009351 7009353 .     - 0  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7010820 7010935 .     - .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   tss         7010935 7010935 .     - .  transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MNTLSSARSVAIYVGPVRSSRSASVLAHEQAKSSITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKR
# YGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFREHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNIN
# NEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNRDNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSL
# NVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAGVDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMK
# DMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEATYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGK
# RVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ]
# end gene g1
...

For a description of the GFF format see the GFF definition at the Sanger Centre.

If you also want the protein sequences you can retrieve them with

getAnnoFasta.pl augustus.abinitio.gff

which extracts the peptide sequences into a file augustus.abinitio.aa:

>g1.t1
MNKLNLVLITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKRYGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFR
EHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNINNEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNR
DNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSLNVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAG
VDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMKDMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEA
TYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGKRVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ
>g2.t1
MRHRNKGAVKRKGPSAGAEQEQELKKPKSEFSNGFKRYITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKRYGDIYVMPGMFGRKDWVT
...

2. MAKE A CUSTOM GENE PREDICTION TRACK ON THE UCSC GENOME BROWSER

In order to visually inspect our results and to compare with the FlyBase annotation we will now make a custom track of the gene structures in augustus.abinitio.gff. We need to create a few header lines in the custom track file which we can either do manually with an editor or like below on the command line (cut and paste).

echo -e "browser position chr2R:7000000-7050000\n\
browser hide multiz15way bdtnpChipper\n\
track name=abinitio description=\"Augustus ab initio predictions\" db=dm3 visibility=3" > abinitio.browser
grep -P "AUGUSTUS\tCDS" augustus.abinitio.gff >> abinitio.browser

With the grep command we just filtered out the lines that specify the coding exon coordinates.
The resulting file abinitio.browser is now in UCSC custom track format and looks like this:

browser position chr2R:7000000-7050000
browser hide multiz15way bdtnpChipper
track name=abinitio description="Augustus ab initio predictions" db=dm3 visibility=3
chr2R   AUGUSTUS   CDS  7007610 7008630 1     -   1  transcript_id "g1.t1"; gene_id "g1";
chr2R   AUGUSTUS   CDS  7008695 7008811 0.88  -   1  transcript_id "g1.t1"; gene_id "g1";
chr2R   AUGUSTUS   CDS  7008866 7009191 0.99  -   0  transcript_id "g1.t1"; gene_id "g1";
chr2R   AUGUSTUS   CDS  7009252 7009353 0.95  -   0  transcript_id "g1.t1"; gene_id "g1";
chr2R   AUGUSTUS   CDS  7011376 7012396 1     -   1  transcript_id "g2.t1"; gene_id "g2";
chr2R   AUGUSTUS   CDS  7012461 7012577 0.92  -   1  transcript_id "g2.t1"; gene_id "g2";
chr2R   AUGUSTUS   CDS  7012632 7012957 0.99  -   0  transcript_id "g2.t1"; gene_id "g2";
...

Now upload this file as a custom track on the UCSC genome browser.

[+] How again?

The lazy ones can look at the result by clicking this link to a previously prepared custom track.

3. PREPARE HINTS

Hints are extrinsic evidence about the location and struture of genes in a particular GFF format. When predicting genes AUGUSTUS can incorporate these hints, which will change the likelihood of gene structures candidates. It will tend to predict gene structures that are in agreement with the hints.

Sources of Hints

ESTs or mRNAs	transcriptome reads, long enough to span several exons (454, Sanger)
RNA-Seq	high coverage short read transcriptome sequences (Illumina, SOLiD)
genomic conservation
MSMS

Below, we will practice the preparation of hints from ESTs or mRNA (3.1) and from RNA-Seq (3.2).

3.1 From ESTs

As an example, we will use a set of 8458 ESTs which map to chr2R:7000000-8000000 of of Drosophila: est.chr2R.7M-8M.fa.

Align the ESTs against chr2R using BLAT.

blat -noHead chr2R.fa est.chr2R.7M-8M.fa est.psl # takes ~3m

This creates an alignment file est.psl in PSL format:

440 5 0  2  0  0  1   1281 - gi|1703783  447  0  447 chr2R  21146708  7776697 7778425 2   197,250,    0,197,     7776697,7778175,
494 3 0  1  0  0  2   65   + gi|1703784  498  0  498 chr2R  21146708  7775550 7776113 3   452,12,34,  0,452,464,x 7775550,7776003,7776079,
...

However, typically some ESTs align well to very many places in the genome. BLAT also includes short local alignments starting from 30bp. For this reason, we further filter the alignments:

cat est.psl | filterPSL.pl --best --minCover=80 > est.f.psl

est.f.psl now only contains for each query the best alginment(s) and that only if it covers at least 80% of the query length. This reduces the number of alignments:

wc -l est.psl est.f.psl
# 10487 est.psl
# 8606 est.f.psl

We can have a look at those EST alignments by creating another custom browser track:

echo -e "browser position chr2R:7000000-7050000\n\
track name=ESTs description=\"EST alignments\" db=dm3 visibility=4" > ests.browser
cat est.f.psl >> ests.browser
gzip ests.browser

You can now upload ests.browser.gz as another custom track or click on this prepared custom track

Next, generate hints from the EST alignments:

blat2hints.pl --nomult --in=est.f.psl --out=hints.est.gff

The script blat2hints.pl identifies the positions of likely introns, exons and exonic regions (termed exonpart or ep) from the alignments. The file hints.est.gff now contains these hints sorted by third column. However, they are internally grouped by the the group name following grp= in the last column. An example group of hints may look like this.
E.g.

grep 15058069 hints.est.gff

yields

chr2R   b2h     ep      7559574 7559803 0   .   .    grp=gi|15058069;pri=4;src=E
chr2R   b2h     ep      7560550 7560814 0   .   .    grp=gi|15058069;pri=4;src=E
chr2R   b2h     exon    7560222 7560347 0   .   .    grp=gi|15058069;pri=4;src=E
chr2R   b2h     intron  7559804 7560221 0   .   .    grp=gi|15058069;pri=4;src=E
chr2R   b2h     intron  7560348 7560549 0   .   .    grp=gi|15058069;pri=4;src=E

3.2 From RNA-Seq

Massive amounts of (short) transcriptome reads from next generation sequencing methods like Illumina first need to be aligned to the genome. Recently, a large number of short read aligners was developed. For the sake of this tutorial we will assume that we have already aligned the reads to the genome and that we have two files:

The file chr2R.7M-8M.wig contains a coverage graph, that contains for each base in the window chr2R:7000000-8000000 the number of reads alignments that cover the position. The UCSC group has a description of the wiggle format (.wig).
The file hints.rnaseq.intron.gff contains likely intron positions, inferred from gaps in the query of the read alignments. Together with the intron boundaries the multiplicity (mult) is reported, which counts the number of alignments that support the given intron candidate, if there is more than one.

In this readme about AUGUSTUS in the RGASP assessment a method is described that would produce two such files. TopHat is a spliced read mapper for RNA-Seq that also produces a coverage grap and reports introns with their multiplicity.

Upload the file chr2R.7M-8M.wig as a UCSC custom track (gziping would speed up upload) or click on this prepared custom track.

Generate hints about exonic regions from the coverage graph (wig file):

cat chr2R.7M-8M.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 \
 --src=W --type=ep --radius=4.5 > hints.rnaseq.ep.gff

The resulting hints.rnaseq.ep.gff now contains hints in a GFF format that augustus understands:

chr2R   w2h   ep    7007551 7007560 5.300   .    .    src=W;mult=5;
chr2R   w2h   ep    7007561 7007570 7.400   .    .    src=W;mult=7;
chr2R   w2h   ep    7007571 7007580 9.700   .    .    src=W;mult=9;
chr2R   w2h   ep    7007581 7007590 10.200  .    .    src=W;mult=10;

Again, at the end of the column, the multiplicity (mult) contains the average coverage in the given interval. augustus will consider each such line evidence that the sequence interval is part of an exon (ep=exonpart).

3.3 Concatenate all Hints

Now, join the hints from all sources into one file:

cat hints.est.gff hints.rnaseq.intron.gff hints.rnaseq.ep.gff > hints.gff

hints.gff now contains the exon, intron and exonpart hints from ESTs as well as the intron and exonpart hints from RNA-Seq.

4. SET HINT PARAMETERS

The strength of the influence of the hints can be adjusted with a few parameters from no influence (ab initio) up to the point where they should be trusted completely and "anchor" the gene structure. These parameters are stored in an extrinsic configuration file. The folder config/extrinsic of the AUGUSTUS package contains a few examples.

Start by copying another extrinsic configuration file:

cp $AUGUSTUS_CONFIG_PATH/extrinsic/extrinsic.M.RM.E.W.cfg extrinsic.bug.cfg

Now edit extrinsic.bug.cfg so that the non-comment lines are like this. Alternatively, you may just copy that file from the result files and edit some of the bold numbers below.

[SOURCES]
M E W

[SOURCE-PARAMETERS]

[GENERAL]
      start        1        1  M    1  1e+100  E 1    1    W 1    1
       stop        1        1  M    1  1e+100  E 1    1    W 1    1
        tss        1        1  M    1  1e+100  E 1    1    W 1    1
        tts        1        1  M    1  1e+100  E 1    1    W 1    1
        ass        1        1  M    1  1e+100  E 1    1    W 1    1
        dss        1        1  M    1  1e+100  E 1    1    W 1    1
   exonpart        1     .997  M    1  1e+100  E 1    1e2  W 1    1.05
       exon        1        1  M    1  1e+100  E 1    1e4  W 1    1
 intronpart        1        1  M    1  1e+100  E 1    1    W 1    1
     intron        1       .3  M    1  1e+100  E 1    1e6  W 1    1
    CDSpart        1  1 0.985  M    1  1e+100  E 1    1	   W 1    1
        CDS        1        1  M    1  1e+100  E 1    1    W 1    1
    UTRpart        1    1 .96  M    1  1e+100  E 1    1    W 1    1
        UTR        1        1  M    1  1e+100  E 1    1    W 1    1
     irpart        1        1  M    1  1e+100  E 1    1    W 1    1
nonexonpart        1        1  M    1  1e+100  E 1    1    W 1    1
  genicpart        1        1  M    1  1e+100  E 1    1    W 1    1

The bold green numbers specify the bonus that a gene structure candidate gets for being compatible with a hint of that type and source.

For example, the 1e6 in the intron row after the source letter E means that for each intron hint from ESTs (src=E), gene structures that have an intron with both boundaries given as in the hint are rewarded by a factor of 1 million relatively to gene structures disregarding the intron hint.

The 1.05 in the exonpart row after the letter W specifies that for each exonpart hint in the RNA-Seq hints file (src=W), every gene structure that has an exon including the range of the hint gets a relative bonus factor 1.05 per multiplicity.

The red numbers mean a punishment (malus) for gene structures with unsupported features. For example, the .3 in the intron row means that every intron candidate that has no intron hints supporting it is punished by multiplying its relative probability with the factor 0.3. If you decrease this number even more (say from .3 to .001) then fewer introns unsupported by spliced transcriptome reads should be predicted. This would likely decrease the false positive intron rate, but also more true unsupported introns would be missed.

For more information look at into one of the extrinsic.cfg files.

5. PREDICT GENES USING HINTS

Predict the genes in the range 7,000,001-7,500,000 of chr2R of D. melanogaster using evidence from hints.gff.

augustus --species=fly --predictionStart=7000001 --predictionEnd=7500000 chr2R.fa \
 --extrinsicCfgFile=extrinsic.bug.cfg --hintsfile=hints.gff > augustus.hints.gff # takes ~9m

The species fly contains UTR parameters, which we didn't have the time to train for bug. When using RNA-Seq as hints it is better to use a model with UTRs, as a significant fraction of reads map to UTRs. It is also possible to use bug here, though.

The output augustus.hints.gff now looks like that

# This output was generated with AUGUSTUS (version 2.5).
...
# start gene g1
chr2R AUGUSTUS   gene        7007533 7009385 0.2 -  .  g1
chr2R AUGUSTUS   transcript  7007533 7009385 0.2 -  .  g1.t1
chr2R AUGUSTUS   tts         7007533 7007533 .   -  .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7007533 7008630 .   -  .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   stop_codon  7007610 7007612 .   -  0  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   intron      7008631 7008694 1   -  .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   intron      7008812 7008865 1   -  .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   intron      7009192 7009251 1   -  .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   CDS         7007610 7008630 1   -  1  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   CDS         7008695 7008811 1   -  1  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7008695 7008811 .   -  .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   CDS         7008866 7009191 1   -  0  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7008866 7009191 .   -  .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   CDS         7009252 7009353 0.94-  0  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   exon        7009252 7009385 .   -  .  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   start_codon 7009351 7009353 .   -  0  transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS   tss         7009385 7009385 .   -  .  transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MNTLSSARSVAIYVGPVRSSRSASVLAHEQAKSSITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKR
# YGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFREHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNIN
# NEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNRDNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSL
# NVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAGVDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMK
# DMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEATYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGK
# RVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 100
# CDS exons: 4/4
#      E:   4 
#      W:   4 
# CDS introns: 3/3
#      E:   3 
# 5'UTR exons and introns: 1/1
#      E:   1 
# 3'UTR exons and introns: 1/1
#      W:   1 
# hint groups fully obeyed: 137
#      E:   4 (gi|15542574,SRR023546.8642467/1)
#      W: 133 
# incompatible hint groups: 18
#      E:  18 (gi|13769068,gi|4203815,gi|15543927,gi|38623822,gi|15539951,gi|14693753,gi|14699170,...)
# end gene g1
...

After each predicted transcript a little statistics follows about the support and compatibility of this transcript with the hints. Note, that AUGUSTUS now predicts alternative splice forms (ending e.g. in .t2).

Finally, make another custom track with the predictions using hints:

echo -e "browser position chr2R:7299000-7318000\n\
track name=withhints description=\"Augustus predictions using hints\" db=dm3 visibility=3" > withhints.browser
grep -P "AUGUSTUS\t(CDS|exon)" augustus.hints.gff >> withhints.browser

In the last line we are filering out from the predictions just the lines specifying exons and CDS. The additional exon lines identify the UTR (if you used fly) by subtracting the CDS ranges.

Again, you may also just click on this like to a prepared custom track with the preditions.