
Lab Session on Gene Prediction with AUGUSTUS

Workshop on Comparative Genomics, Jan. 14th 2011, Mario Stanke

In this lab session we practice the most common bioinformatics steps for predicting the protein-coding genes in a eukaryotic genome with AUGUSTUS. We assume the case of a "new" genome, for which AUGUSTUS has not been trained before, but use well-studied species as examples because example data is readily available and visualization is easier.

Styles

Assignments are in this color. The lazy ones may go through this tutorial very quickly by just reading these assignments and copying and pasting the commands that follow them (more or less).

Results are in this color.

[+] Details are hidden...

Example Data

All example files are in the data directory. We recommend you work directly in this directory.

Drosophila melanogaster (Exercises 1-5)

Human (Exercise 6)

For Cheaters: Result Files

You can use the files in the results directory to catch up if you are behind or to compare your results.

Exercise 1: Compile a Training Set

There are several typical options for creating a training set to estimate the parameters of gene finders. Here we will go through option 4:

Spliced alignments of protein sequences

We assume that we have a set of protein sequences of the same or a very closely related species and will use Scipio to infer the gene structures.
  1. Follow the tutorial on "Using Scipio to create a training set" and create a training set genes.gb.
  2. Partition genes.gb into a training set and a holdout test set as described in 1.2 Split gene structure set....
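
To make the partition step concrete, here is a minimal Python sketch of a random train/test split of a GenBank flat file. It is not the procedure from the linked tutorial, which uses its own script; the function name split_genbank, the fixed seed, and the demo records are assumptions made for illustration. The sketch relies only on the GenBank convention that each record ends with a line containing "//".

```python
import random


def split_genbank(text, n_test, seed=0):
    """Split GenBank flat-file text into (train, test) lists of records.

    Each record is assumed to end with a line containing only "//".
    Any trailing lines after the last "//" are ignored.
    """
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    if n_test > len(records):
        raise ValueError("n_test exceeds the number of records")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(records)
    # the first n_test shuffled records become the holdout test set
    return records[n_test:], records[:n_test]


if __name__ == "__main__":
    # hypothetical stand-in for the content of genes.gb
    demo = "".join(f"LOCUS       gene{i}\n//\n" for i in range(10))
    train, test = split_genbank(demo, n_test=3)
    print(len(train), "training records,", len(test), "test records")
```

The holdout records must never be used for parameter estimation; otherwise the accuracy measured on them would be too optimistic.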

Exercise 2: Train the Coding Regions of AUGUSTUS

Let's name our species "bug". Pretending that no AUGUSTUS parameter set for Drosophila (named "fly") existed yet, we will estimate the parameters from the training set.
  1. Create a meta parameters file for bug as described in 2. CREATE A META PARAMETERS FILE...
  2. Estimate the parameters using your training set as described in 3. MAKE AN INITIAL TRAINING
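
As a rough outline of what the two steps above amount to on the command line, the following Python sketch assembles the commands without running them. The file names genes.gb.train and genes.gb.test are assumptions (use whatever names your split produced), and the third command, a first prediction run on the holdout set, is a common follow-up that is added here as an assumption rather than taken from this page. The linked tutorial remains the authoritative source for the exact invocations.

```python
import shlex


def training_commands(species, train_file, test_file):
    """Return the argv lists for a first training round, in order."""
    return [
        # create the species directory with a meta parameters file
        ["new_species.pl", f"--species={species}"],
        # estimate parameters from the training genes
        ["etraining", f"--species={species}", train_file],
        # assumption: predict on the holdout genes for a first accuracy check
        ["augustus", f"--species={species}", test_file],
    ]


if __name__ == "__main__":
    # file names are assumed, not prescribed by this page
    for argv in training_commands("bug", "genes.gb.train", "genes.gb.test"):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch usable on machines where AUGUSTUS is not installed; each line can be pasted into a shell once it is.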

Exercise 3: Ab Initio Predict Genes in the Genome

  1. Predict the protein-coding parts of the genes in a sample sequence of Drosophila melanogaster as described in 1. PREDICT GENES AB INITIO.
  2. Visualize your predicted genes as described in 2. MAKE A CUSTOM GENE PREDICTION TRACK....
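
The idea behind a custom track can be sketched in a few lines of Python: drop the comment lines from the prediction output, keep the feature lines of interest, and put a "track" header line in front. This is only an illustration under assumptions. The function name make_custom_track, the track name, and the choice to keep only CDS features are hypothetical, and the linked tutorial may filter or convert the output differently.

```python
def make_custom_track(gff_lines, name="bug_abinitio", keep=("CDS",)):
    """Return custom-track lines built from gene-prediction GFF lines."""
    track = [f'track name={name} description="AUGUSTUS ab initio predictions"']
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        fields = line.split("\t")
        # the third GFF column holds the feature type
        if len(fields) >= 9 and fields[2] in keep:
            track.append(line)
    return track


if __name__ == "__main__":
    # made-up example lines, not real prediction output
    demo = [
        "# comment line from the gene predictor\n",
        "chr2R\tAUGUSTUS\tgene\t100\t900\t0.5\t+\t.\tg1\n",
        'chr2R\tAUGUSTUS\tCDS\t100\t300\t0.9\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n',
    ]
    print("\n".join(make_custom_track(demo)))
```

Sequence names in the track must match those of the genome browser assembly, so a renaming step may be needed before upload.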

Exercise 4: Prepare hints

Construct extrinsic evidence about genes from transcriptome data (ESTs and RNA-Seq) following the instructions in 3. PREPARE HINTS.
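
Hints are passed to AUGUSTUS as GFF lines whose last column names the evidence source (for example src=E for EST-like evidence). The Python sketch below turns a list of intron coordinates, as one might extract from spliced alignments, into such lines and collapses identical introns into a multiplicity count. The function name intron_hints, the program label est_align, and the example coordinates are hypothetical; the linked tutorial uses dedicated tools for this step.

```python
def intron_hints(introns, source_tag="E", program="est_align"):
    """Format (seqname, start, end, strand) introns as hint GFF lines.

    Identical introns are collapsed and counted in the mult attribute.
    """
    counts = {}
    for intron in introns:
        counts[intron] = counts.get(intron, 0) + 1
    lines = []
    for (seq, start, end, strand), mult in sorted(counts.items()):
        # only write mult when more than one alignment supports the intron
        attrs = f"mult={mult};src={source_tag}" if mult > 1 else f"src={source_tag}"
        lines.append("\t".join(
            [seq, program, "intron", str(start), str(end), "0", strand, ".", attrs]
        ))
    return lines


if __name__ == "__main__":
    # made-up intron coordinates for illustration
    demo = [("chr2R", 1001, 2000, "+"), ("chr2R", 1001, 2000, "+"), ("chr2R", 5001, 5400, "-")]
    print("\n".join(intron_hints(demo)))
```

Collapsing duplicates keeps the hints file small when many reads support the same intron, which matters in particular for RNA-Seq data.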

Exercise 5: Predict Genes Using Hints

Structurally annotate an example sequence from Drosophila based on the hints from exercise 4 by
  1. setting the hint parameters
  2. predicting genes using hints

Exercise 6: Identify Members of a Protein Family

Use the new PPX extension of AUGUSTUS to find gene structures based on a multiple alignment of a protein family as described in AUGUSTUS-PPX.