Augustus on WGS and RNA-Seq data

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald

Augustus on WGS and RNA-Seq data

Post by katharina »

Originally posted in the old forum by Raj Sasidharan on 27.02.2013 - 17:34
Hello Augustus-users,
I am interested in using Augustus and have the following data. We have the genome sequence of a plant species, assembled from next-gen sequencing data. The species that we are interested in does not have known cDNA or protein sequences; there are some sequences from related species, but these appear to be mostly organelle-encoded. Most importantly, we have RNA-Seq transcriptome data for this species. Given this, I am wondering whether the best approach would be to align the RNA-Seq reads with a splice-aware short-read aligner like GSNAP and incorporate them into Augustus to predict gene structures on the repeat-masked genome sequence. Would using a plant (say, rice) training set provide better predictions? Any other suggestions on the ideal approach to obtain accurate gene structures would be much appreciated. For instance, is it worth creating hints from a handful of cDNA and protein sequences available for the species?
Thanks,
Raj

by katharina on 28.02.2013 - 09:21
Before incorporating RNA-Seq data, you'll need an AUGUSTUS parameter set that suits your species. Maybe, if you're lucky, one of the existing parameter sets will be okay. But even in order to test that, it would be helpful to have a small "test gene set" that was not predicted by AUGUSTUS, but e.g. generated from alignments of cDNA or proteins to your new genome.
You could try to generate training gene structures using e.g. CEGMA with the core protein set. Train AUGUSTUS with these gene structures (no UTR training!). Or you could try to feed assembled RNA-Seq data into PASA instead of ESTs (also most likely no UTR training!) and train AUGUSTUS on the resulting gene structures.
After obtaining a suitable parameter set, incorporate intron hints from your RNA-Seq data into the gene predictions.
In general, it might help to include as much extrinsic evidence as possible (i.e. add that handful of cDNA and protein sequences).
Katharina
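
For readers wondering what the intron hints mentioned above might look like as a file, here is a minimal Python sketch. It assumes you have already extracted intron coordinates from your spliced RNA-Seq alignments; the function name `intron_hints` and the default values are hypothetical. The nine-column GFF layout with `mult`, `pri` and `src` attributes follows the hints format as I understand it from the AUGUSTUS documentation, so please check it against your version.

```python
from collections import Counter


def intron_hints(introns, source="b2h", priority=4, src="E"):
    """Collapse identical introns and format them as GFF hint lines.

    introns: iterable of (seqname, start, end, strand) tuples with
    1-based inclusive coordinates, e.g. taken from spliced alignments.
    The defaults for source, priority and src are assumptions.
    """
    # Identical introns seen in several reads become one hint with mult=N.
    counts = Counter(introns)
    lines = []
    for (seq, start, end, strand), mult in sorted(counts.items()):
        attrs = f"mult={mult};pri={priority};src={src}"
        # GFF columns: seqname, source, feature, start, end,
        # score, strand, frame, attributes
        fields = [seq, source, "intron", str(start), str(end),
                  "0", strand, ".", attrs]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    example = [("chr1", 100, 200, "+"), ("chr1", 100, 200, "+"),
               ("chr1", 300, 400, "-")]
    print("\n".join(intron_hints(example)))
```

The resulting lines would be written to a hints file and passed to AUGUSTUS together with an extrinsic configuration file (the `--hintsfile` and `--extrinsicCfgFile` options).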

by Raj Sasidharan on 01.03.2013 - 13:35
Thanks Katharina!
Could you point me to where I can find existing parameter sets and a README or tutorial about using CEGMA to generate training gene sets and also obtain the core protein set? I would want to run Augustus locally.
Raj

by Diptarka on 18.06.2013 - 11:24
gene prediction
Hi, I am new to Augustus for gene prediction. Suppose I have Illumina sequencing data for a yeast species and I know its genus from 18S rDNA. How would I find genes in it using Augustus? Could anyone guide me through it step by step?

by katharina on 18.06.2013 - 21:59
First, you need to check whether AUGUSTUS has been trained for your species, or for a close relative of your species (see http://bioinf.uni-greifswald.de/webaugu ... p#param_id). If yes -> proceed to work with the existing parameter set. If not -> you will need more than the Illumina reads and the genome in order to train AUGUSTUS (e.g. proteins from one of the other yeast species, or ESTs from your target species, or maybe even assembled RNA-Seq data instead of the raw reads). Submit your data for training parameters to http://bioinf.uni-greifswald.de/webaugu ... ing/create
After successful training, download the parameters to your local computer and place them into the config/species directory of your local AUGUSTUS installation.
Subsequently, pick your favorite aligner from BLAT, GSNAP and Tophat, and proceed to work according to one of the tutorials listed at http://bioinf.uni-greifswald.de/bioinf/ ... s.Augustus
Katharina
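
The step of placing downloaded parameters into config/species can be sketched in Python as follows. This is only an illustration under assumptions: the function name `install_species` is hypothetical, the downloaded parameters are assumed to be unpacked into a single directory named after the species, and the config directory is taken from the `AUGUSTUS_CONFIG_PATH` environment variable unless given explicitly.

```python
import os
import shutil
from pathlib import Path


def install_species(param_dir, config_path=None):
    """Copy a trained parameter directory into <config>/species/<name>.

    param_dir:   unpacked directory with the trained parameters
                 (assumed to be named after the species).
    config_path: AUGUSTUS config directory; falls back to the
                 AUGUSTUS_CONFIG_PATH environment variable.
    """
    config = Path(config_path or os.environ["AUGUSTUS_CONFIG_PATH"])
    src = Path(param_dir)
    dest = config / "species" / src.name
    # Never overwrite an existing parameter set silently.
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; refusing to overwrite")
    shutil.copytree(src, dest)
    return dest
```

A call such as `install_species("downloads/myyeast")` (a made-up path) would then make the parameters available under the species name `myyeast`.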

by Diptarka on 22.06.2013 - 08:00
Dear Katharina,
The files that I have are in FASTQ format, i.e. the raw reads. How do I actually create a training data set, I mean the training data file, from them?
As per the tutorial, Augustus accepts training data in .gb format, right? How can raw reads be used for training?

by katharina on 23.06.2013 - 12:06
We do not have a standard procedure.
Sometimes, RNA-Seq assemblies produced with Cufflinks will be sufficient for use as ESTs (i.e. you can feed them into our pipeline instead of ESTs). Sometimes that won't work very well, though; it depends on the assembly quality.
If RNA-Seq assemblies yield a low number of training gene structures, you may consider combining them with training genes produced by other techniques, e.g. from alignments of protein sequences to the genome.
It may be a rather challenging task for a person who is inexperienced in bioinformatics.
Katharina
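
To judge whether a set of training gene structures is too small, one rough check is to count the records in the GenBank (.gb) training file. The Python sketch below assumes one training gene per `LOCUS` record, which may not hold for every file. The function names and the `min_genes` default are hypothetical placeholders, not an official threshold.

```python
def count_training_genes(genbank_text):
    """Count GenBank records by their LOCUS lines.

    Assumes each record holds exactly one training gene.
    """
    return sum(1 for line in genbank_text.splitlines()
               if line.startswith("LOCUS"))


def enough_for_training(genbank_text, min_genes=200):
    """Return True if the file has at least min_genes records.

    min_genes is an illustrative default; pick a value that
    matches the guidance for your AUGUSTUS version.
    """
    return count_training_genes(genbank_text) >= min_genes
```

If the count falls short, that is the situation described above where genes from a second source, such as protein alignments, could be added to the set.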
Post by katharina »

We now recommend the usage of BRAKER1!