AUGUSTUS for transcriptome

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

Post Reply
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

AUGUSTUS for transcriptome

Post by katharina »

Originally posted in the old forum by suzuki on 22.07.2012 - 07:05

The ciliate Paramecium tetraurelia is closely related to Tetrahymena thermophilia (translation_table 6).
The Genbank file for Paramecium tetraurelia is available at
http://paramecium.cgm.cnrs-gif.fr/downl ... n_v1.gb.gz
I would like to detect protein-coding regions in transcripts of Paramecium tetraurelia, available at
ftp://ftp.ncbi.nih.gov/repository/UniGe ... eq.uniq.gz
The transcripts may include incomplete (truncated) sequences.
I was wondering if AUGUSTUS can be used to predict protein-coding regions in transcripts as well as genomes, and if there are any tutorials
for this purpose.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by Mario on 23.07.2012 - 17:46
I haven't tried or evaluated augustus for predicting the ORF in an mRNA fragment, but here is my suggestion.

Code: Select all

augustus --species=tetrahymena --genemodel=intronless
augustus deals with truncated genes by default, so that is not a problem. A CDS may be both left- and right-truncated.
Also the non-standard genetic code is no problem
(--translation_table=6 is the default for tetrahymena).
I created the option --genemodel=intronless for eukaryotic species that have (almost no) introns. There is a slight
principal problem as with this option you may get more than 1 ORF in some mRNAs, as augustus "thinks" it is actually
a genomic sequence. This may not be very relevant practically, if you pick
the best (longest?) one in case there is more than 1.
Be aware that augustus cannot predict frameshifts in case your transcriptome sequences contain those errors.
It would be nice to get your feedback as I already had other people asking that question.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by suzuki on 24.07.2012 - 07:50

Code: Select all

  augustus --species=tetrahymena --genemodel=intronless Pte.seq.uniq > augustus.abinitio.gff
  getAnnoFasta.pl augustus.abinitio.gff
augustus predicted 3588 CDS in 3574 of the 5230 mRNAs (Pte.seq.uniq), and thus I got more than 1 ORF in some of the mRNAs. Do you
have any script to pick the longest ORF in case there is more than 1?
All of the predicted CDS contain both start_codon and stop_codon. Thus,
I wonder if augustus deals with truncated genes by default (augustus
may fail to predict CDS that are left- and/or right-truncated).
Can augustus use parameters trained with Paramecium tetraurelia (ptetraurelia_annotation_v1.gb) instead of Tetrahymena thermophilia?
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by Mario on 24.07.2012 - 16:57
I happened to just find an error that only happens in the case of --species=tetrahymena and should result in nonsense predictions, or let's
rather say, undefined behavior.
This only concerns tetrahymena as this uses different pattern sizes for the
non-coding and intron Markov chains.
Please replace igenicmodel.cc of the current release 2.6.1 with this file
http://bioinf.uni-greifswald.de/augustu ... icmodel.cc
and then rerun your predictions. Sorry for that.

About the truncation: I was also wrong, here. Augustus predictions only initial,

Code: Select all

 internal and terminal truncated exons, not single exons (because it cannot know
 whether there are more exons in this case). The result is that truncation works
 not in combination with 
--genemodel=intronless
A workaround is to use the default genemodel (partial) and
make introns very unlikely, e.g. by giving a big malus to introns unsupported
by evidence and an empty hints file. More later...
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by Mario on 24.07.2012 - 18:35
You prevent the predictions of introns (while maintaining the ability to predict truncated CDS) like this:

Code: Select all

cp ~/augustus/trunks/config/extrinsic/extrinsic.cfg .
touch hints.gff
edit the intron lines in extrinsic.cfg so it looks like this

Code: Select all

 intronpart        1  1e-100  M    1  1e+100
     intron        1  1e-100  M    1  1e+100
This means that any unsupported intron (=any intron) is punished with a factor of 10^{-100} or less, so it is effectively prevented.
Then run

Code: Select all

augustus --species=tetrahymena Pte.seq.uniq --extrinsicCfgFile=extrinsic.cfg --hintsfile=hints.gff > augustus.abinitio.gff
This seemed to work for me.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by suzuki on 25.07.2012 - 03:17
~/augustus/src/igenicmodel.cc was replaced accordingly.

Code: Select all

  cp ~/augustus/config/extrinsic/extrinsic.cfg .
and edited the intron lines in extrinsic.cfg accordingly.

Code: Select all

  touch hints.gff
Then running

Code: Select all

  augustus --species=tetrahymena Pte.seq.uniq --extrinsicCfgFile=extrinsic.cfg --hintsfile=hints.gff > augustus.abinitio.gff
predicted short ORF (e.g. protein sequence = [VK] for name = gnl|UG|Pte#S38274454), and terminated prematurely with a segmentation fault.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by Mario on 26.07.2012 - 11:05
I had tried the commands from my last post above on your data and it worked. What version of augustus are you using?
Have you replaced igenicmodel.cc?
Can you please try this:
Edig the Makefile so that the CFLAGS include -g -ggdb
Please recompile then with

Code: Select all

make clean all 
and run your command in the debugger.

Code: Select all

gdb augustus
set args --species=tetrahymena Pte.seq.uniq --extrinsicCfgFile=extrinsic.cfg --hintsfile=hints.gff
r
(wait for segfault)
bt
and post the backtrace.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by suzuki on 26.07.2012 - 12:05
I am using augustus version 2.6.1 (http://bioinf.uni-greifswald.de/augustu ... 6.1.tar.gz).
'diff' command can't find any differences between two files: igenicmodel.cc of the current release 2.6.1 and this file (http://bioinf.uni-greifswald.de/augustu ... icmodel.cc)
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by Mario on 26.07.2012 - 18:54
Please double check. There should be one:

Code: Select all

wget http://bioinf.uni-greifswald.de/augustus/binaries/augustus.2.6.1.tar.gz  
wget http://bioinf.uni-greifswald.de/augustus/binaries/species/igenicmodel.cc
tar xzf augustus.2.6.1.tar.gz
diff src/igenicmodel.cc igenicmodel.cc 
27d26                                                                                                                                                                                                                                                                          
<                                                                                                                                                                                                                                                                              
70,73c69,74                                                                                                                                                                                                                                                                    
<       || IntronModel::GCemiprobs[gcIdx].probs.size()==0) // this happens for intronless species                                                                                                                                                                              
<     emiprobs = GCemiprobs[gcIdx];                                                                                                                                                                                                                                            
<   else // use the intron content model                                                                                                                                                                                                                                       
<     emiprobs = IntronModel::GCemiprobs[gcIdx];                                                                                                                                                                                                                               
---                                                                                                                                                                                                                                                                            
>       || IntronModel::GCemiprobs[gcIdx].probs.size()==0 // this happens for intronless species                                                                                                                                                                               
>       || IntronModel::k != k) {                                                                                                                                                                                                                                              
>       emiprobs = GCemiprobs[gcIdx];                                                                                                                                                                                                                                          
>   } else {// use the intron content model                                                                                                                                                                                                                                    
>       emiprobs = IntronModel::GCemiprobs[gcIdx];                                                                                                                                                                                                                             
>   }
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by suzuki on 27.07.2012 - 03:00
It worked when I compiled (cd src; make) after replacing igenicmodel.cc.
augustus predicted 4048 CDS in 4011 of the 5230 mRNAs (Pte.seq.uniq). Of the 4048 CDS, 2723 contain start_codon, and 3773 contain stop_codon.
Do you have any script to pick the longest ORF in case there is more than 1 ORF in mRNA?
Can augustus use parameters trained with Paramecium tetraurelia (ptetraurelia_annotation_v1.gb) instead of Tetrahymena thermophilia?
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by Mario on 27.07.2012 - 15:07
I am glad, it seemed to work out. Thanks for hanging in.
I don't have a script for you that picks the longest/beest ORF.
You have to train it yourself if you want species specific parameters for Paramecium tetraurelia. Please have a look at the tutorial and the other docs.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by suzuki on 28.07.2012 - 05:54
Because I didn't understand 'Edig the Makefile so that the CFLAGS include -g -ggdb', I didn't do it.
Then, augustus worked on Linux, but augustus terminated prematurely and printed 'Segmentation fault: 11' on Mac OS 10.7.4.
On Mac OS 10.7.4, running command in the debugger printed the following backtrace.

Code: Select all

 # ----- prediction on sequence number 1576 (length = 469, name = gnl|UG|Pte#S38273014) -----
 #
 # Constraints/Hints:
 # (none)
 # Predicted genes for sequence number 1576 on both strands

 Program received signal EXC_BAD_ACCESS, Could not access memory.
 Reason: KERN_INVALID_ADDRESS at address: 0x00000001017ffffc
 SequenceFeatureCollection::localSSMalus (this=0x7fff5fbfe330, type=dssF, pos=-1, strand=minusstrand) at types.hh:319
 319		return value & SET_FLAG(n);
 (gdb) bt
 #0  SequenceFeatureCollection::localSSMalus (this=0x7fff5fbfe330, type=dssF, pos=-1, strand=minusstrand) at types.hh:319
 #1  0x000000010006d8a5 in ExonModel::notEndPartEmiProb (this=0x7fff5fbfe330, beginOfStart=3, right=1606409088, frameOfRight=1606409088, 
     exonparts=0x0) at exonmodel.cc:1431
 #2  0x000000010006ab9b in ExonModel::viterbiForwardAndSampling (algovar=doViterbiOnly, this=0x7fff5fbfdeb0, viterbi=@0x7fff5fbfdeb0, 
     forward=@0x7fff5fbfdeb0, state=1606409904, base=1606409904, oli=@0x7fff5fbfdf18) at exonmodel.cc:989
 #3  0x000000010002b8a4 in NAMGene::viterbiAndForward (this=0x7fff5fbfdfb0, dna=0x7fff5fbfdfb0 "`?_?", useProfile=false)
     at namgene.cc:260
 #4  0x000000010002c393 in NAMGene::findGenes (this=0x7fff5fbfe060, dna=0x7fff5fbfe060 "??_?", strand=1606410336) at namgene.cc:774
 #5  0x000000010002cc44 in NAMGene::getStepGenes (this=0x7fff5fbfe0f0, annoseq=0x7fff5fbfe0f0, sfc=@0x7fff5fbfe330, strand=1606410480, 
     onlyViterbi=true) at namgene.cc:699
 #6  0x000000010002d4e5 in NAMGene::doViterbiPiecewise (this=0x7fff5fbfe7c0, sfc=@0x7fff5fbfe7c0, annoseq=0x7fff5fbfe7c0, 
     strand=1606412224) at namgene.cc:599
 #7  0x0000000100002954 in Gene::destroyGeneSequence () at /Users/suzuki/augustus/include/gene.hh:397
 #8  0x0000000100002954 in predictOnInputSequences (seq=0x1002d5d00, namgene=@0x7fff5fbfe9b0, extrinsicFeatures=@0x7fff5fbfe9b0, 
     strand=1606412720) at augustus.cc:399
 #9  0x0000000100003b56 in AnnoSequence::deleteSequence () at /Users/suzuki/augustus/include/gene.hh:189
 #10 0x0000000100003b56 in main (argc=2, argv=0x7fff5fbff820) at augustus.cc:190
 (gdb) 
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by suzuki on 29.07.2012 - 10:05
According to the manual "Training AUGUSTUS" (http://bioinf.uni-greifswald.de/augustu ... ining.html),
I trained with Paramecium tetraurelia (ptetraurelia_annotation_v1.gb) on Linux.

Code: Select all

 randomSplit.pl ptetraurelia_annotation_v1.gb 100
 grep -c LOCUS ptetraurelia_annotation_v1.gb*
 # ptetraurelia_annotation_v1.gb:697
 # ptetraurelia_annotation_v1.gb.test:100
 # ptetraurelia_annotation_v1.gb.train:597
 new_species.pl --species=paramecium
According to 'SPECIAL CASE: ORGANISM WITH DIFFERENT GENETIC CODE',
I edited the parameter file ($AUGUSTUS_CONFIG_PATH/species/paramecium/paramecium_parameters.cfg).

Code: Select all

translation_table 6
 /Constant/amberprob 0 # Prob(stop codon = tag), if 0 tag is assumed to code for amino acid
 /Constant/ochreprob 0 # Prob(stop codon = taa), if 0 taa is assumed to code for amino acid
 /Constant/opalprob 1 # Prob(stop codon = tga), if 0 tga is assumed to code for amino acid
Then, I made an initial training.

Code: Select all

 etraining --species=paramecium ptetraurelia_annotation_v1.gb.train
 ls -ort $AUGUSTUS_CONFIG_PATH/species/paramecium/
 augustus --species=paramecium ptetraurelia_annotation_v1.gb.test | tee firsttest.out   # takes 17m
 grep -A 22 Evaluation firsttest.out
Here is the accuracy report at the end of firsttest.out.

Code: Select all

 *******      Evaluation of gene prediction     *******

 ---------------------------------------------
                  | sensitivity | specificity |
 ---------------------------------------------|
 nucleotide level |       0.803 |       0.964 |
 ---------------------------------------------/

 ----------------------------------------------------------------------------------------------------------
            |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
            | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
            | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
 ----------------------------------------------------------------------------------------------------------|
            |        |        |      |              11091 |              18586 |             |             |
 exon level |  14148 |  21643 | 3057 | ------------------ | ------------------ |       0.141 |       0.216 |
            |  14148 |  21643 |      | 7222 | 3434 |  435 | 7621 | 6814 | 4151 |             |             |
 ----------------------------------------------------------------------------------------------------------/

 ----------------------------------------------------------------------------
 transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
 ----------------------------------------------------------------------------|
 gene level |  5531 |  6666 |  739 | 4792 | 5927 |       0.111 |       0.134 |
 ----------------------------------------------------------------------------/
At gene level, sensitivity and specificity are 0.111 and 0.134, respectively (of the 6666 genes, 739 were predicted exactly).
When the tetrahymena parameters were used for the test,
augustus --species=tetrahymena ptetraurelia_annotation_v1.gb.test
at gene level, sensitivity and specificity are 0.0101 and 0.0311, respectively (of the 6666 genes, 67 were predicted exactly).
When the self-trained paramecium parameters were used for predicting ORF in mRNAs of Paramecium tetraurelia (Pte.seq.uniq),

Code: Select all

 augustus --species=paramecium Pte.seq.uniq --extrinsicCfgFile=extrinsic.cfg --hintsfile=hints.gff > augustus.abinitio.gff
augustus predicted 5146 CDS in 5102 of the 5230 mRNAs. Of the 5146 CDS, 3668 contain start_codon, and 4747 contain stop_codon.
Thus, the self-trained paramecium parameters seemed to give better results than the tetrahymena parameters.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by Mario on 03.09.2012 - 15:28
As far as the segmentation fault is concerned, I believe I fixed it in the current working code. Thanks for sending the backtrace. In the next update it should not happen anymore.
By 'Edit the Makefile so that the CFLAGS include -g -ggdb'
I meant that the options -g -ggdb are already in the line defining CFLAGS, but they are behind the #-sign which means that everything to the right is just a comment. Moving this somewhere into the list of options and recompiling would have enabled debugging information. too.
Your accuracy results for paramecium are very suspiciously low. According to my experience, that low exon level sensitivity means that something is principally wrong. In particular as the base level results seem ok. Possibly something as all introns are wrong because the training set didn't include any or something related to the stop codons. Hard to guess what it could be without looking at examples, though.
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by suzuki on 18.09.2012 - 04:38
I trained with the Genbank file for Paramecium tetraurelia, available at
http://paramecium.cgm.cnrs-gif.fr/downl ... n_v1.gb.gz
Could you reproduce the same accuracy results?
User avatar
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

Re: AUGUSTUS for transcriptome

Post by katharina »

by suzuki on 06.06.2013 - 13:06
I was wondering how I could provide you with examples?
Post Reply