extract gene sequence from AUGUSTUS predictions

Discussions about predicting genes with AUGUSTUS. Not covered here: WebAUGUSTUS and BRAKER1

Moderator: bioinf

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald
Contact:

extract gene sequence from AUGUSTUS predictions

Post by katharina »

Originally posted in the old forum by Matthew on 15.03.2012 - 10:55
Is there a best way to extract the gene nucleotide sequence from the GFF3 output of Augustus?

Re: extract gene sequence from AUGUSTUS predictions

Post by katharina »

by Mario on 15.03.2012 - 10:59
1. Run augustus with --codingseq=1 (and with --protein=1 if you would like the amino acid sequences as well).
2. Run getAnnoFasta.pl on the resulting gff file; it extracts the sequences from the comment lines of that file.
This script can also retrieve the complete mRNA (if UTRs were predicted) when you provide the genome sequence.
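
To illustrate what "extracting the sequences from the comments" means, here is a minimal Python sketch. It is not getAnnoFasta.pl and makes no claim to reproduce its behaviour. It assumes that the coding sequence appears in comment lines of the form "# coding sequence = [...]" (possibly wrapped over several lines) and that it follows the "transcript" feature line it belongs to. The function name is hypothetical.

```python
def coding_seqs_from_augustus(lines):
    """Return {transcript_id: coding sequence} from AUGUSTUS output lines.

    Assumption: each '# coding sequence = [...]' comment block belongs to
    the most recent feature line whose type (column 3) is 'transcript'.
    """
    seqs = {}
    current_tx = None  # ID of the most recent transcript feature
    chunks = None      # list of sequence pieces while inside [...], else None
    prefix = "coding sequence = ["
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#"):
            fields = line.split(None, 8)
            if len(fields) == 9 and fields[2] == "transcript":
                current_tx = fields[8].strip()
            continue
        text = line.lstrip("#").strip()
        if text.startswith(prefix):
            chunks = []
            text = text[len(prefix):]
        elif chunks is None:
            continue  # an unrelated comment line
        if text.endswith("]"):
            chunks.append(text[:-1])
            seqs[current_tx] = "".join(chunks)
            chunks = None
        else:
            chunks.append(text)
    return seqs
```

The function accepts any iterable of lines, so an open file handle on the augustus output can be passed in directly. Note that it keys the result by transcript ID only, without the sequence name.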

Re: extract gene sequence from AUGUSTUS predictions

Post by katharina »

by Cecilia on 23.05.2012 - 21:30
What if I have already finished the augustus prediction without --codingseq=1 and --protein=1? Is there a script to extract gene sequences from the gff3 file and the genome fasta file?

Re: extract gene sequence from AUGUSTUS predictions

Post by katharina »

by katharina on 23.05.2012 - 22:51
Not that I am aware of. You would either have to google a bit or write that script yourself, I guess.
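
For anyone writing such a script, the following is a minimal Python sketch and not a tool from the AUGUSTUS distribution. All function names are hypothetical. It assumes 1-based inclusive coordinates, a transcript ID given either as transcript_id "..." or as Parent=... in the last column, and a genome small enough to hold in memory. It joins the CDS features of each transcript in coordinate order and reverse-complements transcripts on the minus strand.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string (other characters are kept)."""
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}; name = header up to first space."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def transcript_id(attributes):
    """Assumption: the ID is given as transcript_id "..." or as Parent=..."""
    match = (re.search(r'transcript_id "([^"]+)"', attributes)
             or re.search(r"Parent=([^;\s]+)", attributes))
    return match.group(1) if match else None


def cds_from_gff(gff_lines, genome):
    """Return {transcript_id: spliced CDS sequence} for all CDS features."""
    features = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        tx = transcript_id(fields[8])
        if tx is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        features.setdefault(tx, []).append((start, end, fields[0], fields[6]))
    result = {}
    for tx, feats in features.items():
        feats.sort()  # ascending genomic order
        seq = "".join(genome[name][start - 1:end] for start, end, name, _ in feats)
        result[tx] = revcomp(seq) if feats[0][3] == "-" else seq
    return result
```

The sketch returns the spliced coding sequence. To get the unspliced gene sequence including introns, slice the genome once from the start to the end of the gene feature instead of joining the CDS pieces.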

Re: extract gene sequence from AUGUSTUS predictions

Post by katharina »

by Tom on 12.10.2012 - 15:31
I tried running Augustus with --codingseq=1 and then ran getAnnoFasta.pl on the resulting gff file, but I seem to get the same results as without --codingseq=1 (i.e. I get the exons but not the entire gene nucleotide sequence including introns). Could you please clarify?

Re: extract gene sequence from AUGUSTUS predictions

Post by katharina »

by katharina on 12.10.2012 - 17:18
Here is an example that I recently ran:

augustus --species=some_species --exonnames=on --codingseq=on --protein=on genome.fa > aug.out
getAnnoFasta.pl --seqfile=genome.fa aug.out
The latter command gave me the following files:

aug.cdsexons # contains coding exons 
aug.codingseq # contains CDS sequences 
aug.aa # contains amino acid sequences 
aug.mrna # contains transcript sequences; only produced if you enabled predictions with UTRs
Did you run augustus and the getAnnoFasta.pl script in that manner?
lingonghua
Posts: 3
Joined: Mon Nov 30, 2015 12:08 pm

Re: extract gene sequence from AUGUSTUS predictions

Post by lingonghua »

Hi Katharina,
When using getAnnoFasta.pl to extract the .codingseq file, I find no difference between the result files with and without the --chop_cds option.
My commands are:
augustus --strand=both --species=nasonia aaa.fa --outfile=aaa.gff --codingseq=on --protein=on
perl /home/lgh/ngs/augustus-3.2.1/scripts/getAnnoFasta.pl --chop_cds --seqfile=aaa.fa aaa.gff

For example I have a sequence in aaa.fa of
>C8001580 71.0
TATAAGAAAAAATTTAATACTAAGTAATTTACTTAATTTTTTCCAGTCTTAATGCTGTCA
ATATGGCTTGGTGTAGTTTAGATACTGAGACTATGACATTACTTTGTAAATCTTTGCCTC
CGTCCGTTACGCGTTTAAATATAGCTGGATGCAGAAAAACTATGACAGATGATAGTAAGT
GTAATATTATATTTAAAAAAAAAATTATTATGCCAATTAATATCATTTTTATGTCTTATT
TCAGATGTTAAAGATTTAGTAAAAAGTTGTCCAGATATAATAGAATTAGATTTGAGTGAT
TGTACTATGCTTACAATGAATACTGTTCGTAGTTTACTTGATTTATCAAAATTAGAACAT
TTGTCGTTAAGTCGTTGTTATGGTATACCTCCTTCAACATATGTAACATTGGCATATATG
CCATCTTTGCTATATTTGGATGTTTTTGGTGTAATACCTGAACCAGTACTAAAAACATTA
CAAGTTACCTGTGGTGAAACTCAACTTAATAAATATCTATATAGTTCTGTTGCAAGACCA
ACAGTTGGTGTTCGAAGAACAAGTATTTGGGGACTTCGTGTTAGAGATTGAATGAAACTG
TAACATTTATCTATATAATTAGTGGTAAATAATCAATTTAATACCAAAGGGATATTGAAA
TGTACTTGTGTGCCTTTCTATTAAGTGTCTTTTTAACAAGAGACTGATCTTATGGTACCA
TCAAAATAAAAGCAATACATTTTTTTGTATAAAATTCTGTTTGTCCTTTTTTCCTATTTT
GCTTTTTTATTTATAATTAAAATAATAATTAAATTATATTAAGTTACAATCAATTTTTCA
AATTTATATTAAACAAAACGAAACAAGAGTGAAATAATTCTATTTTCATCACTATTATCT
TGCTTTATTTAATATAAACTCAAATTGCTTATAATTAATAGTAAATTAATAATAATTAAT
AATAAATAATTTTAATTAATATTTTTATATATTTTTATATAT

And finally I always get a sequence in the aaa.codingseq as:
>C8001580.g1.t1
tcttaatgctgtcaatatggcttggtgtagtttagatactgagactatgacattactttgtaaatctttgcctccgtccgttacgcgtttaaatatagctggatgcagaaaaactatgacagatgataatgttaaagatttagtaaaaagttgtccagatataatagaattagatttgagtgattgtactatgcttacaatgaatactgttcgtagtttacttgatttatcaaaattagaacatttgtcgttaagtcgttgttatggtatacctccttcaacatatgtaacattggcatatatgccatctttgctatatttggatgtttttggtgtaatacctgaaccagtactaaaaacattacaagttacctgtggtgaaactcaacttaataaatatctatatagttctgttgcaagaccaacagttggtgttcgaagaacaagtatttggggacttcgtgttagagattga

It does not begin with the first codon.
Is there a bug in getAnnoFasta.pl?

Re: extract gene sequence from AUGUSTUS predictions

Post by katharina »

Can you please also post the corresponding excerpt from aaa.gff?

Re: extract gene sequence from AUGUSTUS predictions

Post by lingonghua »

Hi Katharina,
Thanks for your prompt reply! Here is the entire content of aaa.gff.

# This output was generated with AUGUSTUS (version 3.2.1).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer and L. Romoth.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# Initialising the parameters using config directory /home/lgh/ngs/augustus-3.2.1/config/ ...
# nasonia version. Using default transition matrix.
# Looks like aaa.fa is in fasta format.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 1002, name = C8001580) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 1 on both strands
# start gene g1
C8001580 AUGUSTUS gene 1 591 0.89 + . g1
C8001580 AUGUSTUS transcript 1 591 0.89 + . g1.t1
C8001580 AUGUSTUS intron 1 46 0.89 + . transcript_id "g1.t1"; gene_id "g1";
C8001580 AUGUSTUS intron 175 244 0.92 + . transcript_id "g1.t1"; gene_id "g1";
C8001580 AUGUSTUS CDS 47 174 0.89 + 1 transcript_id "g1.t1"; gene_id "g1";
C8001580 AUGUSTUS CDS 245 591 0.92 + 2 transcript_id "g1.t1"; gene_id "g1";
C8001580 AUGUSTUS stop_codon 589 591 . + 0 transcript_id "g1.t1"; gene_id "g1";
# coding sequence = [tcttaatgctgtcaatatggcttggtgtagtttagatactgagactatgacattactttgtaaatctttgcctccgtcc
# gttacgcgtttaaatatagctggatgcagaaaaactatgacagatgataatgttaaagatttagtaaaaagttgtccagatataatagaattagattt
# gagtgattgtactatgcttacaatgaatactgttcgtagtttacttgatttatcaaaattagaacatttgtcgttaagtcgttgttatggtatacctc
# cttcaacatatgtaacattggcatatatgccatctttgctatatttggatgtttttggtgtaatacctgaaccagtactaaaaacattacaagttacc
# tgtggtgaaactcaacttaataaatatctatatagttctgttgcaagaccaacagttggtgttcgaagaacaagtatttggggacttcgtgttagaga
# ttga]
# protein sequence = [LNAVNMAWCSLDTETMTLLCKSLPPSVTRLNIAGCRKTMTDDNVKDLVKSCPDIIELDLSDCTMLTMNTVRSLLDLSK
# LEHLSLSRCYGIPPSTYVTLAYMPSLLYLDVFGVIPEPVLKTLQVTCGETQLNKYLYSSVARPTVGVRRTSIWGLRVRD]
# end gene g1
###
# command line:
# augustus --strand=both --species=nasonia aaa.fa --outfile=aaa.gff --codingseq=on --protein=on

Re: extract gene sequence from AUGUSTUS predictions

Post by katharina »

getAnnoFasta.pl has two ways of producing coding sequences:

1) Take the coding sequence directly from the AUGUSTUS output. This is what happens in your case, and the --chop_cds flag has no effect in this mode.

2) Extract the coding sequence from the genome file. This is required if the AUGUSTUS output does not contain the coding sequence (e.g. because augustus was run without --codingseq=on). Only in this mode does the --chop_cds flag take effect.

So it is not a bug, but poor documentation. The documentation will be improved in the next release.
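
As a worked illustration with the data from this thread: the first CDS feature (47 to 174) has frame 1, and the predicted protein starts with LNAVNMAW. The Python sketch below drops as many leading bases as the frame column indicates and translates the result. It assumes that this is the effect --chop_cds is meant to have on incomplete transcripts; it is not taken from the getAnnoFasta.pl source, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def chop_by_frame(cds, frame):
    """Drop the leading bases given by the frame column of the first CDS.

    Assumption: frame 1 or 2 means that many bases precede the first
    complete codon; any other value (0 or '.') leaves the sequence as is.
    """
    offset = int(frame) if str(frame) in ("1", "2") else 0
    return cds[offset:]


def translate(cds):
    """Translate complete codons; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


# First 25 bases of the coding sequence reported for g1.t1 in this thread
start_of_cds = "tcttaatgctgtcaatatggcttgg"
print(translate(chop_by_frame(start_of_cds, 1)))  # prints LNAVNMAW
```

Without the chop, translating from the very first base would read the sequence in the wrong frame, which is why the unchopped .codingseq entry does not begin with the first codon.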

Re: extract gene sequence from AUGUSTUS predictions

Post by lingonghua »

Problem solved. Thanks a lot katharina!