pre-compiled datasets

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

Moderator: bioinf

lucas.schmitz
Posts: 2
Joined: Sun Aug 27, 2017 8:55 pm

Post by lucas.schmitz »

I have a question about the config files for the built-in species. What was the general strategy for generating and training these? That is, did it involve high-quality GenBank files, genome contigs, RNA-Seq, proteomes, ESTs, or a combination of these? I'm curious what the developers would consider the ideal data for training AUGUSTUS. Also, is there any way I could retrace this in the AUGUSTUS installation? I tried to find metadata or something similar, to no avail.
lucas.schmitz
Posts: 2
Joined: Sun Aug 27, 2017 8:55 pm

Re: pre-compiled datasets

Post by lucas.schmitz »

Just a quick follow-up: in the 2004 paper by Mario Stanke, "AUGUSTUS: a web server for gene finding in eukaryotes", I read that single-gene training sets were used to estimate parameters. So what exactly are these single-gene training sets? Are they highly accurate, manually curated genes, not necessarily from the same organism?
katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald

Re: pre-compiled datasets

Post by katharina »

Actually, there is no "one answer for all species". Training genes were generated and selected in different ways for different species.

Training genes are usually from the target organism.

For older parameter sets, ESTs were often used as a basis, yes. For some species, protein sequences were used. More recent parameter sets may be based on RNA-Seq. In general, some form of extrinsic evidence was aligned to the genome, and training gene structures in the target genome were derived from those alignments.

We will keep in mind to publish notes on how the training genes were obtained with future parameter sets.
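
On retracing what ships with an installation: the following is only a minimal sketch, assuming the usual layout in which the `AUGUSTUS_CONFIG_PATH` environment variable points to a config directory that holds one subdirectory per parameter set under `species/`. The function name `list_species_files` is hypothetical and not part of AUGUSTUS. It lists the files present for each species (for example a `<species>_parameters.cfg`), which shows what is there to inspect, but it cannot recover which evidence was used to build the training genes.

```python
from pathlib import Path


def list_species_files(config_path):
    """Map each species directory name to the sorted file names inside it.

    Assumes the layout <config_path>/species/<species>/...; returns an
    empty dict if there is no species directory at that location.
    """
    species_root = Path(config_path) / "species"
    result = {}
    if not species_root.is_dir():
        return result
    for species_dir in sorted(species_root.iterdir()):
        # Skip stray files; only directories are treated as parameter sets.
        if species_dir.is_dir():
            result[species_dir.name] = sorted(
                entry.name for entry in species_dir.iterdir() if entry.is_file()
            )
    return result
```

Calling `list_species_files(os.environ["AUGUSTUS_CONFIG_PATH"])` (after `import os`) and printing the result gives a per-species overview of the parameter files, which can then be opened by hand to look for comments left by whoever trained them.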