pre-compiled datasets

Posted: Sun Aug 27, 2017 9:04 pm
by lucas.schmitz
I have a question about the config files for the built-in species. What was the general strategy for generating/training these? I.e., did it involve high-quality GenBank files, genome contigs, RNA-Seq, proteomes, ESTs, or a combination of these? I'm curious what the developers would consider the ideal data for training Augustus. Also, is there any way I could retrace this in the Augustus installation? I tried to find metadata or something similar, to no avail.

Re: pre-compiled datasets

Posted: Sun Aug 27, 2017 9:21 pm
by lucas.schmitz
Just a quick follow-up: in the 2004 paper by Mario Stanke, "AUGUSTUS: a web server for gene finding in eukaryotes", I read that single-gene training sets were used to estimate parameters. So, what exactly are these single-gene training sets? Are these highly accurate/manually curated genes, not necessarily from the same organism?

Re: pre-compiled datasets

Posted: Tue Oct 24, 2017 12:53 pm
by katharina
Actually, there is no "one answer for all species". Training genes were generated and selected in different ways for different species.

Training genes are usually from the target organism.

For older parameter sets, ESTs were often used as a basis, yes. For some species, protein sequences were used. More recent parameter sets may be based on RNA-Seq. In general, some kind of extrinsic evidence was mapped against the genome, and training gene structures in the target genome were derived from those mappings.

We will keep in mind to publish notes on the training genes with future parameter sets.
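
Until such notes exist, one way to at least see what ships with each built-in parameter set is to list the files in the species directories. The following is only a minimal sketch, assuming the config directory is the one pointed to by the AUGUSTUS_CONFIG_PATH environment variable and that it contains a species/ subdirectory with one folder per species. The function name list_species_files is hypothetical, and the listing will not by itself tell you how the training genes were produced.

```python
import os
from pathlib import Path


def list_species_files(config_path):
    """Map each species directory name to the sorted names of the files in it.

    Assumes the layout <config_path>/species/<species>/...; returns an empty
    dict if that species directory does not exist.
    """
    species_root = Path(config_path) / "species"
    listing = {}
    if not species_root.is_dir():
        return listing
    for species_dir in sorted(species_root.iterdir()):
        if species_dir.is_dir():
            # Only regular files are listed; nested directories are skipped.
            listing[species_dir.name] = sorted(
                f.name for f in species_dir.iterdir() if f.is_file()
            )
    return listing


if __name__ == "__main__":
    config_path = os.environ.get("AUGUSTUS_CONFIG_PATH")
    if config_path is None:
        print("AUGUSTUS_CONFIG_PATH is not set")
    else:
        for species, files in list_species_files(config_path).items():
            print(species, len(files), "files")
```

Any README or notes file that happens to sit next to the parameter files would show up in this listing, which makes it a quick first check before asking about a specific species.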