pre-compiled datasets
Moderator: bioinf
- Posts: 2
- Joined: Sun Aug 27, 2017 8:55 pm
pre-compiled datasets
I have a question about the config files for the built-in species. What was the general strategy to generate/train these? I.e. did it involve high-quality GenBank files, genome contigs, RNA-Seq, proteome, ESTs, or a combination of these? I'm curious what the developers would consider the ideal data to train Augustus. Also, is there any way I could retrace this in the Augustus installation? I tried to find metadata or something similar, to no avail.
- Posts: 2
- Joined: Sun Aug 27, 2017 8:55 pm
Re: pre-compiled datasets
Just a quick follow-up: in the 2004 paper by Mario Stanke, "AUGUSTUS: a web server for gene finding in eukaryotes", I read that single-gene training sets were used to estimate parameters. So, what exactly are these single-gene training sets? Are these highly accurate/manually curated genes, not necessarily from the same organism?
Re: pre-compiled datasets
Actually, there is no "one answer for all species". Training genes were generated and selected in different ways for different species.
Training genes are usually from the target organism.
For older parameter sets, ESTs were often used as a basis, yes. For some species, protein sequences were used. More recent parameter sets may be based on RNA-Seq. In general, some extrinsic information was mapped against the genome and training gene structures in the target genome were thus generated.
We will keep it in mind to publish notes on training gene information with future parameter sets.
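On retracing things in an installation: until such notes exist, about all one can do is look at which files ship for each species. Below is a minimal sketch of that, assuming the `AUGUSTUS_CONFIG_PATH` environment variable points at the config directory and that each species has its own folder under `species/`. The helper names `list_species` and `species_files` are made up for this example. It only lists files and will not recover how the training genes were produced.

```python
import os
from pathlib import Path


def list_species(config_path):
    """Return sorted names of the species folders under <config_path>/species."""
    root = Path(config_path) / "species"
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def species_files(config_path, species):
    """Return sorted file names in <config_path>/species/<species>, or [] if absent."""
    species_dir = Path(config_path) / "species" / species
    if not species_dir.is_dir():
        return []
    return sorted(p.name for p in species_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    # Assumption: AUGUSTUS_CONFIG_PATH is set in the environment.
    config_path = os.environ.get("AUGUSTUS_CONFIG_PATH")
    if config_path is None:
        print("AUGUSTUS_CONFIG_PATH is not set")
    else:
        for name in list_species(config_path):
            print(name, species_files(config_path, name))
```

Any provenance hints would have to come from comments inside those files, if the authors of a given parameter set left any.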