How to take data input from databases for training of Beta vulgaris genome

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

katharina
Site Admin
Posts: 531
Joined: Wed Nov 18, 2015 6:14 pm
Location: Greifswald


Originally posted in the old forum by SC_LU on 05.05.2014 - 22:55
Hi,
I want to use AUGUSTUS for Beta vulgaris training, but I am a little confused about what data to take and from where.
1. According to the Genome tab, Beta vulgaris has 9 chromosomes (http://www.ncbi.nlm.nih.gov/genome/?term=Beta+vulgaris). So should I download each chromosome separately, combine them into a single Beta vulgaris genome.fasta file, and then submit that as the whole genome?
2. I am slightly unsure about the EST data for this species. When I look for EST data in the NCBI repository (http://www.ncbi.nlm.nih.gov/nucest/?term=Beta+vulgaris), it returns some bogus hits as well. So taking ESTs like this is also confusing, and I am not sure that all the ESTs belong to the same genome.
3. I have the same kind of confusion for proteins.
Please explain how to obtain the data (genome, EST and protein) for Beta vulgaris training.
Re: How to take data input from databases for training of Beta vulgaris genome

by katharina on 07.05.2014 - 11:26
1) Yes, you should concatenate all chromosome files. While doing so, also check the FASTA headers: they should be short and unique (have a look at http://bioinf.uni-greifswald.de/webaugu ... on_problem ).
2) On that NCBI page, on the right, you will find a list "Top Organisms". Click on your target species (in your particular case, you might have to do that twice, in two separate download steps, because Beta vulgaris also has a subspecies listed). This should eliminate ESTs that do not belong to your species from the list. Again, pay attention to the FASTA headers after download; you need to modify them!
3) The same as in 2) should work for proteins.
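
The concatenation and header clean-up mentioned in 1) to 3) could be sketched roughly as follows in Python. This is only a minimal sketch under assumptions: the function name merge_fasta, the prefix argument and the "prefix_N" naming scheme are made up for illustration and are not a format prescribed by AUGUSTUS or WebAUGUSTUS. The only goal is that every header in the merged file ends up short and unique, and that the original headers can still be looked up.

```python
from typing import Iterable


def merge_fasta(inputs: Iterable[str], output: str, prefix: str = "seq") -> dict[str, str]:
    """Concatenate FASTA files into `output` and give every record a short, unique header.

    Headers are renamed to prefix_1, prefix_2, ... (an assumed naming scheme).
    Returns a mapping from each new header to the original header text.
    """
    mapping: dict[str, str] = {}
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    line = line.strip()
                    if line.startswith(">"):
                        # Number records across all input files so IDs never collide.
                        new_id = f"{prefix}_{len(mapping) + 1}"
                        mapping[new_id] = line[1:].strip()
                        out.write(f">{new_id}\n")
                    elif line:
                        # Sequence line: copy as-is, skipping blank lines.
                        out.write(line + "\n")
    return mapping
```

For example, merge_fasta(["chr1.fasta", "chr2.fasta"], "genome.fasta", prefix="chr") (file names are hypothetical) would write one genome.fasta with headers >chr_1, >chr_2 and so on. The same call with a different prefix could be used for the two EST downloads or for the protein files. It may be worth saving the returned mapping to a file so that the short IDs can later be traced back to the NCBI accessions.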