Formatting the GFF file

Post by **katharina** » Thu Nov 19, 2015 2:58 pm

Originally posted in the old forum by Clement on 16.10.2014 - 19:19
Hi all,
I'm using Augustus for gene prediction in a new species. Both the genome and the hint file are home-made.
When running autoAugTrain.pl, I get the following error message:
ERROR: training.gb is empty. Possible reasons:
a) features in a provided training gene structure gff file were not compliant with the autoAug.pl pipeline (for instructions read at e.g.
http://bioinf.uni-greifswald.de/webaugu ... #structure)
The link seems to be invalid now, so I'm hoping to get help in this forum. This is what my GFF file looks like:
pilon_round_18_contig_2558 exonerate CDS 345798 345888 . - . ID=gene_1
pilon_round_18_contig_2558 exonerate CDS 344999 345193 . - . ID=gene_1
pilon_round_18_contig_3684 exonerate CDS 684414 685064 . - . ID=gene_2
pilon_round_18_contig_3684 exonerate CDS 683996 684190 . - . ID=gene_2
Do you see any obvious problem with the formatting?
Best,
Clement

Post by **katharina** » Thu Nov 19, 2015 2:58 pm

by katharina on 17.10.2014 - 07:58
You can find a description of the accepted format at http://bioinf.uni-greifswald.de/webaugu ... ile_format
Have a look at the last column of your file.
Katharina

Post by **katharina** » Thu Nov 19, 2015 2:58 pm

by Clement on 17.10.2014 - 12:26
Thanks for the link.
I think I've adopted the exact same formatting now (see below) and I still get the same error message...
Any idea what else could be the problem?
Thank you very much!
Clement
pilon_round_18_contig_2558 exonerate CDS 345798 345888 1 - . transcript_id "1"
pilon_round_18_contig_2558 exonerate CDS 344999 345193 1 - . transcript_id "1"
684414 685064 1 - . transcript_id "2"
pilon_round_18_contig_3684 exonerate CDS 683996 684190 1 - . transcript_id "2"

Post by **katharina** » Thu Nov 19, 2015 2:58 pm

by Clement on 17.10.2014 - 12:28
The third line of the file is formatted just like the three others.
Sorry about the bad copy/paste.

Post by **katharina** » Thu Nov 19, 2015 2:58 pm

by katharina on 17.10.2014 - 13:01
That's weird. I have submitted a job with your file format (corrected line 3, adapted to fasta names in one of my fasta files, made sure the contigs are long enough):

NT_039169.8 exonerate CDS 345798 345888 1 - . transcript_id "1" 
NT_039169.8 exonerate CDS 344999 345193 1 - . transcript_id "1" 
NT_039169.8 exonerate CDS 684414 685064 1 - . transcript_id "2" 
NT_039169.8 exonerate CDS 683996 684190 1 - . transcript_id "2"

And it works fine. I mean, I cannot train AUGUSTUS with two genes, and obviously I used a wrong sequence template, but the file format works.
Katharina
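
For readers converting a file like the one in the first post, here is a minimal Python sketch of the last-column change discussed above: it rewrites `ID=gene_N` into `transcript_id "N"`. It assumes the real file is tab-separated with nine columns (the forum display collapses the separators), and the function name `to_transcript_id` is made up for illustration. It touches only the last column; the score column that Clement also changed is left alone.

```python
import re


def to_transcript_id(line):
    """Rewrite a last column of the form ID=gene_N to transcript_id "N".

    Assumption: columns are tab-separated and there are exactly nine.
    Lines whose last column has another form are returned unchanged.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    match = re.fullmatch(r"ID=gene_(\w+)", cols[8].strip())
    if match:
        cols[8] = f'transcript_id "{match.group(1)}"'
    return "\t".join(cols)


if __name__ == "__main__":
    example = "\t".join([
        "pilon_round_18_contig_2558", "exonerate", "CDS",
        "345798", "345888", ".", "-", ".", "ID=gene_1",
    ])
    print(to_transcript_id(example))
```

A line that already carries `transcript_id "1"` passes through untouched, so running the sketch twice on the same file should be harmless.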

Post by **katharina** » Thu Nov 19, 2015 2:59 pm

by Clement on 17.10.2014 - 13:08
Hey
I think I just found the issue. It was really silly of me: after I ran the script that simplifies the headers of the genome file, the first column of the hint file no longer matched the FASTA headers of the genome file!
Anyway, it seems to be running now. Sorry about that, and thanks for your help.
Best,
Clement
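
A mismatch like this can be spotted before training with a quick check. The sketch below is only an illustration, not part of the AUGUSTUS pipeline: it lists every name in the first GFF column that has no matching FASTA header. It assumes tab-separated GFF columns and takes the sequence name to be the first whitespace-delimited word after `>` in each header. The function names are hypothetical.

```python
def fasta_names(fasta_lines):
    """Collect sequence names from FASTA header lines.

    Assumption: the name is the first whitespace-delimited word after '>'.
    """
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def unmatched_seqnames(gff_lines, fasta_lines):
    """Return the sorted GFF first-column names absent from the FASTA headers."""
    known = fasta_names(fasta_lines)
    missing = set()
    for line in gff_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        seqname = line.split("\t")[0].strip()
        if seqname not in known:
            missing.add(seqname)
    return sorted(missing)


if __name__ == "__main__":
    # Made-up data: the genome headers were simplified, the GFF was not.
    fasta = [">contig_2558\n", "ACGTACGT\n"]
    gff = ['pilon_round_18_contig_2558\texonerate\tCDS\t345798\t345888\t1\t-\t.\ttranscript_id "1"\n']
    for name in unmatched_seqnames(gff, fasta):
        print("not in genome file:", name)
```

With real files, the two arguments can be open file handles, for example `unmatched_seqnames(open("genes.gff"), open("genome.fa"))`, where both file names are placeholders. An empty result means every name in the first column was found among the headers.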
