Problem with training gene structures

Discussions about training AUGUSTUS from various sources of evidence. Not discussed here: BRAKER1 and WebAUGUSTUS!

Moderator: bioinf

Post Reply
norbert
Posts: 2
Joined: Mon Jan 03, 2022 2:22 pm

Problem with training gene structures

Post by norbert »

Hi,
I have already asked this question on github. Maybe someone can help me here.
I want to generate training gene structures from assembled NovaSeq sequence data and PacBio (Sequel II) genome sequences.
I followed the procedures in the “augustus.wrp.pdf” and stuck at:
:$ gtf2aa.pl genome.fa bonafide.f.gtf prot.aa
The generated port.aa file does not contain separated protein sequences. Instead, it contains one long aminoacid string separated by * as stop.
The file prot.aa contains:

MSQPSSPTIHLIREERELEKHLYAHPMPFLVVPRRLKSKSFFIFEQEAELQESFLYLFFMDNQFISFSLGVIFSIAFFTAFKGWRCSSNVFFELELEKQKKKAEYLLNRSWNHYCNTTGREGQLPEGLSFKDMIPLFIEMEDNIKEKILSLNQMNLDLISNGMESDYFINIINSYGLMTSSNTRSGLIPYQKSLSQLLEKFEEVTFFHLSREKKKFADALTTLASMTKIDCGVDLQPLQIEARDSQAYCLNIEEEPDGRPWYHGMLQYIKTWEYPPGATEKENNSEAIHGVLPMQSMISNPAEAIPERGFIP....... and more..

I don’t think this is correct ? is there an easy solution ?

I would like to make a non-redundant set of genes for the next steps.

Here my detailed description of my procedure:
Hardware: 2xAMD (64 cores each), 2TB RAM, 21 GB SSD
System Ubuntu 20.04.3 LTS

I named the file containing the transcripts cdna.fa (562763 transcripts) and the file for the genome genome.fa (570 contigs)

I followed the description from the pdf “augustus_wrp.pdf”
The cdna.fa (>seq) as well as the genome.fa (>contig) headers were simplified by
simplifyFastaHeaders.pl

The parameters from the template file were modified: alignAssembly.config

PASA:
:$ Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome.fa -t cdna.fa.clean -T -u cdna.fa --ALIGNERS blat,gmap --CPU 96

Problem: it worked only with --CPU 2 Is there a reason why ?

:$ pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta sample_mydb_pasa.assemblies.fasta --pasa_transcripts_gff3 sample_mydb_pasa.pasa_assemblies.gff3

:$ grep complete sample_mydb_pasa.assemblies.fasta.transdecoder.cds | perl -pe ’s/>(\S+)./$1/’ > complete.orfs
:$ grep -F -f complete.orfs sample_mydb_pasa.assemblies.fasta.transdecoder.genome.gff3 | grep -P "(\tCDS\t|\texon\t)" | perl -pe ‘s/cds.//; s/.exon\d+//;’ > trainingSetComplete.gff3
:$ mv trainingSetComplete.gff3 trainingSetComplete.temp.gff3
:$ cat trainingSetComplete.temp.gff3 | perl -pe ‘s/\t\S(asmbl_\d+).*/\t$1/’ | sort -n -k 4 | sort -s -k 9 | sort -s -k 1,1 > trainingSetComplete.gff3
:$ ln -s trainingSetComplete.gff3 bonafide.gtf
:$ computeFlankingRegion.pl bonafide.gtf
:$ gff2gbSmallDNA.pl bonafide.gtf genome.fa 10000 bonafide.gb
:$ cat bonafide.gb | perl -ne ‘if(m//gene="(\S+)"/){ \print """.$1.""\n";}’ | sort -u > traingenes.lst
:$ grep -f traingenes.lst -F bonafide.gtf > bonafide.f.gtf

Unfortunately, the file bonafide.f.gtf is empty. When I have removed the “ from the traingenes.lst it worked.

:$ ln -s bonafide.gtf bonafide.f.gtf

And finally:

:$ gtf2aa.pl genome.fa bonafide.f.gtf prot.aa

I do not know if the problem is due to PASA or Augustus...
norbert
Posts: 2
Joined: Mon Jan 03, 2022 2:22 pm

Re: Problem with training gene structures

Post by norbert »

I think I'm getting closer. During the run of gtf2aa.pl the message appears:
...
Use of uninitialized value $txid in hash element at /usr/share/augustus/scripts/gtf2aa.pl line 42, <GTF> line 170407.
Use of uninitialized value $txid in hash element at /usr/share/augustus/scripts/gtf2aa.pl line 42, <GTF> line 170409.
Use of uninitialized value $txid in hash element at /usr/share/augustus/scripts/gtf2aa.pl line 42, <GTF> line 170411.
^C
This is from the gtf2aa.pl
...
39 $_ =~ m/transcript_id \"(\S+)\"/;
40 my $txid = $1;
41 my @line = split(/\t/);
42 push @{$gtf{$line[0]}{$txid}}, $_;
...
For testing purposes I have generated a GTF file which contains CDS and transcript_id... without succsess..
Post Reply