Combining Annotation transfer and RNA-Seq-based prediction

In the typical clade annotation scenario, most genomes are supplemented with RNA-Seq data, whereas a few may represent model organisms for which high-quality annotations exist (e.g. human and mouse in the vertebrate clade). This tutorial describes how RNA-Seq evidence and evidence from existing annotations can be combined in Augustus-cgp.

1. Create a database with RNA-Seq and annotation hints

If you do not yet have a database with the genomes and RNA-Seq hints, follow the instructions in 1. Load RNA-Seq hints ...
to create the database vertebrates_rnaseq.db.

Make a copy of the database
cp vertebrates_rnaseq.db vertebrates_rnaseq+anno.db
and load the annotation hints from exercise 4.1 into the new database
load2sqlitedb --species=hg38 --dbaccess=vertebrates_rnaseq+anno.db refseq/hg38.hints.gff
You can check whether loading was successful with the following database query
sqlite3 -header -column vertebrates_rnaseq+anno.db "SELECT count(*) AS '#hints',typename,speciesname FROM 
(hints as H join featuretypes as F on H.type=F.typeid) natural join speciesnames group by speciesid,typename;"
that returns a summary of how many hints of each type are in the database for each species.
Your database should now contain both the RNA-Seq hints from vertebrates_rnaseq.db and the annotation hints from vertebrates_anno.db:
#hints      typename    speciesname
----------  ----------  -----------
3368        exonpart    galGal4    
129         intron      galGal4    
86          CDS         hg38       
7905        exonpart    hg38       
345         intron      hg38       
7930        exonpart    mm10       
378         intron      mm10       
11050       exonpart    rheMac3    
265         intron      rheMac3 
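
As a quick sanity check, the summary can be scanned for the species and hint-type combinations that this exercise relies on, for example CDS hints for hg38, the one type that appears only for the annotated genome in the table above. The following is a minimal sketch and not part of AUGUSTUS: the helper name has_hint_type is made up for illustration, and the heredoc merely stands in for saved query output.

```shell
#!/usr/bin/env bash
# Sketch: look up a species/hint-type pair in a saved hint summary.
# has_hint_type is a hypothetical helper, not an AUGUSTUS tool.

# Sample summary copied from the table above; in practice, redirect the
# output of the sqlite3 query into this file instead.
summary=$(mktemp)
cat > "$summary" <<'EOF'
#hints      typename    speciesname
----------  ----------  -----------
3368        exonpart    galGal4
129         intron      galGal4
86          CDS         hg38
7905        exonpart    hg38
345         intron      hg38
7930        exonpart    mm10
378         intron      mm10
11050       exonpart    rheMac3
265         intron      rheMac3
EOF

# Print the hint count for species $2 and hint type $3 found in summary
# file $1; return non-zero if that combination is absent.
has_hint_type() {
  awk -v sp="$2" -v ty="$3" \
    '$3 == sp && $2 == ty { print $1; found = 1 } END { exit !found }' "$1"
}

has_hint_type "$summary" hg38 CDS
has_hint_type "$summary" mm10 CDS || echo "mm10 has no CDS hints"
```

With the sample table, the first call prints the hg38 CDS count and the second reports that mm10 has no CDS hints, which matches the setup of this exercise.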

3. Prepare an extrinsic config file

Start by copying the following extrinsic configuration file:
cp ${AUGUSTUS_CONFIG_PATH}/extrinsic/extrinsic-cgp.cfg extrinsic-rnaseq+anno.cfg
Open the extrinsic-rnaseq+anno.cfg file with a text editor, go to the first [GROUP] section and replace the following line
[GROUP] # replace 'none' by the names of genomes with src=W and src=E hints in the database
none
by the names of the genomes with annotation or RNA-Seq hints, i.e.
[GROUP]
hg38 mm10 rheMac3 galGal4
Note that in our case nothing further has to be done, since the only genome with annotation hints, hg38, is already covered by the first [GROUP] section. In other applications, you may have genomes with annotations but no RNA-Seq data. In that case, the names of the genomes that ONLY have annotation hints must be listed in the second [GROUP] section.
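
If you prefer not to edit the file by hand, the replacement in the first [GROUP] section can be scripted. The sketch below is an illustration under stated assumptions: the mock file only mimics the two lines quoted above plus an assumed second [GROUP] section and is not the real extrinsic-cgp.cfg, and set_first_group is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: fill in the first [GROUP] section non-interactively.
# The mock file below is a simplified stand-in, NOT the real
# extrinsic-cgp.cfg; the second [GROUP] section is assumed.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[GROUP] # replace 'none' by the names of genomes with src=W and src=E hints in the database
none
[GROUP] # second section (contents assumed for this mock)
none
EOF

# Print file $1 with the first line that consists solely of "none"
# replaced by the genome names in $2; later "none" lines are kept.
set_first_group() {
  awk -v names="$2" \
    '!replaced && $0 == "none" { print names; replaced = 1; next } { print }' "$1"
}

set_first_group "$cfg" "hg38 mm10 rheMac3 galGal4" > "$cfg.new"
cat "$cfg.new"
```

The helper writes to a new file instead of editing in place, so the result can be compared with the original (for example with diff) before it is used.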


4. Run AUGUSTUS-CGP with RNA-Seq and annotation hints

Create a new folder for the liftover experiments and switch to the new directory
mkdir augCGP_rnaseq+liftover
cd augCGP_rnaseq+liftover
For convenience assign each alignment chunk to a job ID by creating softlinks
num=1
for f in ../mafs/*.maf; do ln -s "$f" "$num.maf"; ((num++)); done
Run Augustus with retrieval of hints from the database (~3min).
for id in *.maf
do
augustus \
--species=human \
--softmasking=1 \
--treefile=../tree.nwk \
--alnfile=$id \
--dbaccess=../vertebrates_rnaseq+anno.db \
--speciesfilenames=../genomes.tbl \
--alternatives-from-evidence=0 \
--dbhints=1 \
--UTR=1 \
--allow_hinted_splicesites=atac \
--extrinsicCfgFile=../extrinsic-rnaseq+anno.cfg \
--/CompPred/outdir=pred${id%.maf} > aug${id%.maf}.out 2> err${id%.maf}.out &
done
This will generate the folders pred*/ (one for each alignment chunk) that contain gff files with gene predictions for each input genome.
bosTau8.cgp.gff
canFam3.cgp.gff
galGal4.cgp.gff
hg38.cgp.gff
mm10.cgp.gff
monDom5.cgp.gff
rheMac3.cgp.gff
rn6.cgp.gff
Note that the parallelization with the bash '&' operator above is deliberately simple and intended for demonstration purposes only.
For real applications with several hundreds or thousands of alignment chunks, we recommend running job arrays on a compute cluster.
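
A per-chunk job script for such an array could look like the sketch below. It assumes a Slurm cluster, where the array index is exported as SLURM_ARRAY_TASK_ID; other schedulers use a different variable. The RUN switch is a made-up convention for this example: by default the script only prints the command, so it can be inspected before anything is submitted. The augustus options are copied unchanged from the loop above.

```shell
#!/usr/bin/env bash
# Sketch of a per-chunk job script, run from augCGP_rnaseq+liftover/.
# Assumption: Slurm exports the array index as SLURM_ARRAY_TASK_ID;
# without it, chunk 1 is used.
id="${SLURM_ARRAY_TASK_ID:-1}"

# Same augustus call as in the loop above, for alignment chunk $id.
set -- augustus \
  --species=human \
  --softmasking=1 \
  --treefile=../tree.nwk \
  "--alnfile=$id.maf" \
  --dbaccess=../vertebrates_rnaseq+anno.db \
  --speciesfilenames=../genomes.tbl \
  --alternatives-from-evidence=0 \
  --dbhints=1 \
  --UTR=1 \
  --allow_hinted_splicesites=atac \
  --extrinsicCfgFile=../extrinsic-rnaseq+anno.cfg \
  "--/CompPred/outdir=pred$id"

# RUN is a hypothetical switch for this sketch: dry run unless RUN=1.
if [ "${RUN:-0}" = 1 ]; then
  "$@" > "aug$id.out" 2> "err$id.out"
else
  printf '%s\n' "$*"
fi
```

On Slurm, a script like this would typically be submitted with sbatch --array=1-N, where N is the number of alignment chunks, and with RUN=1 set in the job environment.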

5. Merge gene predictions from parallel runs

6. Upload gene predictions into the assembly hub

Convert the final gene predictions from gff to BED format and place each BED file in a separate folder with the name of the corresponding genome. It is important that directory names are consistent with the names in the HAL alignment.
for f in joined_pred/*.gff
do
mkdir -p "$(basename "$f" .gff)"
gtf2bed.pl <"$f" >"$(basename "$f" .gff)/augCGP_rnaseq+anno.bed" --itemRgb=255,165,0
done
Specify any RGB color you like for the track with option --itemRgb, e.g. 255,165,0.
The name of the current directory (i.e. augCGP_rnaseq+liftover) will be used as the track name in the browser.
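
Because the directory names must match the genome names in the HAL alignment, it can be worth checking them before rebuilding the hub. The sketch below uses a made-up helper, check_bed_dirs, and takes the genome names as a hand-written list; the mock directories exist only for the demonstration. The directory hg38.cgp illustrates what happens when a file-name suffix such as .cgp ends up in a directory name.

```shell
#!/usr/bin/env bash
# Sketch: report track directories whose names are not genome names.
# check_bed_dirs is a hypothetical helper; $1 is a space-separated list
# of genome names, the remaining arguments are directories to check.
check_bed_dirs() {
  genomes=" $1 "          # padded so that only whole names match
  shift
  status=0
  for d in "$@"; do
    name=$(basename "$d")
    case "$genomes" in
      *" $name "*) ;;     # directory name matches a genome name
      *) echo "no genome named: $name"; status=1 ;;
    esac
  done
  return "$status"
}

# Demo with mock directories instead of real track folders.
tmp=$(mktemp -d)
mkdir "$tmp/hg38" "$tmp/mm10" "$tmp/hg38.cgp"
check_bed_dirs "hg38 mm10 galGal4 rheMac3" "$tmp"/*/ \
  || echo "rename the directories listed above before building the hub"
```

In a real run, the first argument would be the genome names of your HAL file and the remaining arguments the directories created by the loop above.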
Switch back to the main working directory data/
cd ..
and rerun the hal2assemblyHub.py script. Include gene tracks with option --bedDirs
hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs augCGP_rnaseq+liftover \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"
You can also include gene tracks from other exercises by passing a comma-separated list of directories, e.g. --bedDirs refseq,augCGP_denovo,augCGP_rnaseq,augCGP_liftover,...

Repeat 4. Load the hub and browse the alignment.