Creating a Whole-Genome-Alignment with progressiveCactus

This tutorial describes how a whole-genome alignment of multiple genomes can be created with the Cactus aligner.
The output alignment is a compressed file in HAL format that can be exported as MAF.

1. Running progressive Cactus

Prepare a text file with the species tree followed by a space-separated list of species names and location of the corresponding genome fasta files (absolute path names required!).

[+] file format...

tree in Newick format
name_of_genome_1  absolute/path/to/genome_1
name_of_genome_2  absolute/path/to/genome_2
...
name_of_genome_N  absolute/path/to/genome_N

For furter information and optional parameters, read the progressiveCactus manual

cp tree.nwk vertebrates.txt
for f in $PWD/genomes/*.fa; do echo -ne "$(basename $f .fa)\t$f\n"; done >>vertebrates.txt

The file vertebrates.txt will now look like this (except for the parent directory)

((monDom5:0.340786,(((hg38:0.035974,rheMac3:0.043601):0.109934,(mm10:0.084509,rn6:0.091589):0.271974):0.020593,(bosTau8:0.18908,canFam3:0
.13303):0.032898):0.258392):0.181168,galGal4:0.559442);
bosTau8 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/bosTau8.fa
canFam3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/canFam3.fa
galGal4 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/galGal4.fa
hg38    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/hg38.fa
mm10    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/mm10.fa
monDom5 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/monDom5.fa
rheMac3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/rheMac3.fa
rn6     /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/rn6.fa

Run progressiveCactus, e.g. using 8 threads (~12min):

runProgressiveCactus.sh vertebrates.txt cactusout vertebrates.hal --maxThreads=8 2>&1 > cactus.out

This creates a .hal binary indexed alignment file vertebrates.hal

2. Export HAL alignment as MAF

Next, we will export the HAL alignment as MAF. Observe, that the MAF format is reference-based, i.e. alignments can only be represented with respect to some reference species. Generally any species can be chosen as reference. We recommend, however, to choose a reference species that 1.) has a high-quality genome assembly and 2.) is placed somewhere in the middle of the clade, e.g. in our case human (hg38). For parallelization of the gene prediction step, we recommend to split the global alignmnent into several overlapping alignment chunks. This can be achieved with hal2maf_split.pl using the parameters --chunksize that specifies the maximum chunk size (number of bases covered in the reference genome) and --overlap that specifies the amount of overlap (number of bases) between two consecutive chunks. Observe that hal2maf_split.pl works on top of the HAL tools toolbox, which is also part of the progressiveCactus package. If the progressiveCactus/submodules/hal/bin directory in is not in your global python path, use the parameter --hal_exec_dir to point to the directory

Export vertebrates.hal as MAF

hal2maf_split.pl --halfile vertebrates.hal --refGenome hg38 --cpus 8 --chunksize 50000 --overlap 25000 --outdir mafs

This will generate the directory mafs/ that contains the output alignment chunks in MAF format

-rw-r--r-- 1 stefanie bioinf 392K May 24 19:43 chr16.0-49999.maf
-rw-r--r-- 1 stefanie bioinf 307K May 24 19:43 chr16.100000-149999.maf
-rw-r--r-- 1 stefanie bioinf 365K May 24 19:43 chr16.125000-174999.maf
-rw-r--r-- 1 stefanie bioinf 586K May 24 19:43 chr16.150000-199999.maf
-rw-r--r-- 1 stefanie bioinf 395K May 24 19:43 chr16.175000-210154.maf
-rw-r--r-- 1 stefanie bioinf 345K May 24 19:43 chr16.25000-74999.maf
-rw-r--r-- 1 stefanie bioinf 408K May 24 19:43 chr16.50000-99999.maf
-rw-r--r-- 1 stefanie bioinf 418K May 24 19:43 chr16.75000-124999.maf

For demonstration purposes, we chose very small values for the chunk size and overlap. In a more realistic setting, i.e. when working with whole-genome alignments, we recommend a chunk size in the megabase order, e.g. --chunksize=2500000 and --overlap=500000 (default values).

[+] avoid splitting within genes...

Gene prediction errors may occur from splitting of the alignment within genes. The smaller the chunk size is chosen, the more likely genes are truncated at alignment boundaries. To avoid this, a list of genic regions/intervals may be passed to hal2maf_split.pl. In this case, the splitting is restricted to intergenic regions, e.g. position outside of the given intervals. The genic intervals can be obtained from existing annotations for the reference species, if available.

Use the hg38 RefSeq annotation refseq/refseq.hg38.gtf to create a text file with a list of genic intervals, in which the splitting is forbidden. Each genic interval is specified as follows: seqname <tab> start <tab> end <newline>

gtf2gff.pl --printIntron <refseq/refseq.hg38.gtf --out=/dev/stdout | cut -f 1,4,5 >no_split_list.txt

If your annotation is in gtf format as in the example, don't forget to include the intron lines!

Rerun hal2maf_split.pl using the parameter --no_split_list to pass no_split_list.txt

hal2maf_split.pl --halfile vertebrates.hal --refGenome hg38 --cpus 8 \
--chunksize 50000 --overlap 25000 --no_split_list no_split_list.txt --outdir mafs_good_split

This will generate the directory mafs_good_split/ that contains the following MAF alignments

-rw-r--r-- 1 stefanie bioinf  707K May 25 18:17 chr16.0-91311.maf
-rw-r--r-- 1 stefanie bioinf  442K May 25 18:17 chr16.173067-210154.maf
-rw-r--r-- 1 stefanie bioinf  913K May 25 18:17 chr16.45035-173067.maf
-rw-r--r-- 1 stefanie bioinf 1013K May 25 18:17 chr16.91311-207894.maf

Observe that when using the --no_split_list option, the maximum chunk size is not always respected.

[+] splitting MAF alignments...

Let's assume you already have a vetted whole-genome alignment in MAF format. You can split it similarily to HAL alignments into overlapping alignment chunks. Let hg38 be the reference genome in aln.maf.

REF="hg38"

Create a tab-separated list of reference chromosomes/scaffolds of and their lengths

grep "^s $REF" aln.maf | perl -ne '@a=split(/\s+/,$_); $a[1]=~s/.*\.//; print "$a[1]\t$a[5]\n";' | sort -u >seqlist

Cut aln.maf into overlapping pieces with the auxiliary tool filterMaf.pl from the scripts/ folder in the Augustus package.

mkdir mafs

CHUNKSIZE=2500000
OVERLAP=500000
cat seqlist | while read line
do
SEQ=$(echo "$line" | cut -f 1)
CHRLEN=$(echo "$line" | cut -f 2)
for ((start=0; start<CHRLEN; start+=(CHUNKSIZE-OVERLAP))); do 
   end=$((start+CHUNKSIZE))
   if [ $end -ge $CHRLEN ]; then
      end=$CHRLEN
   fi
   filterMaf.pl --interval=${REF}.${SEQ}:$start-$end <aln.maf  >mafs/${SEQ}:$start-$end.maf
done
done

3. Build a comparative assembly hub

hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"

Manually edit the .html files in the vertHub/ directory, if desired.

4. Load the hub and browse the alignment

Copy the vertHub/ directory to a web directory:

scp -r vertHub/ stefanie@greifweb1:/var/www/bioinf/htdocs/stefanie/cgp-tutorial/

Go to the UCSC Genome Browser Track Data Hubs page and paste your hub.txt url into the "My Hubs" tab. Select a species as reference (e.g. hg38) and click "Go".

5. Upload a reference annotation to the hub

Convert the hg38 RefSeq annotation to BED format and place it in a folder with the name of the genome

mkdir refseq/hg38
gtf2bed.pl <refseq/refseq.hg38.gtf >refseq/hg38/refseq.bed --itemRgb=25,25,112

Specify any RGB color you like for the track with option --itemRgb, e.g. 25,25,112.
Rerun the hal2assemblyHub.py script. Include gene tracks with option --bedDirs.

rm -r jobTree # delete job data from previous run
hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs refseq \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"

For further information on the specification of parameter --bedDirs, read section 'Annotations provided by users' in the assemblyHub manual.

[+] hal2assemblyHub troubleshooting...

If the hal2assemblyHub.py script returns immediately without error message. Delete the jobTree folder and rerun it again.

rm -r jobtree

Make sure that all auxiliary tools, e.g hgGcPercent, are in the global path. If not export

export PATH=/home/stefanie/tools/progressiveCactus/bin:/home/stefanie/tools/progressiveCactus/submodules/kentToolBinaries:$PATH
export PYTHONPATH=/home/stefanie/tools/progressiveCactus/submodules:$PYTHONPATH

Repeat 4. Load the hub and browser the alignment.
The new track, should be visible in the 'VertebratesCompHub Track Settings' menu, enable it if necessary.