# 8-way primary assembly alignment for T2T Apes

## Guide Tree

Tree (mashtree v1.4.6):

```
mashtree --numcpus 12 *.fasta > 8-t2t-apes-2023v2.dnd
```
Reroot the tree with siamang as the outgroup.
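
One way to confirm the rerooting took effect is to list the direct children of the root and check that siamang (`GCA_028878055.2` in the accession list in this README) is one of them. The sketch below is only an illustration: `root_children` is a hypothetical helper, not part of mashtree or cactus. It assumes a single Newick tree ending in `;`, with no parentheses or commas inside labels.

```shell
# Print the direct children of the root of a Newick tree, one per line.
# Reads the tree from stdin (hypothetical helper, not a mashtree/cactus tool).
root_children() {
  tr -d '\n' | awk '{
    s = $0
    sub(/;[[:space:]]*$/, "", s)     # drop the trailing semicolon
    sub(/\)[^()]*$/, "", s)          # drop the root close-paren and any root label
    sub(/^[[:space:]]*\(/, "", s)    # drop the root open-paren
    depth = 0; start = 1
    n = length(s)
    for (i = 1; i <= n; i++) {
      c = substr(s, i, 1)
      if (c == "(") depth++
      else if (c == ")") depth--
      else if (c == "," && depth == 0) {
        # a comma at depth 0 separates two children of the root
        print substr(s, start, i - start)
        start = i + 1
      }
    }
    print substr(s, start)
  }'
}

# Example use on the tree stored on the first line of the seqfile:
#   head -n 1 ./8-t2t-apes-2023v2.seqfile | root_children | grep '^GCA_028878055\.2:'
```

If the `grep` prints a line, the outgroup hangs directly off the root.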

## Alignment

Alignment (cactus v2.7.1):

Make a SLURM config with larger lastz chunks and a smaller dechunk batch size:

```
cp /private/groups/cgl/cactus/cactus-bin-v2.7.1/src/cactus/cactus_progressive_config.xml ./config-slurm.xml
sed -i config-slurm.xml -e 's/blast chunkSize="30000000"/blast chunkSize="90000000"/g'
sed -i config-slurm.xml -e 's/dechunkBatchSize="1000"/dechunkBatchSize="200"/g'
```
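
Because `sed` exits successfully even when a pattern matches nothing, it can be worth confirming that both substitutions landed. The following is a minimal sketch; `require_setting` is a hypothetical helper name, and the check only looks for the literal attribute text, not for valid XML.

```shell
# Fail loudly if an expected attribute string is missing from a file.
# Usage: require_setting FILE 'attr="value"'   (hypothetical helper)
require_setting() {
  if grep -q -F -- "$2" "$1"; then
    echo "ok: $2"
  else
    echo "MISSING: $2 in $1" >&2
    return 1
  fi
}

# Only runs when the edited config is present in the current directory.
if [ -f ./config-slurm.xml ]; then
  require_setting ./config-slurm.xml 'blast chunkSize="90000000"'
  require_setting ./config-slurm.xml 'dechunkBatchSize="200"'
fi
```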
Make the alignment using this input, saved as `./8-t2t-apes-2023v2.seqfile`:
```
(((GCA_028885655.2:0.0017500000000000016,GCA_028885625.2:0.001729999999999999):0.014950000000000001,(GCA_029281585.2:0.00877,((GCA_029289425.2:0.0019999999999999983,GCA_028858775.2:0.0022900000000000004):0.0043300000000000005,(hs1:5.000000000000004E-4,hg38:5.000000000000004E-4):0.005989999999999999):0.0014300000000000007):0.0073599999999999985):0.011345,GCA_028878055.2:0.011345000000000003);

hs1                  https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz
hg38                 https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.fa.gz
GCA_029281585.2 https://hgdownload.soe.ucsc.edu/hubs/GCA/029/281/585/GCA_029281585.2/GCA_029281585.2.fa.gz
GCA_028858775.2 https://hgdownload.soe.ucsc.edu/hubs/GCA/028/858/775/GCA_028858775.2/GCA_028858775.2.fa.gz
GCA_029289425.2 https://hgdownload.soe.ucsc.edu/hubs/GCA/029/289/425/GCA_029289425.2/GCA_029289425.2.fa.gz
GCA_028885655.2 https://hgdownload.soe.ucsc.edu/hubs/GCA/028/885/655/GCA_028885655.2/GCA_028885655.2.fa.gz
GCA_028885625.2 https://hgdownload.soe.ucsc.edu/hubs/GCA/028/885/625/GCA_028885625.2/GCA_028885625.2.fa.gz
GCA_028878055.2 https://hgdownload.soe.ucsc.edu/hubs/GCA/028/878/055/GCA_028878055.2/GCA_028878055.2.fa.gz

```
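
Before launching a long run, it may help to check that every leaf in the tree has a matching name/path line and vice versa. This is a rough sketch under a few assumptions: the tree is on the first line of the seqfile, internal nodes are unlabeled, and the helper names (`seqfile_tree_leaves`, `seqfile_path_names`, `seqfile_check`) are made up for illustration.

```shell
# List leaf names from the Newick tree on line 1 of a seqfile.
# Assumes internal nodes carry no labels (only ":length").
seqfile_tree_leaves() {
  head -n 1 "$1" | tr '(),;' '\n\n\n\n' \
    | sed -e 's/:.*$//' -e '/^[[:space:]]*$/d' | sort
}

# List genome names from the name/path lines (everything after line 1).
seqfile_path_names() {
  tail -n +2 "$1" | awk 'NF >= 2 { print $1 }' | sort
}

# Print names found in only one of the two lists; no output means consistent.
# Names missing a path are unindented, names missing from the tree are tab-indented.
seqfile_check() {
  leaves="$(mktemp)"; names="$(mktemp)"
  seqfile_tree_leaves "$1" > "$leaves"
  seqfile_path_names "$1" > "$names"
  comm -3 "$leaves" "$names"
  rm -f "$leaves" "$names"
}

# Example: seqfile_check ./8-t2t-apes-2023v2.seqfile
```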
The accessions are:
```
GCA_029281585.2 Gorilla_gorilla
GCA_028858775.2 Pan_troglodytes
GCA_029289425.2 Pan_paniscus
GCA_028885655.2 Pongo_abelii
GCA_028885625.2 Pongo_pygmaeus
GCA_028878055.2 Symphalangus_syndactylus
```
Cactus command

```
TOIL_SLURM_ARGS="--partition=long --time=8000" cactus ./js-8apes ./8-t2t-apes-2023v2.seqfile ./8-t2t-apes-2023v2.hal --batchSystem slurm --caching false --consCores 64 --configFile ./config-slurm.xml --logFile 8-t2t-apes-2023v2.hal.log --batchLogsDir batch-logs-8apes --coordinationDir /data/tmp
```

## MAF Export

Export one MAF per primary assembly, using that assembly as the reference:

```
for i in hs1 hg38 GCA_028858775.2 GCA_028885655.2 GCA_028885625.2 GCA_028878055.2 GCA_029281585.2 GCA_029289425.2; do TOIL_SLURM_ARGS="--partition=long --time=8000" cactus-hal2maf ./js_hal2maf8 ./8-t2t-apes-2023v2.hal ./8-t2t-apes-2023v2.${i}.maf.gz --filterGapCausingDupes --refGenome $i --chunkSize 500000 --batchCores 64 --noAncestors --batchCount 16  --batchSystem slurm --caching false --logFile ./8-t2t-apes-2023v2.${i}.gz.log --batchLogsDir batch-logs-8apes --coordinationDir /data/tmp ;done
```
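
Since the loop above writes eight separate files, a quick integrity pass afterwards can catch a job that died partway through. The sketch below assumes the output naming used in this README and that the files are gzip-compatible; `check_mafs` is a hypothetical helper, and `gzip -t` only verifies the compression stream, not the MAF content.

```shell
# Report whether each expected MAF exists, is non-empty, and decompresses cleanly.
# Usage: check_mafs GENOME [GENOME ...]   (hypothetical helper)
check_mafs() {
  status=0
  for i in "$@"; do
    f="./8-t2t-apes-2023v2.${i}.maf.gz"
    if [ -s "$f" ] && gzip -t "$f" 2>/dev/null; then
      echo "ok   $f"
    else
      echo "BAD  $f"
      status=1
    fi
  done
  return "$status"
}

# Example:
#   check_mafs hs1 hg38 GCA_028858775.2 GCA_028885655.2 GCA_028885625.2 \
#              GCA_028878055.2 GCA_029281585.2 GCA_029289425.2
```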

Then make a bigMaf from each MAF. Note: this step was run with cactus commit a8bd77e65d7f7c26fd7a6d69a110d1fe23b275c9; to reproduce it, use v2.7.2.

```
for i in hs1 hg38 GCA_028858775.2 GCA_028885655.2 GCA_028885625.2 GCA_028878055.2 GCA_029281585.2 GCA_029289425.2; do TOIL_SLURM_ARGS="--partition=long --time=8000" cactus-maf2bigmaf ./js-bb ./8-t2t-apes-2023v2.${i}.maf.gz ./8-t2t-apes-2023v2.${i}.bigmaf.bb --refGenome $i --halFile ./8-t2t-apes-2023v2.hal --logFile ./8-t2t-apes-2023v2.${i}.bigmaf.bb.log --batchLogsDir batch-logs-8apes --coordinationDir /data/tmp --batchSystem slurm; done
```

## Coverage

Compute some pairwise coverage and identity statistics from each MAF
(uses taffy commit 5d5c019dee562ea8c62c8ab3ce74491734fe5a23)

```
for i in hs1 hg38 GCA_028858775.2 GCA_028885655.2 GCA_028885625.2 GCA_028878055.2 GCA_029281585.2 GCA_029289425.2; do taffy view -i ./8-t2t-apes-2023v2.${i}.maf.gz | taffy coverage -g "$(halStats --genomes ./8-t2t-apes-2023v2.hal)" > ./8-t2t-apes-2023v2.${i}.maf.coverage.tsv ; done
```

## Chains export

All-to-all chains were created with `cactus-hal2chains` (commit 723b7899abe450d6e0872d87fcd6852dc155cb76)

```
TOIL_SLURM_ARGS="--partition=long --time=8000" cactus-hal2chains ./js_chains8 ./8-t2t-apes-2023v2.hal ./8-t2t-apes-2023v2-chains  --inMemory --bigChain  --caching false --logFile ./8-t2t-apes-2023v2-chains.log --batchLogsDir batch-logs --batchSystem slurm --coordinationDir /data/tmp 
```

## Naming

Both the HAL and MAF files use NCBI accessions for genome names, which are hard to read. To change the HAL genome names to human-readable ones, use the following command (runs in about 1 second):

```
halRenameGenomes 8-t2t-apes-2023v2.hal rename-hal.tsv
```

Renaming the MAF file(s) takes a little longer, but can be done with:

```
zcat 8-t2t-apes-2023v2.hs1.maf.gz | sed -e 's/GCA_028885655\.2/mPonAbe1_pri/g' \
-e 's/GCA_029281585\.2/mGorGor1_pri/g' \
-e 's/GCA_029289425\.2/mPanPan1_pri/g' \
-e 's/GCA_028885625\.2/mPonPyg2_pri/g' \
-e 's/GCA_028858775\.2/mPanTro3_pri/g' \
-e 's/GCA_028878055\.2/mSymSyn1_pri/g' | bgzip --threads 2 > 8-t2t-apes-2023v2.hs1.renamed.maf.gz
```
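
To keep the MAF and HAL renames in sync, the `sed` expressions could be generated from a two-column, tab-separated rename table instead of being typed by hand. The sketch below is one possible approach, not how the files here were produced; `tsv_to_sed` is a hypothetical helper, and it assumes names contain no `/` or other characters special to `sed` apart from dots, which it escapes.

```shell
# Turn "old<TAB>new" lines into sed substitution commands, one per line.
# Dots in the old name are escaped so they match literally.
tsv_to_sed() {
  while IFS="$(printf '\t')" read -r old new; do
    # skip blank or incomplete lines
    [ -n "$old" ] && [ -n "$new" ] || continue
    old_re="$(printf '%s' "$old" | sed 's/\./\\./g')"
    printf 's/%s/%s/g\n' "$old_re" "$new"
  done < "$1"
}

# Example (file names are placeholders):
#   tsv_to_sed rename-maf.tsv > rename-maf.sed
#   zcat in.maf.gz | sed -f rename-maf.sed | bgzip --threads 2 > out.renamed.maf.gz
```

Short names such as `hs1` would also match inside longer strings, so a table used this way is safest when limited to the accession-style names.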

A renamed copy of the HAL was made as a stopgap for the browser (hopefully temporary). The `Renaming ...` lines below are the output of `halRenameGenomes`:

```
cp 8-t2t-apes-2023v2.hal 8-t2t-apes-2023v2.rename.hal
halRenameGenomes 8-t2t-apes-2023v2.rename.hal 8-t2t-apes-2023v2.rename.tsv
Renaming GCA_028858775.2 to mPanTro3_v2.0
Renaming GCA_028878055.2 to mSymSyn1_v2.0
Renaming GCA_028885625.2 to mPonPyg2_v2.0
Renaming GCA_028885655.2 to mPonAbe1_v2.0
Renaming GCA_029281585.2 to mGorGor1_v2.0
Renaming GCA_029289425.2 to mPanPan1_v2.0
Renaming hs1 to T2T-CHM13v2.0
```
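
The rename table itself is not shown above, but its contents can be read off the `Renaming ...` lines. The sketch below writes an equivalent table, assuming the format `halRenameGenomes` expects is one tab-separated pair per line (existing name, then new name). `write_rename_tsv` is a hypothetical helper, not part of HAL.

```shell
# Write a two-column, tab-separated rename table (old name, new name).
# Pairs are taken from the "Renaming ..." output shown in this README.
# Usage: write_rename_tsv OUTPUT_FILE
write_rename_tsv() {
  # printf reuses the format string for each successive pair of arguments
  printf '%s\t%s\n' \
    GCA_028858775.2 mPanTro3_v2.0 \
    GCA_028878055.2 mSymSyn1_v2.0 \
    GCA_028885625.2 mPonPyg2_v2.0 \
    GCA_028885655.2 mPonAbe1_v2.0 \
    GCA_029281585.2 mGorGor1_v2.0 \
    GCA_029289425.2 mPanPan1_v2.0 \
    hs1 T2T-CHM13v2.0 > "$1"
}

# Example: write_rename_tsv 8-t2t-apes-2023v2.rename.tsv
```

`hg38` is left out because it does not appear in the rename output.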

