# 7-way T2T apes primary alignment

Earlier versions of these assemblies aligned without any issue, but this dataset does not run to completion with the current version of Cactus (v2.7.1). The cause seems to be an unmasked repeat region in gorilla that produces a giant collapse in the pairwise alignments between gorilla and CHM13. This leads to the paffy processing running out of memory or, if that's hacked around, to cactus bar crashing (too many rows for abpoa?).

The following is a work-around to derive the 7-way tree from the 8-way tree by re-aligning everything below Anc4, but keeping Anc4 the same as what it was in the 8-way.  Anc4 is the ancestor of human, chimp and bonobo.

# Recompute Anc5

First, extract relevant sequences:

```
for g in hs1 GCA_029289425.2 GCA_028858775.2 Anc4 ; do echo $g ; done | parallel -j4 "hal2fasta 8-t2t-apes-2023v2.hal {} | bgzip > {}.fa.gz"
```
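Because the exit status of a pipe is that of its last command, a failed `hal2fasta` can still leave a valid but empty `.fa.gz` behind. It may be worth counting the records in each output before going further. This is a minimal sketch, assuming `gzip` is installed (bgzip output is gzip-compatible); `count_fasta_records` is a hypothetical helper name, not part of HAL or Cactus.

```shell
# Hypothetical helper: print the number of FASTA records (header lines)
# in a gzip- or bgzip-compressed FASTA file.
count_fasta_records() {
    gzip -cd "$1" | grep -c '^>'
}

# Report the record count for each extracted genome.
for g in hs1 GCA_029289425.2 GCA_028858775.2 Anc4; do
    if [ -s "${g}.fa.gz" ]; then
        printf '%s\t%s\n' "$g" "$(count_fasta_records "${g}.fa.gz")"
    else
        printf '%s\tMISSING OR EMPTY\n' "$g" >&2
    fi
done
```

A count of 0 for any genome suggests the extraction should be rerun before building the seqfile.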

Now make an Anc5 (bonobo-chimp) subproblem with human and Anc4 as an outgroup.  Note we manually add Anc4 as a leaf so it can be used as an outgroup.

```
printf "((GCA_029289425.2:0.002009999999999998,GCA_028858775.2:0.0022900000000000004)Anc5:0.00435,hs1:0.006489999999999999,Anc4:0.0000001)root;\n\n" > 7-t2t-apes-2023v2.Anc5.seqfile

for g in hs1 GCA_029289425.2 GCA_028858775.2 Anc4; do printf "${g}\t${g}.fa.gz\n" >> 7-t2t-apes-2023v2.Anc5.seqfile ; done
```

Next, copy the default config, raising the blast `chunkSize` and lowering `dechunkBatchSize`:

```
cp /private/groups/cgl/cactus/cactus-bin-v2.7.1/src/cactus/cactus_progressive_config.xml ./config-slurm.xml
sed -i config-slurm.xml -e 's/blast chunkSize="30000000"/blast chunkSize="90000000"/g'
sed -i config-slurm.xml -e 's/dechunkBatchSize="1000"/dechunkBatchSize="200"/g'
```
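`sed` exits successfully even when its pattern matches nothing, so if the defaults in another Cactus release differ from `30000000` and `1000`, the config would be left unchanged without any warning. A quick check could look like the sketch below; `require_in_file` is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: succeed only if the fixed string $2 occurs in file $1.
require_in_file() {
    if grep -qF -- "$2" "$1"; then
        echo "ok: $2"
    else
        echo "missing in $1: $2" >&2
        return 1
    fi
}

# Confirm both substitutions actually landed in the edited config.
if [ -f config-slurm.xml ]; then
    require_in_file config-slurm.xml 'blast chunkSize="90000000"'
    require_in_file config-slurm.xml 'dechunkBatchSize="200"'
fi
```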

We use

- `--skipPreprocessor`, since our input is coming out of a HAL
- `--root Anc5`, to only compute the subtree we're interested in

and run the alignment with cactus-v2.7.1

```
TOIL_SLURM_ARGS="--partition=long --time=8000" cactus ./js-7apes ./7-t2t-apes-2023v2.Anc5.seqfile ./7-t2t-apes-2023v2.Anc5.hal --batchSystem slurm --caching false --consCores 64 --configFile ./config-slurm.xml --logFile 7-t2t-apes-2023v2.Anc5.hal.log --batchLogsDir batch-logs-8apes --coordinationDir /data/tmp --skipPreprocessor --root Anc5
```

# Recompute the Anc4,Anc5,hs1 alignment (but leaving Anc4 fixed)

```
hal2fasta ./7-t2t-apes-2023v2.Anc5.hal Anc5 | bgzip > Anc5.fa.gz
```

```
printf "(Anc5:0.00435,hs1:0.006489999999999999)Anc4;\n\n" > 7-t2t-apes-2023v2.Anc4.seqfile

for g in hs1 Anc4 Anc5; do printf "${g}\t${g}.fa.gz\n" >> 7-t2t-apes-2023v2.Anc4.seqfile ; done
```
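Both seqfiles written above share one layout: a Newick tree on the first line, a blank line, then one tab-separated name and path per genome. Before submitting a long Slurm job it may be worth confirming that every listed path exists and is non-empty. The following is a minimal sketch; `check_seqfile` is a hypothetical helper, and it assumes local paths without whitespace, resolved relative to the current directory.

```shell
# Hypothetical helper: fail if any path listed in a seqfile is missing or empty.
check_seqfile() {
    # Skip the Newick line, keep rows with at least two tab-separated fields,
    # and collect every path that does not point at a non-empty file.
    missing=$(tail -n +2 "$1" | awk -F'\t' 'NF >= 2 { print $2 }' | while read -r p; do
        [ -s "$p" ] || echo "$p"
    done)
    if [ -n "$missing" ]; then
        echo "missing or empty inputs in $1:" >&2
        echo "$missing" >&2
        return 1
    fi
}

# Check whichever of the two seqfiles exist in the working directory.
for f in 7-t2t-apes-2023v2.Anc5.seqfile 7-t2t-apes-2023v2.Anc4.seqfile; do
    if [ -f "$f" ]; then
        check_seqfile "$f" || echo "fix $f before running cactus" >&2
    fi
done
```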

```
TOIL_SLURM_ARGS="--partition=long --time=8000" cactus ./js-7apes ./7-t2t-apes-2023v2.Anc4.seqfile ./7-t2t-apes-2023v2.Anc4.hal --batchSystem slurm --caching false --consCores 64 --configFile ./config-slurm.xml --logFile 7-t2t-apes-2023v2.Anc4.hal.log --batchLogsDir batch-logs-8apes --coordinationDir /data/tmp --skipPreprocessor 
```

# Put together the new tree

First, remove everything below Anc4 from the 8-way:

```
cp 8-t2t-apes-2023v2.hal 7-t2t-apes-2023v2.hal
halRemoveSubtree 7-t2t-apes-2023v2.hal Anc4
```

Then add in the Anc4 subtree:
```
halAppendSubtree 7-t2t-apes-2023v2.hal 7-t2t-apes-2023v2.Anc4.hal Anc4 Anc4 --merge --inMemory
```

Then add in the Anc5 subtree:
```
halAppendSubtree 7-t2t-apes-2023v2.hal 7-t2t-apes-2023v2.Anc5.hal Anc5 Anc5 --merge --inMemory
```

Then defragment the tree. This doesn't affect the data, but it makes the file smaller by removing unlinked genomes:

```
halExtract 7-t2t-apes-2023v2.hal 7-t2t-apes-2023v2.defrag.hal --root Anc0 --inMemory
mv 7-t2t-apes-2023v2.defrag.hal 7-t2t-apes-2023v2.hal
```
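As a final sanity check, the leaf set of the merged file can be compared against what is expected. `halStats --tree` prints the tree in Newick format, and the sketch below pulls the leaf names out of such a string. `newick_leaves` is a hypothetical helper; it assumes names contain none of `(),:;` and that the Newick string has no embedded whitespace.

```shell
# Hypothetical helper: read a Newick string on stdin and print its leaf names.
newick_leaves() {
    # A leaf name directly follows "(" or ","; internal node names follow ")".
    grep -oE '[(,][^(),:;]+' | tr -d '(,' | sort
}

# Example on the Anc5 subproblem tree from earlier (branch lengths abbreviated).
echo '((GCA_029289425.2:0.002,GCA_028858775.2:0.002)Anc5:0.004,hs1:0.006,Anc4:0.0000001)root;' | newick_leaves
```

On the real file this would be run as `halStats --tree 7-t2t-apes-2023v2.hal | newick_leaves`, and the output compared by eye against the seven expected genomes.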

