# Updating the Zoonomia HAL with one new cat and five new canines (AKA 241-mammalian-2020v2.1.hal)

## The new assemblies:

```
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/350/175/GCA_018350175.1_F.catus_Fca126_mat1.0/GCA_018350175.1_F.catus_Fca126_mat1.0_genomic.fna.gz
http://public.gi.ucsc.edu/~hickey/debug/dog-update/zoonomia/CanFam4.fa.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/725/GCF_003254725.2_ASM325472v2/GCF_003254725.2_ASM325472v2_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/319/855/GCA_905319855.2_mCanLor1.2/GCA_905319855.2_mCanLor1.2_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/146/905/GCA_905146905.1_NYPRO_anot_genome/GCA_905146905.1_NYPRO_anot_genome_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/311/455/GCA_017311455.1_Otocyon_megalotis_TS305_17_09_2019/GCA_017311455.1_Otocyon_megalotis_TS305_17_09_2019_genomic.fna.gz

```

## The input sequences

All assemblies already in the alignment (that need to be realigned) were obtained using

```
hal2fasta 241-mammalian-2020v2.hal <genome name> | bgzip --threads 32 > <genome name>.fa.gz
```

The remaining assemblies are from GenBank.  Before using them, they were masked with RepeatMasker using [repeatMaskerPipeline](https://github.com/ComparativeGenomicsToolkit/repeatMaskerPipeline) commit [b7b48a265dfcca57410f572e03b174ea3b7ca76d](https://github.com/ComparativeGenomicsToolkit/repeatMaskerPipeline/commit/b7b48a265dfcca57410f572e03b174ea3b7ca76d).  Note that this uses Dfam version 3.3. 

The RepeatMasker command line for the canines was

```
repeatMaskerPipeline aws:us-west-2:glennhickey-jobstore-rm ./output https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/725/GCF_003254725.2_ASM325472v2/GCF_003254725.2_ASM325472v2_genomic.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/319/855/GCA_905319855.2_mCanLor1.2/GCA_905319855.2_mCanLor1.2_genomic.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/146/905/GCA_905146905.1_NYPRO_anot_genome/GCA_905146905.1_NYPRO_anot_genome_genomic.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/311/455/GCA_017311455.1_Otocyon_megalotis_TS305_17_09_2019/GCA_017311455.1_Otocyon_megalotis_TS305_17_09_2019_genomic.fna.gz --species "Canis lupus dingo" "Canis lupus orion" "Nyctereutes procyonoides" "Otocyon megalotis" --provisioner aws --batchSystem mesos --maxNodes 25 --defaultPreemptable --nodeStorage 256 --nodeTypes c5.9xlarge:0.85 --realTimeLogging --logFile rm.log

and the cat
```
repeatMaskerPipeline aws:us-west-2:glennhickey-jobstore-rm ./output https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/350/175/GCA_018350175.1_F.catus_Fca126_mat1.0/GCA_018350175.1_F.catus_Fca126_mat1.0_genomic.fna.gz --species "Felis catus" --provisioner aws --batchSystem mesos --maxNodes 25 --defaultPreemptable --nodeStorage 256 --nodeTypes c5.9xlarge:0.85 --realTimeLogging --logFile rm-cat.log
```

These output a fasta file for each genome that has about 40-50% of the genome softmasked, like we expect from UCSC and what was used for the original Zoonomia data.  This masking includes any softmasking that was present in the GenBank assemblies. 

The urls for most of these assemblies below are probably ephemeral, but the sequences always live on in the Zoonomia HAL and GenBank if ever needed again.

## First things first

We make a local copy of 241-mammalian-2020v2.hal.  Any error will corrupt it beyond repair, so we need to be prepared to recopy it and start again (and again). 

## Adding the cat

Using `mash dist` (which works well on closely-related genomes) we get a distance of `0.00334352` between Felis_catus (felCat8) and the new F.catus_Fca126 assembly. So we estimate our new tree as


```
((Felis_catus:0.0017,Felis_catus_fca126:0.0017)fullTreeAnc202a:0.00399,Felis_nigripes:0.0054)fullTreeAnc202:0.00402;
```

from the old tree
```
(Felis_catus:0.00569,Felis_nigripes:0.0054)fullTreeAnc202:0.00402;
```

We'll recompute the subtree up to and including `fullTreeAnc202`:

The new alignment (`add-cat.txt`)
```
(((Felis_catus:0.0017,Felis_catus_fca126:0.0017)fullTreeAnc202a:0.00399,Felis_nigripes:0.0054)fullTreeAnc202:0.00402,Puma_concolor:0.00872)fullTreeAnc203;

Felis_catus http://public.gi.ucsc.edu/~hickey/add-cat/Felis_catus.fa.gz
Felis_catus_fca126 http://public.gi.ucsc.edu/~hickey/add-cat/GCA_018350175.1_F.catus_Fca126_mat1.0_genomic.fna.gz.masked
Felis_nigripes  http://public.gi.ucsc.edu/~hickey/add-cat/Felis_nigripes.fa.gz
Puma_concolor http://public.gi.ucsc.edu/~hickey/add-cat/Puma_concolor.fa.gz
fullTreeAnc203 http://public.gi.ucsc.edu/~hickey/add-cat/fullTreeAnc203.fa.gz
```

Now we make our WDL with Cactus ebd15ad4a4d5ec8853af39f4b7d9e415576fdfac. **Note the --includeRoot option**
```
cactus-prepare --wdl add-cat.txt --noLocalInputs --includeRoot fullTreeAnc203 --preprocessBatchSize 5 \
               --preprocessDisk 375Gi --preprocessCores 32 --preprocessMemory 120Gi \
               --blastDisk 375Gi --blastCores 32 --gpu --gpuCount 8 --blastMemory 120Gi \
               --alignDisk 375Gi --alignCores 64 --alignMemory 416Gi \
               --halAppendDisk 3000Gi  --defaultMemory 104Gi  --docker Imagequay.io/comparative-genomics-toolkit/cactus:v2.2.3  > add-cat.wdl
```

Then run it on Terra.

Now we remove the old subtree and add the new one, replacing everything below fullTreeAnc203:

**Note**: `halRemoveSubtree` is a new tool, and will only be in Cactus releases newer than 2.2.3. 

```
halRemoveSubtree 241-mammalian-2020v2.hal fullTreeAnc203 --inMemory
halAppendSubtree 241-mammalian-2020v2.hal new-cats.hal fullTreeAnc203 fullTreeAnc203 --merge --inMemory --noMarkAncestors
```


## Updating canines

We want to add `canFam4` (zoonomia uses `canFam3` for dog) along with 4 other dog genomes.  Unlike for cat, there's no indication that canFam3 and canFam4 are more closely related than either are to Canius_lupus (which is the misleadingly-named African Village Dog in Zoonomia):
```
canFam3 vs canFam4 : 0.00170624
canFam3 vs Canis_lupus : 0.00149601
canFam4 vs Canis_lupus : 0.00128842
```

So we punt on the question and just add `canFam4` as a sister to both canFam3 and Canis_lupus using canFam3's branch length:

```
(Canis_lupus_VD:0.00106,CanFam4:0.00403888,Canis_lupus_familiaris:0.00403888)fullTreeAnc240:0.00117797;
```

This should give a reasonable alignment.  The rest of the tree comes from Michael Dong (imbim.uu.se).  Note that we add fullTreeAnc221 as a leaf, as we'll need it in order to add back the sister clade. 

`add-dog.txt`
```
((((((Canis_lupus_VD:0.00106,CanFam4:0.00403888,Canis_lupus_familiaris:0.00403888)fullTreeAnc207c:0.00117797,Canis_lupus_dingo:0.00521685)fullTreeAnc207b:0.00117797,Canis_lupus_orion:0.00639482)fullTreeAnc207a:0.00117797,Lycaon_pictus:0.00757278)fullTreeAnc207:0.00353390,((Nyctereutes_procyonoides:0.00947685,Vulpes_lagopus:0.00947685)fullTreeAnc208:0.00111193,Otocyon_megalotis:0.01058877)fullTreeAnc209a:0.00051791)fullTreeAnc209:0.03894904,fullTreeAnc221:0.01)fullTreeAnc222;


Canis_lupus_VD http://public.gi.ucsc.edu/~hickey/debug/dog-update/zoonomia/Canis_lupus.fa.gz 
CanFam4 http://public.gi.ucsc.edu/~hickey/debug/dog-update/zoonomia/CanFam4.fa.gz
Canis_lupus_familiaris http://public.gi.ucsc.edu/~hickey/debug/dog-update/zoonomia/Canis_lupus_familiaris.fa.gz
Lycaon_pictus http://public.gi.ucsc.edu/~hickey/debug/dog-update/zoonomia/Lycaon_pictus.fa.gz
Vulpes_lagopus http://public.gi.ucsc.edu/~hickey/debug/dog-update/zoonomia/Vulpes_lagopus.fa.gz
Canis_lupus_dingo http://public.gi.ucsc.edu/~hickey/debug/dog-update/merge-masked/GCA_003254725.1_ASM325472v1_genomic.fna.gz.fullmask.fa.gz
Canis_lupus_orion http://public.gi.ucsc.edu/~hickey/debug/dog-update/merge-masked/GCA_905319855.2_mCanLor1.2_genomic.fna.gz.fullmask.fa.gz
Nyctereutes_procyonoides http://public.gi.ucsc.edu/~hickey/debug/dog-update/merge-masked/GCA_905146905.1_NYPRO_anot_genome_genomic.fna.gz.fullmask.fa.gz
Otocyon_megalotis http://public.gi.ucsc.edu/~hickey/debug/dog-update/merge-masked/GCA_017311455.1_Otocyon_megalotis_TS305_17_09_2019_genomic.fna.gz.fullmask.fa.gz
fullTreeAnc222 http://public.gi.ucsc.edu/~hickey/debug/dog-update/zoonomia/fullTreeAnc222.fa.gz
fullTreeAnc221 http://public.gi.ucsc.edu/~hickey/debug/dog-update/zoonomia/fullTreeAnc221.fa.gz
```

Now we make our WDL with Cactus v.2.2.3 (once again `--includeRoot` is added to use the exising fullTreeAnc222.
```
cactus-prepare --wdl add-dog.txt --noLocalInputs --includeRoot --preprocessBatchSize 5 \
               --preprocessDisk 375Gi --preprocessCores 32 --preprocessMemory 120Gi \
               --blastDisk 375Gi --blastCores 32 --gpu --gpuCount 8 --blastMemory 120Gi \
               --alignDisk 375Gi --alignCores 64 --alignMemory 416Gi \
               --halAppendDisk 3000Gi  --defaultMemory 104Gi --docker Imagequay.io/comparative-genomics-toolkit/cactus:v2.2.3 > add-dog.wdl
```

We need to remember the `fullTreeAnc221` clade because we're about to zap it.

```
halExtract 241-mammalian-2020v2.hal fullTreeAnc221.hal --root fullTreeAnc221 --inMemory

```

Then we add the dogs

```
halRemoveSubtree 241-mammalian-2020v2.hal fullTreeAnc222 --inMemory
halAppendSubtree 241-mammalian-2020v2.hal new-dogs.hal fullTreeAnc222 fullTreeAnc222 --merge --inMemory --noMarkAncestors
```

Then we add back fullTreeAnc221
```
halAppendSubtree 241-mammalian-2020v2.hal fullTreeAnc221.hal fullTreeAnc221 fullTreeAnc221 --merge --inMemory --noMarkAncestors
```

All the genomes we overwrote are still dangling in `241-mammalian-2020v2.hal`.  They aren't causing harm, but are taking up lots of space (almost 40 genomes worth).  We can get rid of them, but it will be slow.

```
halExtract 241-mammalian-2020v2.hal 241-mammalian-2020v2.1.hal --root $(halStats --root 241-mammalian-2020v2.hal)  --inMemory
```

