# 447-mammalian-2022v1.hal

This file was created by combining the 243-way primates alignment, `243-primates.hal`, from [cite Rashid et al. Identification of conserved sequence elements across 230 primate genomes with deep learning] with the Zoonomia alignment, `241-mammalian-2020v2.1.hal`.

The primates clade in Zoonomia has root `fullTreeAnc110`. 

We begin by making a "bridge" alignment between the two. Ancestral sequences, where specified, were obtained by running `hal2fasta` on the appropriate alignment. `fullTreeAnc112` is the parent of `fullTreeAnc110` in the Zoonomia alignment, and `Galeopterus_variegatus` is `fullTreeAnc112`'s other child.

`replace-primates.txt`:
```
(primates_root:0.105,Galeopterus_variegatus:0.12765)fullTreeAnc112;

primates_root primates_root.fa
fullTreeAnc112 fullTreeAnc112.fa
Galeopterus_variegatus Galeopterus_variegatus.fa

```
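The two Zoonomia FASTA files named in the seqfile can be produced with `hal2fasta`, which prints FASTA to standard output when no output path is given. The following is a minimal sketch, not the exact commands that were run; `extract_genome` is a helper name made up for this example. The name of the root genome inside the primates alignment is not stated here, so `primates_root.fa` is left out.

```shell
# Write one genome from a HAL file to <genome>.fa in the current directory.
# extract_genome is a hypothetical helper, not part of HAL or Cactus.
extract_genome() {
  hal_file=$1
  genome=$2
  hal2fasta "$hal_file" "$genome" > "$genome.fa"
}

# Usage (the two Zoonomia genomes listed in replace-primates.txt):
#   extract_genome 241-mammalian-2020v2.1.hal fullTreeAnc112
#   extract_genome 241-mammalian-2020v2.1.hal Galeopterus_variegatus
```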

Then we run `replace-primates.txt` through Cactus, making sure to use `--includeRoot` to preserve `fullTreeAnc112` exactly as it is in Zoonomia:

```
/home/ubuntu/cactus-bin-v2.2.1/cactus_env/bin/cactus-blast ./js ./replace-primates.txt replace-primates.paf --root fullTreeAnc112 --includeRoot --realTimeLogging --logFile paf.log

/home/ubuntu/cactus-bin-v2.2.1/cactus_env/bin/cactus-align ./js ./replace-primates.txt replace-primates.paf replace-primates.hal --root fullTreeAnc112 --includeRoot --realTimeLogging --logFile hal.log
```

`halRenameGenomes` was then used to rename all the ancestors to the form `PrimatesAncXXX` instead of `AncXXX` (and `PrimatesAnc000` instead of `primates_root`) in both `replace-primates.hal` and `illumina-primates.hal`.
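`halRenameGenomes` takes a two-column, tab-separated file of old and new names (the same layout as the rename file in the Renaming section). As a sketch of how such a file could be generated for the ancestor renaming described above, the helper below reads genome names one per line and emits a mapping only for names of the form `AncXXX` and for `primates_root`. The function name `make_primates_rename_map` is an assumption for illustration; the original rename files may have been written by hand.

```shell
# Read genome names (one per line) on stdin and print "old<TAB>new" for:
#   AncXXX        -> PrimatesAncXXX
#   primates_root -> PrimatesAnc000
# All other names (leaf genomes, Zoonomia ancestors) are left out of the map.
make_primates_rename_map() {
  awk -v OFS='\t' '
    /^Anc[0-9]+$/         { print $0, "Primates" $0; next }
    $0 == "primates_root" { print $0, "PrimatesAnc000" }
  '
}

# Usage, assuming names.txt holds one genome name per line:
#   make_primates_rename_map < names.txt > rename.tsv
#   halRenameGenomes replace-primates.hal rename.tsv
```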

Next, the existing primates clade, its parent, and its sister were removed from the Zoonomia alignment. Note that the following commands need a newer HAL than the one included in Cactus v2.2.1: https://github.com/ComparativeGenomicsToolkit/hal/commit/375b84f51f9c20e0af85945988da7d0b0ff9876c

```
halRemoveSubtree 241-mammalian-2020v2.1.hal fullTreeAnc112 --inMemory
```

We then replace `fullTreeAnc112` and everything below with our new alignment.

```
halAppendSubtree 241-mammalian-2020v2.1.hal replace-primates.hal fullTreeAnc112 fullTreeAnc112 --merge --noMarkAncestors

```

Then we add the new primates alignment in a similar manner:
```
halAppendSubtree  241-mammalian-2020v2.1.hal 243-primates.hal PrimatesAnc000 PrimatesAnc000 --merge --noMarkAncestors
```

Finally, we "defragment" the HAL file to free up the space left by the deleted genomes:
```
halExtract 241-mammalian-2020v2.1.hal 447-mammalian-2022v1.hal --inMemory
```

# 447-mammalian-2022v1.maf.gz

This file was created with this command:
```
cactus-hal2maf ./js ../447-mammalian-2022v1.hal 447-mammalian-2022v1.maf.gz --chunkSize 100000 --batchCores 96 --batchCount 10 --noAncestors --filterGapCausingDupes --batchParallelTaf 32 --batchSystem slurm --maxLocalJobs 800 --refGenome hg38 --logFile 447-mammalian-2022v1.maf.gz.log
```

using Cactus commit a33d3eabb909873746ecd8e7e1528344e526d95b and Toil commit 356261d5ac84913d4c4f897d990c285e2a7e11f6.  Note that the `Homo_sapiens` genome was renamed to `hg38` with `halRenameGenomes` beforehand. 

The index, `447-mammalian-2022v1.maf.gz.tai`, was created with the same Cactus commit using

```
taffy index -i 447-mammalian-2022v1.maf.gz
```

# 447-mammalian-2022v1.bigmaf.bb

(Update: this file is not shared, since it was regenerated by the Browser.)

The BigMaf version was generated with
```
cactus-maf2bigmaf ./js1 ./447-mammalian-2022v1.maf.gz ./447-mammalian-2022v1.bigmaf.bb --refGenome hg38 --chromSizes ./hg38.chroms --logFile ./447-mammalian-2022v1.bigmaf.bb.log --workDir work1
```

using Cactus commit 194e31fa75f243a6ffe2bd1b74578b167bae0adb.

# 447-mammalian-2022v1-single-copy.maf.gz

This file was created using 

```
zcat 447-mammalian-2022v1.maf.gz | mafDuplicateFilter -k -m - | bgzip > 447-mammalian-2022v1-single-copy.maf.gz
```

and, as such, has at most one row per species per block. It was indexed using `taffy` as described above.
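One way to spot-check the one-row-per-species property is to count the blocks in which a species prefix appears on more than one `s` line. The sketch below is not part of the original pipeline. It assumes sequence names have the form `species.contig` (so species names contain no `.`), and `count_multi_copy_blocks` is a name made up for this example.

```shell
# Read a MAF on stdin and print the number of blocks in which some species
# has more than one "s" row. Species = text before the first "." in field 2.
count_multi_copy_blocks() {
  awk '
    function flush(   sp) {
      for (sp in seen) if (seen[sp] > 1) { bad++; break }
      split("", seen)            # portable way to clear the array
    }
    $1 == "a" { flush() }        # a new block starts: score the previous one
    $1 == "s" { split($2, parts, "."); seen[parts[1]]++ }
    END { flush(); print bad + 0 }
  '
}

# Usage:
#   zcat 447-mammalian-2022v1-single-copy.maf.gz | count_multi_copy_blocks
```

A result of `0` is what the single-copy file should give under these assumptions.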

# Renaming

The following naming errors were found, where the left name is what was in the alignment, and the right name is the corrected version:

```
Atele_fusciceps                Ateles_geoffroyi
Cercopithecus_mitis            Cercopithecus_albogularis
Hoolock_hoolock                Hoolock_leuconedys
Hylobates_lar                  Hylobates_pileatus
Mico_spnv (Mico schneideri)    Mico_schneideri
Nomascus_leucogenys            Nomascus_siki
Pygathrix_nemaeus              Pygathrix_nigripes
```

To make matters more complicated, these corrections produce 4 duplicate species names, because the corrected name already exists in the alignment:
```
Ateles_geoffroyi
Hylobates_pileatus
Nomascus_siki
Pygathrix_nigripes
```
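Collisions of this kind can be found mechanically by checking each corrected name against the genome names already in the alignment. The following is a small sketch under stated assumptions: both input files and the function name `find_name_collisions` are hypothetical, and the correction table is taken to be a clean two-column, whitespace-separated file.

```shell
# Print every corrected name that already exists as a genome name.
#   $1: file with one existing genome name per line
#   $2: two-column file: <name in alignment> <corrected name>
find_name_collisions() {
  awk 'NR == FNR { existing[$1] = 1; next }
       ($2 in existing) { print $2 }' "$1" "$2"
}

# Usage:
#   find_name_collisions genomes.txt corrections.txt
```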

The following renames were applied to make these corrections, using `_a` / `_b` suffixes to keep the duplicated species unique.

For the MAFs:
```
zcat 447-mammalian-2022v1.maf.gz.bak | sed -e 's/Ateles_geoffroyi/Ateles_geoffroyi_a/g' -e 's/Hylobates_pileatus/Hylobates_pileatus_a/g' -e 's/Nomascus_siki/Nomascus_siki_a/g' -e 's/Pygathrix_nigripes/Pygathrix_nigripes_a/g' -e 's/Atele_fusciceps/Ateles_geoffroyi_b/g' -e 's/Cercopithecus_mitis/Cercopithecus_albogularis/g' -e 's/Hoolock_hoolock/Hoolock_leuconedys/g' -e 's/Hylobates_lar/Hylobates_pileatus_b/g' -e 's/Mico_spnv/Mico_schneideri/g' -e  's/Nomascus_leucogenys/Nomascus_siki_b/g' -e 's/Pygathrix_nemaeus/Pygathrix_nigripes_b/g' | bgzip --threads 2 > 447-mammalian-2022v1.maf.gz
```
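The order of the `sed` expressions matters: the `_a` renames run first, so that names introduced later by the `_b` renames are not matched again. The sketch below shows this on a single made-up MAF row using only the two `Ateles` rules; the contig name and both function names are invented for the example.

```shell
# Same order as the command above: add _a first, then introduce _b.
rename_in_order() {
  sed -e 's/Ateles_geoffroyi/Ateles_geoffroyi_a/g' \
      -e 's/Atele_fusciceps/Ateles_geoffroyi_b/g'
}

# Reversed order: the freshly written Ateles_geoffroyi_b is matched by the
# second expression and turns into Ateles_geoffroyi_a_b.
rename_reversed() {
  sed -e 's/Atele_fusciceps/Ateles_geoffroyi_b/g' \
      -e 's/Ateles_geoffroyi/Ateles_geoffroyi_a/g'
}

# Usage:
#   printf 's Atele_fusciceps.scaffold1 0 4 + 100 ACGT\n' | rename_in_order
#   printf 's Atele_fusciceps.scaffold1 0 4 + 100 ACGT\n' | rename_reversed
```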

And for the HAL:

```
halRenameGenomes 447-mammalian-2022v1.hal 447-mammalian-rename.tsv
```

where `447-mammalian-rename.tsv` is

```
Ateles_geoffroyi	Ateles_geoffroyi_a
Hylobates_pileatus	Hylobates_pileatus_a
Nomascus_siki	Nomascus_siki_a
Pygathrix_nigripes	Pygathrix_nigripes_a
Atele_fusciceps	Ateles_geoffroyi_b
Cercopithecus_mitis	Cercopithecus_albogularis
Hoolock_hoolock	Hoolock_leuconedys
Hylobates_lar	Hylobates_pileatus_b
Mico_spnv	Mico_schneideri
Nomascus_leucogenys	Nomascus_siki_b
Pygathrix_nemaeus	Pygathrix_nigripes_b
```

# 447-mammalian-2022v1.fix.maf.gz (Update: 12/11/2023)

There's a bug in `447-mammalian-2022v1.maf.gz` and `447-mammalian-2022v1-single-copy.maf.gz` where the length field is wrong in some blocks. The only fix is to re-convert the HAL to MAF with a patched version of the tool, which is what was done here. More details: https://github.com/ComparativeGenomicsToolkit/cactus/issues/1201
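In the MAF format, the size field of an `s` line is the number of non-gap characters in that row's alignment text, so a generic consistency check can be scripted. The sketch below is only such a check and may not catch exactly the cases described in the linked issue; `count_bad_size_rows` is a name made up for this example.

```shell
# Read a MAF on stdin and print the number of "s" rows whose size field
# (field 4) differs from the count of non-gap characters in the sequence
# (field 7).
count_bad_size_rows() {
  awk '
    $1 == "s" {
      seq = $7
      gsub(/-/, "", seq)               # drop alignment gaps
      if (length(seq) != $4) bad++
    }
    END { print bad + 0 }
  '
}

# Usage:
#   zcat 447-mammalian-2022v1.fix.maf.gz | count_bad_size_rows
```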

This file was created from the renamed HAL (see above) with this command, using Cactus v2.6.12:
```
TOIL_SLURM_ARGS="--partition long --time 10000" cactus-hal2maf ./js 447-mammalian-2022v1.hal 447-mammalian-2022v1.fix.maf.gz --chunkSize 100000 --batchCores 96 --batchCount 10 --noAncestors --filterGapCausingDupes --batchParallelTaf 32 --batchSystem slurm --refGenome hg38 --logFile 447-mammalian-2022v1.fix.maf.gz.log
```

# 447-mammalian-2022v1-single-copy.fix.maf.gz (Update: 12/11/2023)

This file was created from the renamed HAL (see above) with this command, using Cactus v2.6.12 (on the server, it was faster to regenerate it than to compute it with `mafDuplicateFilter -km 447-mammalian-2022v1.fix.maf`):

```
TOIL_SLURM_ARGS="--partition long --time 10000" cactus-hal2maf ./js1 447-mammalian-2022v1.hal 447-mammalian-2022v1-single-copy.fix.maf.gz --chunkSize 100000 --batchCores 96 --batchCount 10 --noAncestors --dupeMode single --batchParallelTaf 32 --batchSystem slurm --refGenome hg38 --logFile 447-mammalian-2022v1-single-copy.fix.maf.gz.log
```

