Precision-recall curves

Evaluation stats from hap.py

giraffe

Type TRUTH.TP TRUTH.FN QUERY.FP METRIC.Precision METRIC.Recall METRIC.F1_Score
INDEL 502129 2372 1540 0.997066 0.995298 0.996181
SNP 3315258 12238 4818 0.998550 0.996322 0.997435

bwamem

Type TRUTH.TP TRUTH.FN QUERY.FP METRIC.Precision METRIC.Recall METRIC.F1_Score
INDEL 501470 3031 1380 0.997367 0.993992 0.995677
SNP 3306945 20551 6042 0.998177 0.993824 0.995996

Evaluation stats from hap.py’s VCF

These metrics are recomputed from the variants marked as TP/FP/FN in the annotated VCF that hap.py produces.

giraffe

type FN FP TP precision recall F1
INDEL 2372 1540 502129 0.9969 0.9953 0.9961
SNP 12238 4818 3315258 0.9985 0.9963 0.9974

bwamem

type FN FP TP precision recall F1
INDEL 3031 1380 501470 0.9973 0.9940 0.9956
SNP 20551 6042 3306945 0.9982 0.9938 0.9960

This looks almost exactly like the metrics computed by hap.py. Good: it means I can filter these annotated variants to get a quick estimate of the performance in different regions.
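The recomputed metrics follow directly from the TP/FP/FN counts. Below is a minimal sketch, assuming the usual definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 as their harmonic mean). The function name `prf` is only illustrative and is not part of hap.py.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts (assumed standard definitions)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts for giraffe INDELs, taken from the table above.
p, r, f1 = prf(tp=502129, fp=1540, fn=2372)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # prints: 0.9969 0.9953 0.9961
```

The INDEL precision differs from hap.py's own value in the fourth decimal. One plausible reason is that hap.py derives precision from query-side counts rather than from TRUTH.TP, but this was not checked here.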

Regions

Segmental duplications

From UCSC genomicSuperDups track.

  • sd: all segmental duplications
  • sd99: segmental duplications with fracMatch > 0.99

GIAB stratifications

From https://github.com/genome-in-a-bottle/genome-stratifications

  • MHC
  • AllTandemRepeatsandHomopolymers_slop5
  • AllTandemRepeats_gt100bp_slop5
  • L1H_gt500
  • alldifficultregions
  • alllowmapandsegdupregions
  • lowmappabilityall

Size of the region sets

region n Mbp
sd 69894 910.757532
sd99 2491 131.811504
MHC 1 4.970558
AllTandemRepeatsandHomopolymers_slop5 4689843 254.038446
AllTandemRepeats_gt100bp_slop5 213368 120.148167
L1H_gt500 1021 3.289135
alldifficultregions 4810858 643.469280
alllowmapandsegdupregions 529793 306.144596
lowmappabilityall 673815 249.550584
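The two columns of the table above (number of intervals and total size in Mbp) could be computed from BED-like intervals roughly as follows. This is only a sketch: the function name `region_set_size` and the `merge` option are hypothetical, and whether overlapping intervals were merged before summing in this report is not stated.

```python
def region_set_size(intervals, merge=False):
    """Return (number of intervals, total size in Mbp).

    intervals: iterable of (chrom, start, end), 0-based and half-open as in BED.
    merge: if True, overlapping or bookended intervals are merged first,
           so that no base is counted twice.
    """
    ivs = sorted(intervals)
    if merge:
        merged = []
        for chrom, start, end in ivs:
            if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
                merged[-1][2] = max(merged[-1][2], end)
            else:
                merged.append([chrom, start, end])
        ivs = merged
    total_bp = sum(end - start for _, start, end in ivs)
    return len(ivs), total_bp / 1e6

# Toy intervals (not real data): two overlapping regions on chr1, one on chr2.
toy = [("chr1", 0, 1_000_000), ("chr1", 500_000, 1_500_000), ("chr2", 0, 250_000)]
print(region_set_size(toy))              # (3, 2.25)
print(region_set_size(toy, merge=True))  # (2, 1.75)
```

Without merging, bases covered by several intervals are counted several times, which matters for sets with many overlapping annotations.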

Overlap variants with regions

Graphs are sorted to highlight the regions with the biggest difference in performance between methods.

Note: this is based on the annotated variants from a single hap.py run on the confident regions. To get slightly more accurate performance estimates (and ROC curves, for example) for a particular region set, we should rerun hap.py on that set. This is just a quicker approach used for exploration.
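The overlap step described above can be sketched as follows: keep the annotated variants that fall inside a region set and count them by TP/FP/FN label. This is a simplified illustration, not the code used for this report. Variants are treated as single positions (an indel spanning a region boundary would need a proper interval overlap), and all function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

def build_index(regions):
    """Merge regions per chromosome and return {chrom: (starts, ends)}."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        merged = []
        for start, end in ivs:
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index

def in_regions(index, chrom, pos):
    """True if the 0-based position lies in a region (half-open intervals)."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last region starting at or before pos
    return i >= 0 and pos < ends[i]

def count_in_regions(variants, index):
    """Count labels of the (chrom, pos, label) variants inside the region set."""
    return Counter(label for chrom, pos, label in variants
                   if in_regions(index, chrom, pos))

# Toy data (not real variants).
regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
variants = [("chr1", 120, "TP"), ("chr1", 299, "FN"), ("chr1", 300, "FP"),
            ("chr2", 10, "TP"), ("chr3", 5, "FP")]
counts = count_in_regions(variants, build_index(regions))
print(dict(counts))  # {'TP': 2, 'FN': 1}
```

The resulting counts per region set can then be turned into precision, recall and F1 for each method.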

Table

F1

FP + FN

TP

TP + FP

Examples

Segmental duplications

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   13.82    5.00  453.00

L1H_gt500

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   5.797   7.000  96.000

MHC

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    8.00   21.34   25.00  332.00

MHC regions with FP/FN with giraffe

CYP2D6/7 genes

Clusters of no-calls

Looking for regions where most variants are FN in the giraffe-HPRC run but not in the bwa-mem run.
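One way to run this kind of search is to bin the truth variants (TP and FN) along the genome and compare the proportion of FN per bin between the two runs. The sketch below uses a fixed grid of bins, which is an assumption: the coordinates in the tables of this section do not fall on a fixed grid, so the binning used for the report was evidently different. The thresholds `min_tot` and `min_prop` are also hypothetical.

```python
from collections import defaultdict

def fn_proportion_by_bin(variants, bin_size=10_000):
    """Map (chrom, bin index) -> (prop_fn, n_fn, n_tot) over truth variants."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [n_fn, n_tot]
    for chrom, pos, label in variants:
        if label in ("TP", "FN"):  # FP are not truth variants, so they are skipped
            c = counts[(chrom, pos // bin_size)]
            c[0] += label == "FN"
            c[1] += 1
    return {k: (n_fn / n_tot, n_fn, n_tot) for k, (n_fn, n_tot) in counts.items()}

def candidate_bins(giraffe, bwamem, bin_size=10_000, min_tot=10, min_prop=0.9):
    """Bins where most truth variants are FN for giraffe, with the bwa-mem proportion."""
    g = fn_proportion_by_bin(giraffe, bin_size)
    b = fn_proportion_by_bin(bwamem, bin_size)
    rows = []
    for (chrom, i), (prop, n_fn, n_tot) in g.items():
        if n_tot >= min_tot and prop >= min_prop:
            # NaN when bwa-mem has no truth variant in that bin
            prop_bwa = b.get((chrom, i), (float("nan"),))[0]
            rows.append((chrom, i * bin_size, (i + 1) * bin_size,
                         prop, n_fn, n_tot, prop_bwa))
    return sorted(rows, key=lambda r: -r[3])

# Toy data (not real variants): one bin where giraffe misses everything.
giraffe = ([("chr1", 10_000 + i, "FN") for i in range(12)]
           + [("chr1", i, "TP") for i in range(10)] + [("chr1", 50, "FN")])
bwamem = [("chr1", 10_000 + i, "TP") for i in range(12)]
rows = candidate_bins(giraffe, bwamem)
print(rows)  # [('chr1', 10000, 20000, 1.0, 12, 12, 0.0)]
```

Changing `bin_size` to 100,000 gives the coarser version of the same search.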

10 kbp regions

coord prop.fn n.fn n.tot prop.fn.bwamem
chr1:161757353-161767351 1.0000000 15 15 0.0000000
chr1:161767352-161777351 1.0000000 23 23 0.0000000
chr6:32645857-32655855 1.0000000 109 109 0.0180180
chr7:38353860-38363858 1.0000000 11 11 0.0000000
chr9:64426127-64436125 1.0000000 12 12 0.0833333
chr9:64466125-64476123 1.0000000 12 12 0.0666667
chr15:84392425-84402422 1.0000000 67 67 0.0000000
chr16:22189304-22199302 1.0000000 12 12 0.0000000
chr16:28498701-28508699 1.0000000 12 12 0.0000000
chr16:22119311-22129309 0.9285714 13 14 0.0000000
chr17:36210775-36220774 0.9285714 13 14 0.0000000
chr3:195911092-195921091 0.9230769 12 13 0.0000000
chr16:22229300-22239299 0.9230769 12 13 0.0000000
chr12:9448483-9458481 0.9090909 20 22 0.0000000
chr16:22159307-22169305 0.9090909 10 11 0.0000000
chr9:67046008-67056006 0.9047619 19 21 0.0476190
chr16:28518699-28528697 0.9047619 19 21 0.0000000

100 kbp regions

Same analysis with larger bins.

coord prop.fn n.fn n.tot prop.fn.bwamem
chr9:64427487-64527452 1.0000000 97 97 0.1825397
chr9:64827353-64927318 1.0000000 45 45 0.1886792
chr16:21826922-21926900 1.0000000 31 31 0.3589744
chr15:84334575-84434550 0.9758065 121 124 0.1653543
chr9:66326851-66426816 0.9545455 42 44 0.1666667
chr16:21327025-21427004 0.9000000 36 40 0.2000000
chr9:67326516-67426481 0.8923077 58 65 0.4492754
chr9:67126583-67226548 0.8391608 120 143 0.2402597
chr16:21626963-21726942 0.8367347 41 49 0.0000000
chr8:12290888-12390866 0.8317757 89 107 0.1538462
chr16:22126860-22226839 0.8317757 89 107 0.0000000
chr16:21726943-21826921 0.8125000 13 16 0.0909091
chr15:22449292-22549268 0.8076923 21 26 0.2750000