Tier 1

Chr. aberrations

If indexcov was run: coverage estimates across the chromosomes to identify chromosomal aberrations

Note that some chromosome arms are not well assembled, so low coverage is expected in 13p, 14p, 15p and 22p. The same applies to the last chunk of chrYq and the beginning of 21p.

Coding

Variants overlapping coding sequence of protein-coding genes, including genes of interest. Gene-centric table: one row per affected gene, which means a variant can contribute multiple rows in the table.

Recommended: filter out large SVs using column large.

Tier 2

UTR/promoter

Variants overlapping UTR or promoter ([TSS-2000bp,TSS+200bp]) of protein-coding genes but not coding. Gene-centric table: one row per affected gene, which means a variant can contribute multiple rows in the table.

Recommended: filter out large SVs using column large.
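
The promoter window defined above, [TSS-2000bp,TSS+200bp], can be sketched as follows. This is only an illustration: the function names are hypothetical, and the strand handling (upstream measured against the direction of transcription) is an assumption, not something the report states.

```python
def promoter_window(tss, strand, upstream=2000, downstream=200):
    """Return the (start, end) promoter interval around a TSS.

    Assumption: "upstream" follows the direction of transcription, so the
    window is mirrored for genes on the minus strand.
    """
    if strand == "+":
        return (max(tss - upstream, 1), tss + downstream)
    if strand == "-":
        return (max(tss - downstream, 1), tss + upstream)
    raise ValueError("strand must be '+' or '-'")


def overlaps(start_a, end_a, start_b, end_b):
    """True if two closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


# A deletion at 8,500-8,600 against a plus-strand gene with TSS at 10,000
prom_start, prom_end = promoter_window(10000, "+")
hits_promoter = overlaps(8500, 8600, prom_start, prom_end)
```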

Intronic

Variants overlapping intronic regions but not coding, UTR or promoter regions of protein-coding genes. We filtered out intronic variants that are not within genes of interest AND don't overlap a regulatory/conserved region AND are located further than 1 kbp from coding sequences. Gene-centric table: one row per affected gene, which means a variant can contribute multiple rows in the table.

Recommended: filter out large SVs using column large.
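
The intronic filter described above removes a variant only when all three conditions hold, which means a variant is kept as soon as any one of them fails. A minimal sketch of that logic, with a hypothetical function name and arguments:

```python
def keep_intronic(in_gene_of_interest, overlaps_reg_or_cons, cds_dist,
                  max_cds_dist=1000):
    """Decide whether an intronic variant stays in the table.

    Removed only if: not in a gene of interest AND no regulatory/conserved
    overlap AND further than max_cds_dist (bp) from coding sequence.
    """
    removed = (not in_gene_of_interest
               and not overlaps_reg_or_cons
               and cds_dist > max_cds_dist)
    return not removed
```

For example, a deep intronic variant 5 kbp from the nearest coding sequence is still kept if it overlaps a conserved element.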

Tier 3

Rare of interest

All variants that don't overlap genes of interest but are close (<100kbp).

Recommended: filter out large SVs using column large.

Conserved/regulated

Intergenic variants overlapping conserved regions or known regulatory regions.

Recommended: filter out large SVs using column large.

Large

Large and rare variants tend to have a higher biological impact. This table shows rare variants larger than 1kbp. Interesting profiles to look for:

  • Large deletions spanning a CTCF binding region (could lead to TAD reorganization and ectopic gene expression).
  • Overlap with a pathogenic CNV in the clinical SV list. The clinsv column shows variants with 50% reciprocal overlap with a pathogenic CNV.
  • Large SV affecting multiple protein-coding genes.

Methods

SV calling

The structural variants were called from the nanopore reads using Sniffles. indexcov provides quick estimates of read coverage to confirm large CNVs or identify chromosomal aberrations.

To select for higher-confidence SVs, increase the minimum quality (which corresponds to the read support, RE in Sniffles' VCF). In the tables, the qual column has been winsorized at 100 to make it easier to select practical ranges. The distribution of quality for all variants looks like this:

In this report, we don't consider variation in alternate chromosomes or the mitochondrial genome.
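
As an illustration of the quality handling described above, winsorizing the qual column at 100 amounts to capping every value above 100 at 100 and leaving lower values untouched. A minimal sketch with a hypothetical function name:

```python
def winsorize_qual(quals, cap=100):
    """Cap read-support values at `cap`; lower values are unchanged."""
    return [min(q, cap) for q in quals]


capped = winsorize_qual([5, 42, 100, 250])
```

A call supported by 250 reads and one supported by 100 reads therefore show the same qual in the tables, which keeps the filtering range compact.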

SV database and frequency annotation

The variants were compared to catalogs of known SVs. The frequency estimates are based on the gnomAD-SV catalog: each variant is assigned the maximum frequency among catalog variants with reciprocal overlap >10%. We filter out calls that match the GIAB (Zook et al. 2020) public catalog and 11 in-house control genomes (nanopore + Sniffles) to remove variants that pass the frequency filter based on gnomAD-SV simply because they can't be detected by short reads. Finally, variants are flagged if they overlap DGV (any overlap) or Clinical SVs (nstd102 in dbVar) (reciprocal overlap >50%).
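
To make the reciprocal overlap criterion concrete, here is a minimal sketch. It assumes half-open (start, end) coordinates on the same chromosome and ignores SV type matching, which an actual annotation step may also require; the function names and the tuple layout of the catalog are hypothetical.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smallest fraction of either interval covered by their intersection."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def max_matching_freq(sv, catalog, min_ro=0.1):
    """Maximum frequency among catalog SVs with reciprocal overlap > min_ro.

    `sv` is (start, end); `catalog` holds (start, end, freq) tuples.
    Returns 0.0 when nothing matches.
    """
    freqs = [freq for (start, end, freq) in catalog
             if reciprocal_overlap(sv[0], sv[1], start, end) > min_ro]
    return max(freqs, default=0.0)


catalog = [(0, 100, 0.02), (50, 150, 0.3), (500, 600, 0.9)]
freq = max_matching_freq((0, 100), catalog)
```

Taking the minimum of the two fractions is what makes the overlap reciprocal: a small catalog variant nested inside a very large call covers only a small fraction of that call, so it does not count as a match.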

Gene annotation

The GENCODE gene annotation was used to flag variants as coding, UTR, promoter or intronic (prioritized in this order) and to compute the distance to the nearest gene. While we consider lncRNA and miRNA in the annotation (genes column), most tables focus on protein-coding genes.

The pLI score was computed by the gnomAD project. It represents the probability that the gene is intolerant to loss-of-function variants. A variant affecting a gene with a high pLI score (e.g. >0.9) is more likely to have a biological impact. The variants in each section are ordered to show those affecting genes with the highest pLI first.
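
The pLI-based ordering can be sketched as a descending sort. The row layout (a list of dicts with a "pLI" key) is an assumption for illustration; the -1 placeholder for genes without a score is taken from the column description in this report, and it naturally sorts last.

```python
def order_by_pli(rows):
    """Sort rows so that genes with the highest pLI come first."""
    return sorted(rows, key=lambda row: row["pLI"], reverse=True)


rows = [
    {"gene": "GENE_A", "pLI": 0.12},
    {"gene": "GENE_B", "pLI": -1},    # no pLI information available
    {"gene": "GENE_C", "pLI": 0.98},
]
ordered = [row["gene"] for row in order_by_pli(rows)]
```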

Genes of interest

There are 840 genes in 4 gene lists:

gene_list          genes
centoicu             514
exp-screen-sema4     283
neoepi               275
custom                 2

They contain the following genes:

Filters

In most tables, we removed common variants, i.e. either:

  • frequency higher than 1% in gnomAD-SV
  • seen in the SV catalog from long-read studies
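
The two criteria above combine with a logical OR: a variant is treated as common if either one applies. A minimal sketch, where the function and argument names are hypothetical:

```python
def is_common(freq, in_longread_catalog, max_freq=0.01):
    """True if the variant should be removed as common.

    freq: allele frequency of similar SVs in gnomAD-SV.
    in_longread_catalog: whether the call matched the long-read SV catalog.
    """
    return freq > max_freq or in_longread_catalog


calls = [
    {"id": "sv1", "freq": 0.0, "in_longread_catalog": False},
    {"id": "sv2", "freq": 0.05, "in_longread_catalog": False},
    {"id": "sv3", "freq": 0.0, "in_longread_catalog": True},
]
rare_ids = [c["id"] for c in calls
            if not is_common(c["freq"], c["in_longread_catalog"])]
```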

Column names

  • pLI probability of loss-of-function intolerance described above (-1 if no information available).
  • type SV type
  • ac allele count: 1 for het or 2 for hom.
  • qual the call quality. Corresponds to the read support (RE field in the Sniffles' VCF)
  • freq allele frequency of similar SV in gnomAD-SV (>10K genomes sequenced with Illumina whole-genome sequencing).
  • dgv does the variant overlap any variant in DGV in any way?
  • clinsv any similar clinical SV variants (nstd102 in dbVar)? "Similar" defined as reciprocal overlap > 50%.
  • ctcf does the variant overlap a CTCF binding site? From ENCODE track for kidney.
  • cres does the variant overlap a regulatory region? From ENCODE track for kidney.
  • simp.rep is the variant in or close to a simple repeat? See the simple repeat track in the UCSC Genome Browser.
  • cons does the variant overlap a conserved element? As defined by the 100 vertebrates phastCons track.
  • impact potential impact based on gene annotation: coding, UTR, promoter, intronic.
  • genes a summary of the genes overlapped by the variant and their impact.
  • gene.dist distance to the nearest gene which is specified by gene.near.
  • sel.gene.dist distance to the nearest gene of interest which is specified by sel.gene.near.
  • cds.dist distance to the nearest coding regions, useful for intronic variants for example.
  • nb.pc.genes number of protein-coding genes overlapped by the variant.
  • cov median scaled coverage for large variants (>100 kbp), if indexcov results are available. The number of bins overlapping the variant is given in parentheses (the higher, the more confident).
  • large a boolean value flagging large SVs (>100 kbp). Because they often overlap hundreds of genes and are potentially false-positives, it is convenient to remove them from the table using this column. There is a tab specifically about large SVs that is more appropriate to investigate those.

TSV file

In addition to the report, a gene-centric TSV file is written with all the annotation described above. Gene-centric means that there is one row for each gene-variant pair. This helps filter on gene features (e.g. gene of interest, pLI).
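
Because the TSV has one row per gene-variant pair, filtering on gene features is a plain row filter. The sketch below uses Python's csv module; the column names "gene" and "pLI" and the function name are assumptions for illustration and should be adapted to the actual header of the file.

```python
import csv
import io


def filter_gene_rows(tsv_handle, genes_of_interest, min_pli=0.9):
    """Keep rows whose gene is of interest or has pLI above min_pli."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row for row in reader
            if row["gene"] in genes_of_interest
            or float(row["pLI"]) > min_pli]


# Small in-memory example standing in for the real file
example = io.StringIO(
    "gene\tpLI\ttype\n"
    "GENE_A\t0.95\tDEL\n"
    "GENE_B\t0.10\tINS\n"
    "GENE_C\t-1\tDEL\n"
)
kept = filter_gene_rows(example, genes_of_interest={"GENE_C"})
kept_genes = [row["gene"] for row in kept]
```

With a real file, the StringIO object would be replaced by an open file handle.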

Tiers

  • Tier 1: Deletions + insertions in exons, and aneuploidy.
  • Tier 2: INS + DEL + tandem DUP in genes (introns + exons).
  • Tier 3: Genome-wide range. The genome-wide scope of SVs, including rearrangements (inversions, translocations). These events will be hard to interpret but may be useful to track over time.