Mutation detection and interpretation
An educational reference on how variants are detected and how they are interpreted. The page covers Sanger sequencing, short-read next-generation sequencing, long-read sequencing, whole-exome and whole-genome scope, variant calling, annotation, the published ACMG/AMP five-tier classification framework, ClinVar, gnomAD, the major in silico pathogenicity predictors (REVEL, CADD, AlphaMissense), the ClinGen variant-curation infrastructure, and the GA4GH Variation Representation Specification. These resources are presented as the published interpretive standard the field uses; variant interpretation is not a service Evagene offers.
Short version. Sanger sequencing remains the confirmation gold standard for individual variants; short-read next-generation sequencing on Illumina platforms is the screening workhorse; long-read sequencing (PacBio HiFi, Oxford Nanopore) resolves repeat expansions and structural variants that short reads miss. Variant interpretation is standardised by the ACMG/AMP 2015 framework (Richards et al., Genetics in Medicine 17:405), refined gene by gene by ClinGen variant-curation expert panels, and surfaced via ClinVar. Modern in silico predictors (REVEL, CADD, AlphaMissense) supply computational evidence to the framework, and gnomAD supplies the population-frequency denominator. This page describes the framework for educational purposes.
Sanger sequencing
Frederick Sanger's chain-termination method (Sanger, Nicklen, Coulson 1977) was the dominant sequencing technology of the 1980s and 1990s and remains the confirmation standard for individual candidate variants identified by next-generation sequencing. The method incorporates dideoxynucleotide terminators at random positions during DNA synthesis, generating a population of fragments whose lengths differ by single nucleotides and that are size-resolved by capillary electrophoresis with fluorescent detection. Read lengths reach 800–1,000 base pairs at high accuracy (Phred Q>40 over most of the read), and any single locus can be sequenced from PCR product on a per-sample, per-amplicon basis.
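The Phred scale quoted above is a logarithmic encoding of the per-base error probability, Q = -10 log10(p). A minimal sketch of the conversion follows; the function names are illustrative and do not come from any particular base-calling package.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score into a per-base error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a per-base error probability (0 < p <= 1) into a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given per-base Phred scores."""
    return sum(phred_to_error_probability(q) for q in qualities)


if __name__ == "__main__":
    # Q40 corresponds to one expected error in 10,000 bases.
    print(phred_to_error_probability(40))
    # A hypothetical 900-base read at a uniform Q40.
    print(expected_errors([40] * 900))
```

On this scale Q30 corresponds to one expected error per 1,000 bases and Q40 to one per 10,000.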
Sanger sequencing's role today is targeted: confirmation of NGS-called variants in a clinical accreditation context (where a second orthogonal method may be required to issue a result), targeted resequencing of a single locus in a research follow-up, and educational use, where the trace remains a more interpretable artefact for teaching nucleotide-by-nucleotide reading than the alignment summary of an NGS pipeline.
Short-read next-generation sequencing
Short-read sequencing on the Illumina platform is the workhorse of human genetics. The reversible-terminator chemistry was published by Bentley et al. 2008 (Nature 456:53), describing the Solexa platform that became Illumina, and demonstrated paired-end sequencing of an individual human genome at roughly 30× coverage. Modern Illumina platforms produce reads of 150–300 base pairs at very high throughput (terabases per run on the highest-capacity instruments) and at very high single-base accuracy (Q30 or better over most of the read).
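Mean coverage such as the 30× figure above is simply the total number of sequenced bases divided by the genome size. The sketch below illustrates that arithmetic; the 3.1-gigabase genome size and the read counts are round-number assumptions for illustration, not values from the cited paper.

```python
import math

# Assumed approximate haploid human genome size, for illustration only.
HUMAN_GENOME_BP = 3_100_000_000


def mean_coverage(n_reads: int, read_length: int, genome_size: int = HUMAN_GENOME_BP) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int = HUMAN_GENOME_BP) -> int:
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # About 620 million 150-base reads give roughly 30x mean coverage.
    print(reads_needed(30, 150))
    print(mean_coverage(620_000_000, 150))
```

This is a mean only: real coverage varies by region with GC content, mappability, and capture efficiency.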
Short-read sequencing's strengths are throughput, cost per base, and accuracy. Its limitations are read length and the difficulty of mapping repetitive sequence: tandem repeats longer than the read length cannot be resolved unambiguously; structural variants whose breakpoints fall inside repetitive sequence are unreliably detected; and large copy-number changes are inferred indirectly through coverage rather than directly through read structure. These are the gaps that long-read sequencing addresses.
Long-read sequencing
Long-read platforms produce reads spanning tens of kilobases per molecule. Earlier-generation platforms paid for that length with a higher per-base error rate; PacBio HiFi reads now achieve Q30 single-molecule accuracy through circular consensus. The two dominant platforms are PacBio HiFi (single-molecule real-time sequencing of circularised DNA, generating a high-accuracy consensus from multiple passes around the same molecule) and Oxford Nanopore (single-molecule sequencing through a protein nanopore, which also enables direct detection of base modifications such as methylation).
Long reads resolve what short reads cannot: trinucleotide repeat expansions in HTT, FMR1, FXN, DMPK, and C9orf72; structural-variant breakpoints in segmental duplications and pericentromeric regions; haplotype phasing across distances far longer than a short read or read pair can span; and the centromeric and telomeric sequences that the T2T-CHM13 reference (Nurk et al. 2022, Science) finally completed for the human genome. Long reads are increasingly the platform of choice for diagnostic odyssey cases unsolved by short-read exome sequencing.
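Sizing a repeat expansion amounts to counting consecutive copies of a motif within a read that spans the whole tract. The following is a deliberately naive sketch using exact string matching on a made-up sequence. Dedicated repeat-genotyping tools work from aligned reads and tolerate interruptions and sequencing errors, which this sketch does not.

```python
def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive, uninterrupted copies of motif."""
    if not motif:
        raise ValueError("motif must be non-empty")
    seq = sequence.upper()
    unit = motif.upper()
    best = 0
    for start in range(len(seq) - len(unit) + 1):
        run = 0
        pos = start
        # Extend the run one motif copy at a time from this start position.
        while seq.startswith(unit, pos):
            run += 1
            pos += len(unit)
        best = max(best, run)
    return best


if __name__ == "__main__":
    # Hypothetical read: flanking bases, 42 CAG units, then an interrupted tail.
    read = "GGC" + "CAG" * 42 + "CAACAG" + "CCG"
    print(longest_repeat_run(read, "CAG"))
```

A short read of 150 bases cannot contain more than 50 trinucleotide units, so tracts beyond that length can only be sized directly from reads long enough to span them.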
Whole exome vs whole genome
Whole-exome sequencing (WES) targets the protein-coding portion of the genome (approximately 1% of total genomic DNA, roughly 30 megabases) by hybridisation capture with exon-targeting probes. Its strength is cost-efficiency at high coverage of the regions where the great majority of known Mendelian disease variants lie; its weaknesses are that it misses non-coding variation, detects copy-number variation unreliably because capture efficiency varies between targets, and depends on the completeness and accuracy of the exome target file.
Whole-genome sequencing (WGS) sequences the entire genome at lower per-base cost than well-covered WES and provides more uniform coverage, including the non-coding regions that contain regulatory variants, structural-variant breakpoints, and deep-intronic splicing variants. WGS is also better suited to copy-number variant detection by depth-of-coverage analysis. The cost differential between WES and WGS has narrowed materially across the late 2010s and early 2020s, and WGS has become the platform of choice for many large research programmes (Genomics England's 100,000 Genomes Project, the All of Us Research Program, the UK Biobank whole-genome cohort).
Variant calling
Sequencing reads are aligned to a reference genome (GRCh38 / hg38, increasingly being supplemented or replaced by T2T-CHM13 and graph-genome references) and variants are then called from the alignment. The Genome Analysis Toolkit (GATK) HaplotypeCaller, developed at the Broad Institute, has been the dominant short-variant caller: it performs local de novo assembly of candidate haplotypes in regions showing evidence of variation, scores reads against those haplotypes with a pair hidden Markov model, and is typically followed by a separate filtering step (hard filters, or GATK's Variant Quality Score Recalibration, which fits a Gaussian mixture model). DeepVariant (Poplin et al. 2018, Nature Biotechnology 36:983), developed at Google, applies a convolutional neural network to pile-up images and has shown improved precision on indels and on platforms with non-standard error profiles.
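The statistical core of calling a genotype from aligned reads can be illustrated with a toy diploid likelihood model at a single biallelic site. This is a teaching sketch only: it is not how HaplotypeCaller or DeepVariant work, and the uniform error rate and the read counts are assumptions chosen for illustration.

```python
import math


def genotype_log_likelihoods(ref_count: int, alt_count: int, error_rate: float = 0.01) -> dict:
    """Log-likelihood of the read counts under each diploid genotype.

    Assumes independent reads and a single uniform base-error rate.
    """
    if not 0 < error_rate < 0.5:
        raise ValueError("error_rate must be in (0, 0.5)")
    result = {}
    for alt_copies, label in ((0, "0/0"), (1, "0/1"), (2, "1/1")):
        alt_fraction = alt_copies / 2
        # Probability that a single read shows the alternate allele.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        result[label] = alt_count * math.log(p_alt) + ref_count * math.log(1 - p_alt)
    return result


def call_genotype(ref_count: int, alt_count: int, error_rate: float = 0.01) -> str:
    """Return the maximum-likelihood genotype label."""
    likelihoods = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    for ref, alt in ((30, 0), (14, 16), (0, 28), (29, 1)):
        print(ref, alt, call_genotype(ref, alt))
```

A single alternate read among thirty is explained more cheaply by a sequencing error than by a heterozygous genotype, which is why the last case is still called homozygous reference.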
Structural-variant calling is a separate and harder problem. Tools such as Manta, DELLY, and the long-read-specific Sniffles and pbsv aggregate split-read, discordant-pair, depth-of-coverage, and de novo assembly evidence to call larger events. Copy-number variant calling from short-read data uses depth-of-coverage normalisation against a reference panel and is sensitive to capture-uniformity artefacts. Mosaic and somatic variant calling (in cancer or developmental mosaicism) requires distinct callers (Mutect2, Strelka2, deep-coverage approaches) that handle low variant allele fraction and tumour-normal subtraction.
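Depth-of-coverage normalisation against a reference panel can be sketched as a per-target log2 ratio. The version below is a minimal illustration with made-up depths: each sample is scaled by its own median depth, and the test sample is then compared with the panel median at each target. Production callers add corrections (GC content, batch effects, segmentation) that are omitted here.

```python
import math
from statistics import median


def normalise_depths(depths):
    """Scale per-target depths by the sample's own median depth."""
    centre = median(depths)
    if centre <= 0:
        raise ValueError("median depth must be positive")
    return [d / centre for d in depths]


def log2_ratios(sample_depths, panel_depths):
    """Per-target log2 ratio of a test sample against a reference panel.

    panel_depths is a list of samples, each a list of per-target depths
    in the same target order as sample_depths.
    """
    sample = normalise_depths(sample_depths)
    panel = [normalise_depths(p) for p in panel_depths]
    ratios = []
    for i, value in enumerate(sample):
        reference = median(p[i] for p in panel)
        if value > 0 and reference > 0:
            ratios.append(math.log2(value / reference))
        else:
            ratios.append(float("nan"))  # no usable signal at this target
    return ratios


if __name__ == "__main__":
    # Hypothetical depths over five capture targets; the third target has
    # half the expected depth in the test sample.
    test_sample = [100, 100, 50, 100, 100]
    panel = [[100] * 5, [200] * 5, [50] * 5]
    print(log2_ratios(test_sample, panel))
```

A heterozygous deletion halves the expected depth and so gives a log2 ratio near -1, while a single-copy gain gives about log2(3/2), roughly 0.58.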
Variant annotation
A called variant is a coordinate-and-genotype tuple; annotation translates it into biology. The dominant tools are Ensembl Variant Effect Predictor (VEP) at the EBI (described by McLaren et al. 2016, Genome Biology 17:122), ANNOVAR from the Wang laboratory, and SnpEff. VEP integrates with the Ensembl gene model, uses Sequence Ontology (SO) consequence terms, and supports plug-ins and custom annotations that bring REVEL, CADD, SpliceAI, AlphaMissense, gnomAD, and ClinVar data into the same record. The output is a per-variant report giving HGVS coding and protein nomenclature, predicted molecular consequence, and a panel of supporting evidence.
The ACMG/AMP framework
Variant interpretation in clinical and research genetics is standardised by the 2015 ACMG/AMP guidelines (Richards et al., Genetics in Medicine 17:405). The framework defines a five-tier classification — pathogenic, likely pathogenic, variant of uncertain significance (VUS), likely benign, benign — assembled from a coded set of pathogenic and benign criteria of varying strength.
Pathogenic criteria range from PVS1 (very strong; null variant in a gene where loss of function is a known mechanism) through PS (strong; e.g. well-established functional studies showing a damaging effect, or confirmed de novo occurrence), PM (moderate; e.g. located in a mutational hotspot, or absent from controls), down to PP (supporting; e.g. multiple computational lines of evidence, or co-segregation with disease, which can be applied at higher strength as segregation data accumulate across families). Benign criteria mirror this structure with fewer levels: BA1 (stand-alone; allele frequency above a threshold incompatible with disease), BS (strong), and BP (supporting). A set of combining rules turns the criteria met into the final five-tier classification.
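The combining rules of the 2015 guidelines can be written as a short function over the number of criteria met at each strength. The sketch below follows the published combining table as summarised from Richards et al. 2015, under two simplifying assumptions that are not part of the guideline: each code is counted at its default strength (strength-modified codes are not handled), and evidence is treated as contradictory only when both a pathogenic-side and a benign-side tier are reached.

```python
from collections import Counter

# Order matters only for readability; no prefix here is a prefix of another.
PREFIXES = ("PVS", "PS", "PM", "PP", "BA", "BS", "BP")


def count_by_strength(criteria):
    """Count criteria codes such as 'PVS1' or 'BP4' by their strength prefix."""
    counts = Counter()
    for code in criteria:
        code = code.upper()
        for prefix in PREFIXES:
            if code.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            raise ValueError(f"unrecognised criterion code: {code}")
    return counts


def classify_acmg_2015(criteria) -> str:
    """Apply the 2015 ACMG/AMP combining rules to a list of criteria codes."""
    c = count_by_strength(criteria)
    pvs, ps, pm, pp = c["PVS"], c["PS"], c["PM"], c["PP"]
    ba, bs, bp = c["BA"], c["BS"], c["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Assumption: "contradictory" means both sides reach a classification.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "Uncertain significance"


if __name__ == "__main__":
    print(classify_acmg_2015(["PVS1", "PS3"]))
    print(classify_acmg_2015(["PM2", "PP3"]))
```

Gene-specific specifications often change which codes apply and at what strength, so a function like this is a teaching aid for the combining logic and not a classifier.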
The framework has since been refined within ClinGen: the Sequence Variant Interpretation working group has published general refinements (such as the PVS1 decision tree and the splicing impact framework), and gene- and disease-specific specifications have been published for individual disease genes (BRCA1/2 specifications, RASopathy gene specifications, mitochondrial-DNA specifications). It has also been quantified by the Bayesian framework of Tavtigian et al. 2018 (Genetics in Medicine 20:1054), which expresses each evidence strength as odds of pathogenicity, combines the evidence by Bayes' rule under an explicit prior probability, and maps the resulting posterior probability to the five tiers. At the default odds the Bayesian formulation reproduces nearly all of the categorical 2015 combining rules, and it supports calibrated combination of in silico predictor evidence at non-default strengths.
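The Bayesian formulation can be sketched in a few lines. The numbers below are the defaults as recalled from Tavtigian et al. 2018 (prior probability 0.10, odds of 350:1 for very strong evidence, with strong, moderate, and supporting evidence scaled as the 1/2, 1/4, and 1/8 powers) and should be checked against the paper before any reuse. Rounding the posterior to three decimal places before applying the thresholds is an assumption of this sketch, made because one strong plus one moderate criterion evaluates to just under 0.900.

```python
PRIOR_PROBABILITY = 0.10
ODDS_VERY_STRONG = 350.0
# Each evidence strength contributes a fractional power of the very-strong odds.
EXPONENTS = {"supporting": 1 / 8, "moderate": 1 / 4, "strong": 1 / 2, "very_strong": 1.0}


def combined_odds(pathogenic=(), benign=()) -> float:
    """Combined odds of pathogenicity; benign evidence uses negative exponents."""
    exponent = sum(EXPONENTS[s] for s in pathogenic) - sum(EXPONENTS[s] for s in benign)
    return ODDS_VERY_STRONG ** exponent


def posterior_probability(odds: float, prior: float = PRIOR_PROBABILITY) -> float:
    """Bayes' rule in odds form, returned as a probability."""
    return odds * prior / ((odds - 1.0) * prior + 1.0)


def classify_posterior(posterior: float) -> str:
    """Map a posterior probability to the five tiers."""
    p = round(posterior, 3)  # assumption of this sketch, see lead-in
    if p > 0.99:
        return "Pathogenic"
    if p >= 0.90:
        return "Likely pathogenic"
    if p < 0.001:
        return "Benign"
    if p < 0.10:
        return "Likely benign"
    return "Uncertain significance"


if __name__ == "__main__":
    for evidence in (["very_strong", "strong"], ["strong", "moderate"], ["strong", "strong"]):
        post = posterior_probability(combined_odds(pathogenic=evidence))
        print(evidence, round(post, 3), classify_posterior(post))
```

With these defaults, two strong criteria give a posterior of about 0.975 (likely pathogenic, where the 2015 table says pathogenic) and one very strong plus one moderate gives about 0.994 (pathogenic, where the 2015 table says likely pathogenic); the remaining combining rules fall in the expected tiers.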
ClinGen, ClinVar, OMIM, DECIPHER, gnomAD
The community infrastructure for variant interpretation is held by a small number of public resources:
- ClinGen — gene-disease validity curation, dosage-sensitivity assessment, variant-curation expert panel infrastructure. Defines the gene-specific specifications of the ACMG/AMP framework.
- ClinVar — the public archive of variant interpretations, described by Landrum et al. 2018 (Nucleic Acids Research 46:D1062). Submissions are stratified by review status (zero to four stars; expert-panel and practice-guideline submissions hold the two highest tiers, three and four stars respectively).
- OMIM — the Online Mendelian Inheritance in Man catalogue of genes and genetic disorders.
- DECIPHER — a database of submitted patient-level structural variants and dosage-sensitivity information, hosted at the Sanger Institute.
- gnomAD — described by Karczewski et al. 2020 (Nature 581:434), the population-frequency denominator and gene-level constraint resource.
In silico pathogenicity predictors
Computational predictors estimate the likelihood that a given missense (or other) variant disrupts protein function. The major predictors used in current ACMG/AMP practice:
- REVEL — an ensemble meta-predictor for missense variants, combining 13 individual scores via a random forest, published by Ioannidis et al. 2016 (American Journal of Human Genetics 99:877). ClinGen has calibrated REVEL thresholds against ACMG/AMP evidence strengths.
- CADD — Combined Annotation-Dependent Depletion, a genome-wide score that integrates conservation, regulatory annotation, and protein-level features, scaled to a Phred-like rank. Updated periodically; the current version is described by Rentzsch et al. 2019 (Nucleic Acids Research 47:D886).
- AlphaMissense — a 2023 deep-learning predictor, adapted from the AlphaFold architecture and combining structural context with protein language modelling, that scores all possible missense substitutions in the human proteome, published by Cheng et al. 2023 (Science 381:eadg7492). The pre-computed scores are publicly available and have been incorporated into Ensembl VEP.
- SpliceAI — the deep-learning splice-impact predictor of Jaganathan et al. 2019 (Cell 176:535), used as computational evidence in the ClinGen splicing impact framework for the ACMG/AMP criteria.
In silico predictors are supporting evidence under the ACMG/AMP framework, not stand-alone evidence: a high REVEL or AlphaMissense score is one input among the population-frequency, segregation, functional, and clinical evidence that a variant-curation expert panel weighs.
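The Phred-like rank used for scaled CADD scores, mentioned in the list above, expresses where a variant's raw score ranks among all scored variants: the top 10% score 10 or more, the top 1% score 20 or more, and so on. A minimal sketch of that transformation follows; the ranks and totals are illustrative numbers, not CADD data.

```python
import math


def phred_scaled_rank(rank: int, total: int) -> float:
    """Phred-like score for a variant ranked `rank` (1 = most deleterious) of `total`."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10 * math.log10(rank / total)


if __name__ == "__main__":
    total = 1_000_000  # hypothetical number of scored variants
    for rank in (100_000, 10_000, 1_000):
        print(rank, phred_scaled_rank(rank, total))
```

Because the scale is rank-based, a scaled score of 20 says that the variant is among the top 1% of scored variants; it is not a 99% probability of pathogenicity.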
GA4GH and shared vocabulary
The Global Alliance for Genomics and Health (GA4GH) maintains the shared technical standards on which variant data exchange depends. The Variation Representation Specification (VRS) defines a precise, machine-readable, position-stable representation of variants (including normalisation, allele-set semantics, and identifier minting) that supports unambiguous exchange across resources. The complementary HGVS nomenclature (maintained by the HGVS Variant Nomenclature Committee) provides the human-readable variant descriptor; the two are increasingly used in tandem.
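Identifier minting in VRS rests on a truncated digest that GA4GH calls sha512t24u: the SHA-512 digest of a byte string, truncated to its first 24 bytes and base64url-encoded into 32 characters. The sketch below implements only that digest step with the Python standard library. Building a full VRS identifier also requires serialising the variation object in the canonical form defined by the specification, which is not shown here.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


if __name__ == "__main__":
    # The same input bytes always yield the same 32-character digest.
    print(sha512t24u(b"ACGT"))
    print(sha512t24u(b""))
```

Because the digest depends only on the serialised content, two resources that represent the same variant in the same normalised form arrive at the same identifier without consulting a central registry.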
For phenotype, the Human Phenotype Ontology (HPO) at hpo.jax.org and the GA4GH Phenopackets standard support computable phenotype-aware exchange of cases. Evagene's Phenopackets export aligns with this standard.
Where Evagene fits
Evagene is an academic, research, and educational pedigree modelling platform. It does not perform sequencing, does not call or annotate variants, does not return ACMG/AMP classifications, and is not a variant interpretation service. Where this page touches the platform, the connection is between the structured family-history information Evagene records and the molecular result a clinician or laboratory has obtained elsewhere: known variant data captured against an individual on the pedigree may inform inheritance-pattern teaching, segregation tracing in a research family, or the input to a published family-history risk model.
Where computation against the canonical BOADICEA implementation is wanted, Evagene exports a ##CanRisk 2.0 pedigree file that the user uploads at canrisk.org; BOADICEA is licensed by the University of Cambridge and is not bundled in Evagene. For Tyrer / Duffy / Cuzick 2004 computation, the platform implements an IBIS-style approximation of the published algorithm (the official IBIS Breast Cancer Risk Evaluator binary is the canonical implementation; its full coefficients are not public). These conventions are described in detail on our risk-model pages.
Sources cited on this page
- Richards S, et al. Standards and guidelines for the interpretation of sequence variants. Genetics in Medicine 2015;17:405 — PMID 25741868 (the ACMG/AMP guidelines).
- Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020;581:434 — PMID 32461654 (gnomAD v2).
- Landrum MJ, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research 2018;46:D1062 — PMID 29165669.
- Cheng J, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 2023;381:eadg7492 — PMID 37733863.
- Rentzsch P, et al. CADD: predicting the deleteriousness of variants. Nucleic Acids Research 2019;47:D886 — PMID 30371827.
- Ioannidis NM, et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. American Journal of Human Genetics 2016;99:877 — PMID 27666373.
- Poplin R, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 2018;36:983 — PMID 30247488 (DeepVariant).
- McLaren W, et al. The Ensembl Variant Effect Predictor. Genome Biology 2016;17:122 — PMID 27268795.
- Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456:53 — PMID 18987734.
- Jaganathan K, et al. Predicting splicing from primary sequence with deep learning. Cell 2019;176:535 — PMID 30661751 (SpliceAI).
- Tavtigian SV, et al. Modeling the ACMG/AMP variant classification guidelines as a Bayesian classification framework. Genetics in Medicine 2018;20:1054 — PMID 29300386.
- ClinGen — clinicalgenome.org; ClinVar — ncbi.nlm.nih.gov/clinvar; OMIM — omim.org; DECIPHER — deciphergenomics.org; GA4GH — ga4gh.org.