Mapping & Gene Identification — Linkage, GWAS, Exome, WGS

Short version. Gene mapping is the empirical programme that turns a Mendelian phenotype into a chromosomal address and then into a specific transcript. Three eras: classical linkage and positional cloning (1980s-90s, with CFTR and HTT as the canonical successes), GWAS (mid-2000s onwards, common variants of small effect), and high-throughput sequencing (exome from 2009, whole-genome from the early 2010s, with the 100,000 Genomes Project and Matchmaker Exchange now the infrastructure for ultra-rare phenotypes). Pedigrees remain the data structure throughout.

Classical linkage analysis and the LOD score

Linkage analysis tests whether a marker locus and a phenotype-causing locus segregate non-independently in pedigrees. The recombination fraction θ between two loci is the probability of a recombination event between them in a single meiosis; for unlinked loci on different chromosomes, θ = 0.5, and for tightly linked loci, θ approaches 0. The statistical test for linkage is the LOD score (logarithm of the odds), introduced by Newton Morton in 1955 (Am J Hum Genet 7:277-318, PMID 13258560):

Z(θ) = log₁₀ [ L(θ) / L(0.5) ]

where L(θ) is the likelihood of the pedigree given a recombination fraction θ and L(0.5) is the likelihood under the null hypothesis of no linkage. The conventional thresholds (Morton 1955) are LOD ≥ 3.0 to declare linkage and LOD ≤ -2.0 to exclude linkage. The LOD score combines additively across families, which is the central practical reason it is the framework that mapped most Mendelian conditions before high-throughput sequencing.
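
As a minimal sketch of the formula above, the following Python computes a two-point LOD score for the simplest textbook case: phase-known, fully informative meioses, where the likelihood reduces to θ^k (1-θ)^(n-k) for k recombinants in n meioses. The function names, the example family counts, and the grid search over θ are illustrative assumptions; real pedigrees need a full likelihood calculation in dedicated linkage software.

```python
import math

def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score Z(theta) for phase-known, fully informative meioses.

    Assumes L(theta) = theta^k * (1 - theta)^(n - k), with k recombinants
    among n meioses, compared against the null L(0.5) = 0.5^n.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null

def combined_lod(families, theta: float) -> float:
    """LOD scores add across independent families at the same theta."""
    return sum(lod_score(k, n, theta) for k, n in families)

def max_lod(families):
    """Grid search over theta = 0.01 .. 0.50; returns (best LOD, theta)."""
    thetas = [i / 100 for i in range(1, 51)]
    return max((combined_lod(families, t), t) for t in thetas)

# Hypothetical data: (recombinants, informative meioses) per family.
families = [(0, 10), (1, 12)]
best_lod, theta_hat = max_lod(families)
print(f"max LOD = {best_lod:.2f} at theta = {theta_hat:.2f}")
```

Neither hypothetical family reaches the LOD ≥ 3.0 threshold on its own at θ = 0.05, but the summed score does, which is the additivity property the text describes.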

Linkage analysis with classical markers (RFLPs, microsatellites, then SNP panels) was the dominant gene-mapping strategy from the early 1980s through to the late 1990s. Botstein et al. 1980 (Am J Hum Genet 32:314, PMID 6247908) provided the genome-wide marker recipe; the first major positional successes followed within a decade.

Positional cloning: CFTR (1989) and HTT (1993)

Positional cloning is the strategy of locating a phenotype's gene by linkage, narrowing the candidate interval by additional markers and recombinant pedigrees, then identifying the gene by examining the candidate interval. The two canonical positional-cloning successes are CFTR in cystic fibrosis (1989) and HTT in Huntington disease (1993).

Cystic fibrosis was mapped to chromosome 7q31 by Tsui and colleagues in 1985 using RFLPs in CF families, and the gene itself was cloned by the Riordan, Rommens, and Kerem teams across three companion Science papers in 1989. Riordan et al. 1989 (Science 245:1066-1073, PMID 2475911) reported the cDNA cloning and predicted protein sequence of the cystic fibrosis transmembrane conductance regulator. Rommens et al. 1989 (Science 245:1059-1065, PMID 2772657) reported the chromosome walking and jumping that identified the candidate region. Kerem et al. 1989 (Science 245:1073-1080, PMID 2570460) identified the three-base-pair deletion then called ΔF508 (now written c.1521_1523delCTT, p.Phe508del, the most common pathogenic variant in CFTR) and showed that it was carried on the disease-associated haplotype in a majority of CF chromosomes. Cystic fibrosis (OMIM 219700) is widely cited as the first major Mendelian condition whose gene was found by linkage-based positional cloning alone, without prior knowledge of the underlying biochemical defect and without a chromosomal rearrangement to mark the locus.

Huntington disease was mapped to chromosome 4p16 in 1983 (Gusella et al., Nature 306:234), but the gene took another decade to clone. The Huntington's Disease Collaborative Research Group identified the IT15 (now HTT) CAG-repeat expansion in 1993 (Cell 72:971-983, PMID 8458085). The CAG-expansion mechanism explained both the dominant inheritance and the phenomenon of anticipation observed in Huntington pedigrees (OMIM 143100). The decade-long positional-cloning effort for HTT remains the canonical case study for the difficulty of finishing the last interval of a positional clone before the high-throughput sequencing era.

From linkage to association

Linkage analysis is well suited to highly penetrant, single-locus, large-effect phenotypes that segregate in informative pedigrees. It is poorly suited to common diseases, where individual risk variants typically have small effects and any single family says little about the underlying genetic architecture. Risch and Merikangas 1996 (Science 273:1516-1517, PMID 8801636) made this point formally: for a common disease with modest relative risks per variant (about 2 or lower, which includes the 1.2-1.5 range typical of later GWAS findings), the sample sizes required for linkage detection are prohibitively large, while association studies are tractable at sample sizes of thousands. Risch and Merikangas thus reframed common-disease genetics around genome-wide association testing and a dense genome-wide marker map, and the framework they proposed became GWAS.
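
To give a feel for why association is tractable at sample sizes of thousands, the sketch below estimates how many alleles per group a case-control comparison needs to detect a given per-allele odds ratio. It is not the calculation Risch and Merikangas performed; it is the standard two-proportion normal approximation, and the function names, the control allele frequency of 0.3, the significance level of 5 × 10^-8, and the 80% power target are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)

def alleles_per_group(control_freq: float, odds_ratio: float,
                      alpha: float = 5e-8, power: float = 0.8) -> int:
    """Alleles needed in each of cases and controls (two-sided test).

    Two-proportion normal approximation:
    n = (z_alpha + z_beta)^2 * (p0*q0 + p1*q1) / (p1 - p0)^2
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p0 * (1.0 - p0) + p1 * (1.0 - p1)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2)

for odds_ratio in (1.2, 1.5, 2.0):
    n_alleles = alleles_per_group(0.3, odds_ratio)
    # Each diploid individual contributes two alleles.
    print(odds_ratio, math.ceil(n_alleles / 2), "individuals per group")
```

Under these assumptions the required sample grows quickly as the odds ratio shrinks towards 1, but stays in the thousands of individuals per group for an odds ratio of 1.2.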

GWAS — AMD (2005) and the WTCCC (2007)

The first widely cited successful genome-wide association study was Klein et al. 2005 (Science 308:385-389, PMID 15761122) on age-related macular degeneration. Klein et al. genotyped approximately 100,000 SNPs in 96 cases and 50 controls and identified a striking association signal at CFH on chromosome 1q32. The CFH Y402H variant (rs1061170) was associated with an odds ratio of about 4.6 per allele — a large effect by GWAS standards and the reason the signal was detectable in a small sample. The result demonstrated that genome-wide common-variant association was a tractable strategy for common disease.

The Wellcome Trust Case Control Consortium 2007 paper (Nature 447:661-678, PMID 17554300) was the proof-of-concept for the modern GWAS infrastructure: 14,000 cases (2,000 each across seven common conditions: bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 diabetes, type 2 diabetes) and 3,000 shared controls, genotyped on 500,000 SNPs. The paper reported 24 independent association signals at its own significance threshold of P < 5 × 10^-7; the now-conventional genome-wide threshold of P < 5 × 10^-8 was adopted by the field afterwards. The paper also established the imputation, quality-control, and multiple-testing-correction pipelines that became standard, and helped establish the analysis tools (PLINK, IMPUTE, SNPTEST) used by the field. The NHGRI-EBI GWAS Catalog currently lists more than 50,000 published associations from over 6,000 publications across more than 5,000 phenotypes.
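
The conventional P < 5 × 10^-8 threshold is usually motivated as a Bonferroni correction: a family-wise error rate of 0.05 divided across roughly one million independent common-variant tests in the genome. The short sketch below works through that arithmetic. The figure of one million independent tests is the usual rule-of-thumb assumption, and the function names are illustrative.

```python
def bonferroni_threshold(family_wise_alpha: float,
                         n_independent_tests: int) -> float:
    """Per-test P-value threshold that caps the family-wise error rate."""
    if n_independent_tests <= 0:
        raise ValueError("n_independent_tests must be positive")
    return family_wise_alpha / n_independent_tests

def is_genome_wide_significant(p_value: float,
                               threshold: float = 5e-8) -> bool:
    """True if p_value passes the (assumed) genome-wide threshold."""
    return p_value < threshold

# Assumption: about one million independent tests across the genome.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"{threshold:.1e}")
print(is_genome_wide_significant(3e-9), is_genome_wide_significant(2e-7))
```

A signal at P = 2 × 10^-7 would fall short of the conventional threshold even though it passes a looser cut-off of 5 × 10^-7, which is why the choice of threshold matters when comparing early and later GWAS reports.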

GWAS findings for common disease are typically common variants of small effect (odds ratios 1.05-1.5 per allele); explanatory power per variant is small but cumulative across many loci, and polygenic risk scores aggregate the signal across hundreds or thousands of variants. Common-disease genetics organised around polygenic / oligogenic models is covered separately at complex disease pedigree software.
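
A polygenic risk score of the kind described above is, in its simplest form, a weighted sum: each variant's effect-allele dosage (0, 1, or 2) multiplied by its per-allele log odds ratio. The sketch below shows that sum under the assumption of independent, additive effects on the log-odds scale. The variant identifiers and odds ratios are hypothetical placeholders, not published estimates.

```python
import math

def polygenic_risk_score(dosages: dict, odds_ratios: dict) -> float:
    """Sum of dosage * ln(odds ratio) over all scored variants.

    dosages: variant id -> count of effect alleles (0, 1, or 2).
    odds_ratios: variant id -> per-allele odds ratio from a GWAS.
    Raises KeyError if a scored variant has no genotype.
    """
    return sum(dosages[variant] * math.log(odds_ratio)
               for variant, odds_ratio in odds_ratios.items())

# Hypothetical variants and effect sizes, for illustration only.
odds_ratios = {"variant_a": 1.10, "variant_b": 1.20, "variant_c": 1.30}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = polygenic_risk_score(dosages, odds_ratios)
# exp(score) is the combined odds ratio relative to carrying no effect alleles.
print(round(math.exp(score), 3))
```

In practice the raw score is usually standardised against a reference population, and variant weights are adjusted for linkage disequilibrium; neither step is shown here.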

Exome sequencing for Mendelian disease

The application of high-throughput sequencing to Mendelian-disease gene discovery in 2009 transformed the rate of new disease-gene identification. Ng et al. 2009 (Nature 461:272-276, PMID 19684571) reported exome sequencing of four unrelated individuals with the rare autosomal dominant Freeman-Sheldon syndrome (MYH3, OMIM 193700), demonstrating that all four shared rare deleterious variants in the same gene and confirming that exome sequencing of a small number of unrelated affected individuals was sufficient to identify the causative gene for a rare Mendelian phenotype.

Ng et al. 2010 (Nat Genet 42:30-35, PMID 19915526) was the equivalent demonstration for an autosomal recessive condition: exome sequencing of four affected individuals from three independent kindreds with Miller syndrome (post-axial acrofacial dysostosis, OMIM 263750) identified DHODH as the causative gene. The pair of papers established the operational recipe (sequence a few affected individuals, filter to rare variants in protein-coding regions, intersect the carrier sets across cases) that has since identified hundreds of new Mendelian-disease genes.
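
The filter-and-intersect recipe can be sketched in a few lines of Python. This is an illustration of the logic only, not the pipeline used in the papers: the Variant record, the 1% allele-frequency cut-off, and the gene names are all hypothetical.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    gene: str
    population_af: float      # allele frequency in a reference population
    protein_altering: bool    # e.g. missense, nonsense, splice, frameshift

def candidate_genes(cases, max_af: float = 0.01) -> set:
    """Genes carrying a rare protein-altering variant in every case."""
    per_case_genes = [
        {v.gene for v in variants
         if v.protein_altering and v.population_af <= max_af}
        for variants in cases
    ]
    if not per_case_genes:
        return set()
    return set.intersection(*per_case_genes)

# Hypothetical variant calls for three unrelated affected individuals.
cases = [
    [Variant("GENE_A", 0.0, True), Variant("GENE_B", 0.001, True),
     Variant("GENE_C", 0.20, True)],
    [Variant("GENE_A", 0.0001, True), Variant("GENE_C", 0.20, True),
     Variant("GENE_D", 0.0, False)],
    [Variant("GENE_A", 0.0, True), Variant("GENE_B", 0.002, True)],
]
print(candidate_genes(cases))
```

For a recessive condition the per-case filter would additionally require two qualifying alleles in the same gene (homozygous or compound heterozygous); that refinement is omitted to keep the sketch short.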

Two refinements to the exome-sequencing strategy proved important. First, the trio design (proband and both unaffected parents) is the most powerful for identifying de novo variants underlying severe sporadic phenotypes, including a substantial fraction of intellectual disability, autism spectrum disorder, and rare paediatric malformation syndromes. Second, the cohort design (sequence a hundred or more probands with a phenotypically homogeneous condition and look for recurrent gene-level signal) is the most powerful for identifying genes of incomplete penetrance or for resolving locus-heterogeneous conditions. Both designs require an accurate pedigree as input.
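
The core of the trio design is a simple genotype rule: a candidate de novo variant is present in the proband and absent from both parents. The sketch below applies that rule to genotypes coded as alternate-allele counts. The data layout and variant labels are hypothetical, and real pipelines add read-depth, allele-balance, and population-frequency filters that are not shown.

```python
def is_candidate_de_novo(proband: int, father: int, mother: int) -> bool:
    """True if the proband carries the alternate allele and neither parent does.

    Genotypes are alternate-allele counts: 0 (hom-ref), 1 (het), 2 (hom-alt).
    """
    return proband >= 1 and father == 0 and mother == 0

def candidate_de_novo_variants(trio_genotypes: dict) -> list:
    """Filter a variant -> {proband, father, mother} mapping."""
    return [
        variant for variant, gt in trio_genotypes.items()
        if is_candidate_de_novo(gt["proband"], gt["father"], gt["mother"])
    ]

# Hypothetical trio genotypes, for illustration only.
trio_genotypes = {
    "variant_1": {"proband": 1, "father": 0, "mother": 0},  # candidate de novo
    "variant_2": {"proband": 1, "father": 1, "mother": 0},  # inherited
    "variant_3": {"proband": 0, "father": 0, "mother": 1},  # not transmitted
}
print(candidate_de_novo_variants(trio_genotypes))
```

The rule only works if the trio relationships are correct, which is one concrete reason the pedigree has to be accurate before analysis starts.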

Whole-genome sequencing and the 100,000 Genomes Project

Whole-genome sequencing (WGS) extends the exome to non-coding regions and structural variants. WGS captures variants that exome sequencing misses, including non-coding regulatory variants (for example, the erythroid enhancer of BCL11A, relevant to haemoglobinopathies; UTR variants affecting RNA stability), structural variants (large deletions, duplications, inversions, complex rearrangements), and the repeat expansions that exome capture under-represents. WGS is the technology behind the major national rare-disease genomics programmes.

The 100,000 Genomes Project (100kGP), run by Genomics England and described by Caulfield et al. in the 2017 Genomics England protocol (NHS England 2017), sequenced approximately 100,000 whole genomes from NHS participants with rare disease and cancer between 2014 and 2018. The 100kGP was the basis for the subsequent NHS Genomic Medicine Service. The 100kGP and equivalent national programmes elsewhere (the All of Us Research Program in the United States, the Estonian Biobank, FinnGen, the UK Biobank exome and whole-genome resources) have produced reference cohorts at a scale that supports both the discovery of new disease-gene associations and the population-allele-frequency tabulation that underlies variant interpretation. Population allele frequencies are referenced at gnomAD; clinical variant curation is at ClinVar; gene-level review is at GeneReviews.

Matchmaker Exchange

The bottleneck in ultra-rare-disease gene discovery is finding the second affected family. Where a phenotype is sufficiently distinctive, two families with the same condition will lead to the gene in days; where the phenotype is non-specific, finding the second family without a federated infrastructure is impractical. Matchmaker Exchange (Philippakis et al. 2015, Hum Mutat 36:915-921, PMID 26295439) is the federated network that connects rare-disease databases (GeneMatcher, DECIPHER, MyGene2, PhenomeCentral, Matchbox, RD-Connect Genome-Phenome Analysis Platform) so that a researcher with a single proband and a candidate gene can query for matching cases worldwide.

Matchmaker Exchange has been responsible for a substantial fraction of new Mendelian-disease gene discoveries since 2015 — the model has shifted from individual-laboratory candidate-gene work to federated, phenotype-and-gene-level queries against a shared infrastructure. Matchmaker Exchange queries depend on structured phenotype representation (Human Phenotype Ontology, hpo.jax.org) and on the structured pedigree as input. Phenotype-aware exchange is covered at Phenopackets pedigree.

Where pedigrees fit in modern gene discovery

Pedigrees are the input to almost every modern gene-discovery analysis. Trio-design exome sequencing requires the proband and both parents structured as a trio. Cohort-design analyses across affected probands depend on accurate sibship and parent-offspring annotation for cosegregation filtering. Linkage analysis on extended pedigrees, still useful for large founder-effect families and for refining the candidate interval in difficult phenotypes, requires the full pedigree. Matchmaker Exchange queries carry pedigree context with the candidate-gene call. Each of these workflows is supported by a pedigree drawing tool with structured-data export; structured-data exchange formats include GEDCOM 5.5.1, PED for linkage workflows, and Phenopackets v2 for phenotype-aware exchange.
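
To make the structured-pedigree requirement concrete, the sketch below parses a small pedigree in the six-column PED layout commonly used by linkage and association software (family ID, individual ID, father ID, mother ID, sex, phenotype) and picks out complete trios. The coding assumed here is the usual convention: "0" for a missing parent, sex 1 = male and 2 = female, phenotype 1 = unaffected and 2 = affected. The example family and the function names are hypothetical, and individual tools differ in the details they accept.

```python
from typing import NamedTuple

class Person(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str   # "0" means no parent recorded (founder)
    mother_id: str
    sex: int         # assumed coding: 1 = male, 2 = female, 0 = unknown
    phenotype: int   # assumed coding: 1 = unaffected, 2 = affected

def parse_ped(text: str) -> list:
    """Parse whitespace-delimited PED rows, ignoring blanks and # comments."""
    people = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns: {line!r}")
        fam, ind, father, mother, sex, phenotype = fields[:6]
        people.append(Person(fam, ind, father, mother, int(sex), int(phenotype)))
    return people

def complete_trios(people: list) -> list:
    """Individuals whose father and mother both appear in the same family."""
    known = {(p.family_id, p.individual_id) for p in people}
    return [
        p for p in people
        if (p.family_id, p.father_id) in known
        and (p.family_id, p.mother_id) in known
    ]

ped_text = """
# family individual father mother sex phenotype
FAM1 father  0      0      1 1
FAM1 mother  0      0      2 1
FAM1 proband father mother 1 2
"""
people = parse_ped(ped_text)
print([p.individual_id for p in complete_trios(people)])
```

This example encodes an affected proband with two unaffected parents, the input shape the trio design expects.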

Evagene is an academic, research, and educational pedigree modelling platform, intended to support structured family-history documentation, teaching, and exploratory use of published risk models. It is not a medical device and is not intended to diagnose, prevent, monitor, predict, treat, or manage disease, determine eligibility for screening, testing, referral, or treatment, or replace professional clinical judgement. Outputs from any risk model implemented in Evagene are illustrative and for educational and research purposes only; the IBIS-style approximation of the Tyrer / Duffy / Cuzick 2004 algorithm in Evagene is not the official IBIS Breast Cancer Risk Evaluator binary; BOADICEA is licensed by the University of Cambridge and is not bundled in Evagene — Evagene exports a CanRisk 2.0 pedigree file for upload at canrisk.org.

Key references

Morton NE. 1955. Sequential tests for the detection of linkage. Am J Hum Genet 7:277-318. PMID 13258560.
Botstein D, White RL, Skolnick M, Davis RW. 1980. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314-331. PMID 6247908.
Riordan JR, Rommens JM, Kerem B, et al. 1989. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245:1066-1073. PMID 2475911.
Rommens JM, Iannuzzi MC, Kerem B, et al. 1989. Identification of the cystic fibrosis gene: chromosome walking and jumping. Science 245:1059-1065. PMID 2772657.
Kerem B, Rommens JM, Buchanan JA, et al. 1989. Identification of the cystic fibrosis gene: genetic analysis. Science 245:1073-1080. PMID 2570460.
The Huntington's Disease Collaborative Research Group. 1993. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell 72:971-983. PMID 8458085.
Risch N, Merikangas K. 1996. The future of genetic studies of complex human diseases. Science 273:1516-1517. PMID 8801636.
Klein RJ, Zeiss C, Chew EY, et al. 2005. Complement factor H polymorphism in age-related macular degeneration. Science 308:385-389. PMID 15761122.
Wellcome Trust Case Control Consortium. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661-678. PMID 17554300.
Ng SB, Turner EH, Robertson PD, et al. 2009. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461:272-276. PMID 19684571.
Ng SB, Buckingham KJ, Lee C, et al. 2010. Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet 42:30-35. PMID 19915526.
Philippakis AA, Azzariti DR, Beltran S, et al. 2015. The Matchmaker Exchange: a platform for rare disease gene discovery. Hum Mutat 36:915-921. PMID 26295439.
NHGRI-EBI GWAS Catalog. ebi.ac.uk/gwas.
gnomAD — Genome Aggregation Database. gnomad.broadinstitute.org.
Genomics England, 100,000 Genomes Project. genomicsengland.co.uk.
