Demography and Population Structure — F-statistics, PCA, Admixture

Short version. Population structure is quantified by Wright's F-statistics (Fst, Fis, Fit). Modern analyses summarise structure with principal component analysis (PCA) on genotype matrices and with model-based clustering (STRUCTURE, ADMIXTURE). Founder effects produce locally elevated frequencies of specific variants — the BRCA1 185delAG / 5382insC and BRCA2 6174delT founder variants in Ashkenazi Jewish populations are well-documented examples, as are the Finnish disease heritage and the French-Canadian founder variants. Bottlenecks, expansions, and admixture have left distinct signatures on the genome that are now routinely inferred from sequenced cohorts. Ancient DNA has rewritten the demographic history of Europe and other regions over the past fifteen years.

Wright's F-statistics

Sewall Wright introduced a hierarchical decomposition of allele-frequency variance in the 1940s and 1950s. The three F-statistics partition the deviation from random-mating expectations across nested levels of population structure:

Fis (inbreeding coefficient within subpopulations): the deficit of heterozygotes within a subpopulation, relative to the HWE expectation given that subpopulation's allele frequencies. Captures non-random mating within demes (consanguinity, assortative mating).
Fst (subpopulation differentiation): the proportion of total genetic variance that is between subpopulations. Fst close to 0 means the subpopulations have very similar allele frequencies; Fst close to 1 means they are nearly fixed for different alleles. Pairwise human population Fst is typically small: roughly 0.05 to 0.10 between continental groups, and smaller within continents.
Fit (total inbreeding): the deficit of heterozygotes in the total population, relative to a pooled HWE expectation. Combines the within-subpopulation and between-subpopulation contributions.

The relationship is (1 − Fit) = (1 − Fis)(1 − Fst). Empirical Fst estimators (Weir and Cockerham 1984; Hudson, Slatkin and Maddison 1992) differ in normalisation; published values must be read with the estimator in mind. Pairwise Fst between human populations sampled at the continental level typically falls between about 0.05 and 0.10: around 0.08 between Europe and East Asia and around 0.10 between Europe and West Africa, with somewhat higher values reported under some estimators and SNP sets. These small values are consistent with most genetic variation being within rather than between human populations, one of the most-cited empirical findings of human population genetics.
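The identity can be checked numerically. The sketch below is a minimal illustration, assuming a single biallelic locus, equally weighted subpopulations, and the simple heterozygosity-ratio definitions of the three statistics; it is not the Weir and Cockerham or Hudson estimator. The function name f_statistics and the toy frequencies are hypothetical.

```python
def f_statistics(subpops):
    """Fis, Fst and Fit for one biallelic locus.

    subpops is a list of (p, h_obs) pairs, one per subpopulation, where p is
    the frequency of one allele and h_obs the observed heterozygote frequency.
    Equal weighting of subpopulations is an assumption of this sketch.
    """
    n = len(subpops)
    h_i = sum(h for _, h in subpops) / n                 # mean observed heterozygosity
    h_s = sum(2 * p * (1 - p) for p, _ in subpops) / n   # mean HWE expectation within subpopulations
    p_bar = sum(p for p, _ in subpops) / n               # pooled allele frequency
    h_t = 2 * p_bar * (1 - p_bar)                        # HWE expectation in the pooled population
    fis = 1 - h_i / h_s
    fst = 1 - h_s / h_t
    fit = 1 - h_i / h_t
    return fis, fst, fit


# Toy data (illustrative only): two demes with p = 0.6 and p = 0.4,
# each showing slightly fewer heterozygotes (0.45) than HWE predicts (0.48).
fis, fst, fit = f_statistics([(0.6, 0.45), (0.4, 0.45)])
print(f"Fis={fis:.4f} Fst={fst:.4f} Fit={fit:.4f}")
print("identity holds:", abs((1 - fit) - (1 - fis) * (1 - fst)) < 1e-12)
```

With these toy inputs the between-deme term (Fst) and the within-deme term (Fis) are both small, and they combine multiplicatively rather than additively into Fit.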

Principal component analysis on genetic data

PCA on genotype matrices summarises structure without specifying a model. The standard formulation in human genetics is from Patterson, Price and Reich 2006 (PLoS Genet 2:e190), which provided a test of statistical significance for principal components (based on the Tracy-Widom distribution) and related the detectability of structure to Fst and sample size; the genealogical interpretation that connects empirical PCA to the underlying coalescent model was given by McVean 2009 (PLoS Genet 5:e1000686). The leading principal components of a genome-wide genotype matrix in a continental human sample track geographic and ancestral structure: in European samples PC1 typically aligns with a north-south or east-west axis; in continental-scale samples, PC1 separates major ancestry groups.

PCA is the routine first analysis on any new cohort. In genome-wide association studies, principal components are included as covariates to control for population stratification. The number of components retained is typically chosen by inspection of the eigenvalue scree plot or by the Tracy-Widom test; in well-mixed continental cohorts, two to ten components are usual.
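The core computation can be sketched on a toy matrix. The example below assumes genotypes coded 0/1/2 with no missing data, scales each SNP by the square root of p(1 − p) in the spirit of the Patterson et al. normalisation, and uses a pure-Python power iteration as a stand-in for the eigen-solvers in production tools. The function names and the six-individual dataset are hypothetical.

```python
import math


def normalise(genotypes):
    """Centre each SNP column and scale it by sqrt(p * (1 - p)), with p = mean / 2."""
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        p = mean / 2
        if p <= 0 or p >= 1:
            continue  # monomorphic SNPs carry no structure information
        sd = math.sqrt(p * (1 - p))
        columns.append([(g - mean) / sd for g in col])
    # Transpose back to individuals x SNPs.
    return [[c[i] for c in columns] for i in range(n)]


def leading_pc(genotypes, iterations=200):
    """Leading eigenvector of the individuals-by-individuals covariance matrix."""
    x = normalise(genotypes)
    n, m = len(x), len(x[0])
    cov = [[sum(a * b for a, b in zip(x[i], x[k])) / m for k in range(n)]
           for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start vector
    for _ in range(iterations):
        w = [sum(cov[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    return v


# Toy data (illustrative only): individuals 0-2 and 3-5 come from two groups
# with contrasting allele frequencies at five SNPs.
genotypes = [
    [2, 2, 0, 1, 2],
    [2, 1, 0, 2, 2],
    [2, 2, 1, 2, 1],
    [0, 0, 2, 1, 0],
    [0, 1, 2, 0, 0],
    [1, 0, 2, 0, 1],
]
pc1 = leading_pc(genotypes)
print([round(value, 3) for value in pc1])
```

On this toy input the PC1 coordinates of the first three individuals share one sign and those of the last three share the other, which is the two-group separation PC1 is expected to recover. The overall sign of a principal component is arbitrary.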

Model-based clustering: STRUCTURE and ADMIXTURE

Model-based clustering assigns each individual a vector of ancestry proportions over a specified number of ancestral components K. The original Bayesian implementation, STRUCTURE (Pritchard, Stephens and Donnelly 2000), used MCMC; the modern frequentist implementation, ADMIXTURE (Alexander, Novembre and Lange 2009, Genome Res 19:1655), uses block relaxation and is several orders of magnitude faster. Both require a pre-specified K and produce a Q matrix of ancestry proportions and a P matrix of allele frequencies in each component.
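The likelihood that these methods work with can be written down compactly. The sketch below assumes the binomial model in which each genotype is drawn with an individual-specific allele frequency equal to the Q-weighted average of the component frequencies in P, and it only evaluates the log-likelihood (up to a constant) for given Q and P; it does not perform the optimisation or the MCMC. The function name, the matrix layout (P as components by SNPs) and the toy values are assumptions made for illustration.

```python
import math


def admixture_loglik(genotypes, q, p):
    """Log-likelihood (up to an additive constant) of 0/1/2 genotypes.

    genotypes: individuals x SNPs, coded as counts of one allele
    q: individuals x K ancestry proportions (rows sum to 1)
    p: K x SNPs allele frequencies in each ancestral component
    """
    k_components = len(p)
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            # Individual-specific allele frequency: mixture of component frequencies.
            f = sum(q[i][k] * p[k][j] for k in range(k_components))
            total += g * math.log(f) + (2 - g) * math.log(1 - f)
    return total


# Toy data (illustrative only): two components with contrasting frequencies.
genotypes = [[2, 0], [0, 2], [1, 1]]
p = [[0.9, 0.1], [0.1, 0.9]]
q_good = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q_swapped = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

ll_good = admixture_loglik(genotypes, q_good, p)
ll_swapped = admixture_loglik(genotypes, q_swapped, p)
print(ll_good > ll_swapped)
```

Assigning the first two individuals to the component whose frequencies match their genotypes gives the higher log-likelihood, which is the quantity the fitting procedure climbs.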

The interpretation caveat is sharp. ADMIXTURE / STRUCTURE outputs depend on the sample composition and the chosen K; the components are not "true" ancestral populations but statistical summaries of allele-frequency variation in the dataset. The components from one analysis are not directly comparable to those from another with different sampling. Cross-validation error or marginal likelihood is used to choose K, but neither is a definitive criterion. In practice, multiple values of K are reported and interpreted jointly with PCA and with f-statistics (f3, f4, D-statistics; see Patterson et al. 2012, Genetics 192:1065).
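As a concrete instance of the f-statistics mentioned above, the f3 statistic for a target population C and two candidate sources A and B is the average over SNPs of (c − a)(c − b), where a, b and c are allele frequencies; a clearly negative value is evidence that C is admixed between populations related to A and B. The sketch below is a naive version that works directly on population allele frequencies and omits the finite-sample bias correction and block-jackknife standard errors that real analyses require. The function name and toy frequencies are hypothetical.

```python
def f3(target, source1, source2):
    """Naive f3(target; source1, source2) from per-SNP allele frequencies."""
    products = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)


# Toy frequencies (illustrative only). pop_c is an exact 50/50 mixture of pop_a and pop_b.
pop_a = [0.1, 0.9, 0.2]
pop_b = [0.9, 0.1, 0.8]
pop_c = [0.5 * a + 0.5 * b for a, b in zip(pop_a, pop_b)]

f3_admixed = f3(pop_c, pop_a, pop_b)   # target is the mixture: negative
f3_source = f3(pop_a, pop_b, pop_c)    # target is an unadmixed source: positive
print(f3_admixed < 0, f3_source > 0)
```

Unlike the clustering output, the sign of f3 does not depend on a choice of K, which is one reason it is read alongside ADMIXTURE and PCA results.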

Founder effects

A founder effect occurs when a small subset of a parent population establishes a new population, which then expands. The new population carries a sample of the parent population's allele frequencies, with stochastic deviations: rare alleles in the parent that happen to be present in the founders can rise to substantial frequencies in the descendant population. Founder effects produce locally elevated frequencies of specific variants and locally extended haplotypes around them, both of which can be detected against a non-founder reference.
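The sampling step can be made concrete with a short calculation. The sketch below assumes founders are a random draw of chromosomes from the parent population and ignores subsequent drift during the expansion; the function name and the numbers (an allele at 1 in 2,000 and 100 founders) are hypothetical.

```python
def founder_sampling(parent_freq, n_founders):
    """Probability a rare allele is carried by the founders, and its frequency if so.

    Returns (prob_present, min_freq_if_present, enrichment), where
    min_freq_if_present is the founder-pool frequency when exactly one copy
    is sampled and enrichment is that frequency divided by parent_freq.
    """
    chromosomes = 2 * n_founders
    prob_present = 1 - (1 - parent_freq) ** chromosomes
    min_freq_if_present = 1 / chromosomes
    return prob_present, min_freq_if_present, min_freq_if_present / parent_freq


prob_present, founder_freq, enrichment = founder_sampling(0.0005, 100)
print(f"P(present)={prob_present:.3f} freq={founder_freq:.4f} enrichment={enrichment:.1f}x")
```

Under these assumptions most rare alleles are lost at the founding event, but any that do get through start at a frequency several-fold higher than in the parent population, which is the pattern described above.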

Ashkenazi Jewish founder variants in BRCA1 and BRCA2

Roa et al. 1996 (Nat Genet 14:185) documented three founder variants — BRCA1 185delAG, BRCA1 5382insC, and BRCA2 6174delT — segregating at substantially elevated frequency in the Ashkenazi Jewish population. The carrier frequency of any of the three is approximately 1 in 40, compared with much lower frequencies in non-Ashkenazi populations. These founder variants account for the bulk of BRCA1 and BRCA2 variants documented in Ashkenazi families. The pattern is the canonical example of a founder effect in human disease genetics: three specific variants, each on a long shared haplotype consistent with descent from a small number of founder chromosomes a few hundred to a thousand years ago.

The clinical significance of the founder structure is substantial: the prior probability that an Ashkenazi-ancestry individual carries one of the three variants differs from the prior in non-founder populations by approximately an order of magnitude, and this is reflected in published guidance and in the priors used by risk-model algorithms. Evagene's documentation pages on hereditary cancer risk assessment and the BayesMendel BRCAPRO implementation surface the founder-prior literature.

Finnish heritage diseases

The Finnish population has been comprehensively characterised as a founder population, with a documented set of approximately forty rare, mostly autosomal recessive conditions that occur at elevated frequency in Finland and are rare elsewhere: the "Finnish disease heritage" (the canonical reviews are Norio 2003 and the OMIM entries cataloguing the conditions). The conditions include congenital nephrotic syndrome of the Finnish type (NPHS1), Salla disease (SLC17A5), aspartylglucosaminuria (AGA), and Mulibrey nanism (TRIM37), among others. The founder structure reflects the demographic history of Finland: a small founding population that experienced population bottlenecks and limited gene flow with neighbours over centuries, expanding to the contemporary census of approximately 5.6 million.

French-Canadian founders

The Québec French-Canadian population descends from approximately 8,500 founders who emigrated from France in the seventeenth and eighteenth centuries, with subsequent expansion. A documented set of conditions occurs at elevated frequency: tyrosinaemia type I (FAH, particularly in the Saguenay-Lac-Saint-Jean region), autosomal recessive spastic ataxia of Charlevoix-Saguenay (ARSACS; SACS), oculopharyngeal muscular dystrophy (PABPN1), and others. The pattern is comparable to the Finnish heritage, with a different founder set and a different list of variants.

Other documented founder structures include the Old Order Amish in Pennsylvania (Ellis-van Creveld syndrome, glutaric aciduria type I), the Hutterite Brethren in North America, and several geographically isolated populations in the Mediterranean and East Africa. The Icelandic population is sometimes described as a founder population in this sense, but its larger founder size and longer expansion give a structure intermediate between classical small founders and panmictic continental populations.

Bottlenecks and expansions

A population bottleneck is a sharp reduction in population size, often followed by recovery. Bottlenecks reduce genetic diversity (lower nucleotide diversity), increase linkage disequilibrium (longer haplotype blocks), and shift the site-frequency spectrum (a deficit of rare variants relative to common variants under a strong, recent bottleneck). The Out-of-Africa bottleneck around 60,000 to 70,000 years ago is the most-studied example in human demographic history: non-African populations carry a reduced subset of the diversity found in African populations, with correspondingly shorter average coalescence times.

Population expansions show the opposite signature: an excess of rare variants relative to a constant-size expectation. Most contemporary human populations have expanded substantially in the past 10,000 years (the agricultural transition) and more dramatically in the past 200 years (the demographic transition); the rare-variant excess from these recent expansions is one of the most prominent features of contemporary site-frequency spectra and a major reason why each individual in a large human cohort carries so many rare variants.
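The constant-size expectation referred to here has a simple form: under the standard neutral model, the expected number of sites at which the derived allele appears i times in a sample of n chromosomes is proportional to 1/i. The sketch below tabulates an observed spectrum and compares its singleton share with that expectation. The function names and the toy counts are hypothetical, and an excess of singletons on its own is only suggestive, since purifying selection and genotyping error produce a similar skew.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Number of sites with derived-allele count 1 .. n_chromosomes - 1."""
    counts = Counter(derived_counts)
    return [counts.get(i, 0) for i in range(1, n_chromosomes)]


def neutral_expected_proportions(n_chromosomes):
    """Expected share of segregating sites in each class under constant size (1/i)."""
    weights = [1 / i for i in range(1, n_chromosomes)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy data (illustrative only): derived-allele counts at ten segregating
# sites in a sample of five chromosomes.
derived_counts = [1, 1, 1, 1, 1, 1, 2, 2, 3, 4]
sfs = site_frequency_spectrum(derived_counts, 5)
expected = neutral_expected_proportions(5)

observed_singletons = sfs[0] / sum(sfs)
print(sfs, round(observed_singletons, 2), round(expected[0], 2))
```

Here 60 per cent of sites are singletons against a constant-size expectation of 48 per cent, the direction of departure expected after an expansion; a bottleneck would push the singleton share below the expectation instead.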

Inferring demographic history from sequenced genomes

Modern demographic inference applies the coalescent to whole-genome sequences. The pairwise sequentially Markovian coalescent (PSMC), introduced by Li and Durbin 2011, infers the trajectory of effective population size over time from a single diploid genome by analysing the density of heterozygous sites along it. The multiple sequentially Markovian coalescent (MSMC), introduced by Schiffels and Durbin 2014 (Nat Genet 46:919), and its successor MSMC2 extend the approach to multiple genomes, recovering population-size trajectories and population-split times.

Applied to human genomes, PSMC and MSMC2 recover the Out-of-Africa bottleneck, the divergence of contemporary continental populations, and the more recent population-size dynamics within continents. The methods agree broadly with parallel inferences from site-frequency spectra under explicit demographic models (∂a∂i, fastsimcoal, momi2). The pre-history of the past several hundred thousand years is now reconstructed in considerable detail from genome-scale data.

Admixture

Admixture is gene flow between previously separated populations. It is detected by ancestry-deconvolution methods (RFMix, Tractor) and quantified by f-statistics, ALDER, GLOBETROTTER, and similar tools. The expected length of admixture-derived ancestry tracts in a contemporary individual decreases over time after the admixture event (as recombination breaks them up); the time of admixture is therefore inferable from the distribution of tract lengths.
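The tract-length argument can be turned into rough arithmetic. The sketch below assumes a single pulse of admixture, random mating afterwards, and the common approximation that tracts from a source contributing proportion m are exponentially distributed with mean 1 / ((1 − m) × T) Morgans after T generations. The function names and the example values are hypothetical, and real dating methods fit the full distribution and allow for more complex histories.

```python
def expected_tract_length_cm(generations, proportion):
    """Mean tract length in centimorgans for ancestry contributing `proportion`."""
    return 100.0 / ((1.0 - proportion) * generations)


def generations_from_tract_length(mean_length_cm, proportion):
    """Invert the same approximation to date a single admixture pulse."""
    return 100.0 / ((1.0 - proportion) * mean_length_cm)


# Illustrative values: 20 per cent minority ancestry, admixture 10 generations ago.
mean_cm = expected_tract_length_cm(generations=10, proportion=0.2)
inferred_generations = generations_from_tract_length(mean_cm, proportion=0.2)
print(mean_cm, inferred_generations)
```

With these inputs the minority-ancestry tracts average 12.5 cM; doubling the time since admixture halves the expected tract length, which is why older events are harder to date from tracts alone.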

Worked examples: African Americans have a population-mean ancestry of approximately 75 to 85 per cent African and 15 to 25 per cent European, with substantial inter-individual variation; the admixture occurred predominantly over a window of roughly 200 to 400 years (about 7 to 15 generations) consistent with the historical record. Latino populations in the Americas show three-way admixture of European, Indigenous American, and African ancestry, in proportions that vary by country and by region within country — Mexican-American mean ancestry is typically dominated by Indigenous American and European, Caribbean Latino populations carry substantially higher African ancestry, and South American populations show further regional variation. The ALDER and GLOBETROTTER methods date the admixture events to consistent post-1492 windows. In all of these cases, ancestry-aware analyses (PRS portability, GWAS controls) are now standard.

Ancient DNA

Sequencing of DNA from ancient human remains has transformed the study of demographic history. The methodological breakthroughs of the past fifteen years (library preparation from highly degraded DNA, capture enrichment, and contamination control) have produced thousands of ancient human genomes covering Eurasia, the Americas, and to a lesser extent Africa. The Pääbo and Reich groups have led the field; the synthesis in Reich 2018 (Who We Are and How We Got Here) covers the European story in particular detail.

Key findings: contemporary Europeans descend from at least three major ancestral components — Western Hunter-Gatherers, Anatolian Neolithic Farmers (associated with the spread of agriculture from the Near East), and Steppe pastoralists (associated with the third-millennium-BC Yamnaya expansion) — in proportions that vary by region. Neanderthal and Denisovan introgression into anatomically modern humans is detectable in contemporary non-African genomes (1 to 2 per cent Neanderthal across non-Africans, additional Denisovan ancestry in Oceanian populations). The peopling of the Americas, the prehistory of the Eurasian Steppe, and the demographic dynamics of South Asia, East Asia, and parts of Africa have all been substantially revised against the ancient-DNA record.

Reference cohorts and the diversity gap

Modern population genetics depends on reference cohorts. The 1000 Genomes Project (Auton et al. 2015, Nature 526:68) provided sequenced genomes of approximately 2,500 individuals from 26 populations, and remains a foundational reference. The Genome Aggregation Database (gnomAD), aggregating exomes and genomes from research cohorts, now contains over 800,000 individuals across multiple ancestry groups; constraint metrics and variant-frequency catalogues drawn from gnomAD are used routinely in clinical-genetics interpretation.

The UK Biobank has approximately 500,000 individuals, predominantly of European ancestry, with deep phenotyping. It is the largest single resource for European-ancestry genome-phenome work, but the under-representation of non-European populations in UK Biobank and equivalent biobanks has driven the diversity-of-cohorts conversation in human genetics. H3Africa (Human Heredity and Health in Africa) was established in part to address that gap, building African-led genomic resources across multiple African countries. Comparable diversification efforts include All of Us (United States) and a number of country-specific biobanks. The motivation is both scientific (population-specific variants and demographic histories are not represented in European-ancestry-dominated cohorts) and ethical (research benefit must reach the populations that contribute samples).

Why structure matters for pedigree work

Population structure intersects pedigree-based work at the level of priors. The carrier frequency of a recessive disease allele in the population a partner is sampled from sets the prior probability of carrier status when family history is unrevealing; founder structure changes that prior substantially in known founder populations. The penetrance of a dominant variant estimated from one population may not transfer cleanly to another. Polygenic risk scores estimated in European-ancestry GWAS perform less well in non-European populations, with the magnitude of the loss correlated with genetic distance from the discovery cohort. None of these are reasons not to do family-history work in diverse populations; they are reasons to be explicit about the population a calculation is parameterised against and to choose appropriate inputs. Evagene's implementations of published risk-model algorithms surface the population-specific parameters where they exist, and the user-facing pages emphasise the ancestry context.
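The effect of the population prior can be seen in a small worked calculation for an autosomal recessive condition. The sketch below assumes Hardy-Weinberg proportions, no informative family history, and full penetrance; the function names and the two incidence figures are hypothetical and are not taken from any specific condition or risk model.

```python
import math


def carrier_frequency(incidence):
    """Carrier frequency 2q(1 - q) implied by disease incidence q**2 under HWE."""
    q = math.sqrt(incidence)
    return 2 * q * (1 - q)


def affected_child_risk(carrier_prior_a, carrier_prior_b):
    """Risk that a couple has an affected child: both carriers, then 1 in 4."""
    return carrier_prior_a * carrier_prior_b * 0.25


# Illustrative incidences: 1 in 10,000 in a reference population,
# 1 in 2,500 in a hypothetical founder population.
reference_carrier = carrier_frequency(1 / 10_000)
founder_carrier = carrier_frequency(1 / 2_500)

risk_reference = affected_child_risk(reference_carrier, reference_carrier)
risk_mixed = affected_child_risk(founder_carrier, reference_carrier)
print(round(reference_carrier, 4), round(founder_carrier, 4))
print(risk_mixed / risk_reference)
```

Changing the population one partner is sampled from roughly doubles the prior risk in this example, without any change in the pedigree itself, which is why the parameterising population needs to be stated explicitly.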

Frequently asked questions

What does Fst measure?

Fst is the proportion of total genetic variance attributable to between-subpopulation differences. Fst close to 0 means subpopulations have similar allele frequencies; Fst close to 1 means they are nearly fixed for different alleles. Pairwise Fst between continental human populations is typically around 0.05 to 0.10.

What is a founder effect?

An elevated frequency of specific variants in a descendant population, produced when a small subset of a parent population establishes a new population that then expands. The Ashkenazi Jewish founder variants in BRCA1 and BRCA2 (Roa et al. 1996) and the Finnish disease heritage are well-documented examples.

Why is PCA used as the first analysis on a new cohort?

PCA summarises population structure in the genotype matrix without specifying a model, runs quickly on genome-scale data, and produces principal components that are routinely included as covariates in genome-wide association studies to control for stratification.

What does PSMC infer?

The trajectory of effective population size over time, inferred from the distribution of heterozygous sites along a single diploid genome. PSMC and its multi-genome successors (MSMC, MSMC2) recover demographic-history features including the Out-of-Africa bottleneck and population-split times.

Why are non-European cohorts under-represented in human-genetics research?

Historically, large biobanks (UK Biobank, FinnGen, deCODE) recruited primarily within European-ancestry populations, and discovery genome-wide association studies have followed. Diversification efforts (H3Africa, All of Us, country-specific biobanks) are now expanding the reference, but the imbalance has consequences for the portability of polygenic risk scores and for the equity of genomic-research benefit.
