Genome structure and variation: repeats, copy number variation, and structural variants
The draft human genome sequence published in 2001 reframed our understanding of what most of the genome is: only ~1–2% of the sequence encodes protein, and at least ~50% is repetitive, much of it derived from transposable elements. The decade that followed reframed our understanding again, by showing that human genomes differ structurally on a scale much larger than single nucleotides: deletions, duplications, insertions, inversions, and complex rearrangements collectively affect megabases of sequence between any two individuals. This page surveys the repetitive and structurally variable genome, the mechanisms that produce copy-number variation, the catalogues that index it, and the rise of long-read sequencing for structural-variant detection.
Short version. At least ~50% of the human genome is repetitive: LINEs, SINEs (including ~1.1 million Alu elements), LTR-containing endogenous retroviruses, DNA transposons, and simple sequence repeats. Copy-number variants (CNVs) and other structural variants >50 bp arise principally through non-allelic homologous recombination between segmental duplications, end-joining, and replication-based mechanisms (FoSTeS / MMBIR). Short-read sequencing recovers ~5,000–10,000 SVs per genome; long-read platforms recover 20,000–25,000. The reference catalogue of population SV frequencies is gnomAD-SV.
The repetitive fraction of the human genome
The 2001 draft of the human genome (Lander et al. 2001) reported that ~45% of the assembly was identifiably repetitive. Improved annotation methods, deeper search libraries, and the T2T-CHM13 telomere-to-telomere assembly have since pushed the figure to at least 50% and arguably closer to two-thirds, with the additional fraction concentrated in centromeric satellite arrays, segmental duplications, and recently mobilised transposable elements that earlier short-read assemblies could not resolve. The composition by class:
- LINEs (long interspersed nuclear elements) — ~21% of the genome. The dominant active family is L1 (LINE-1), an autonomous retrotransposon ~6 kb long carrying two open reading frames (ORF1 RNA-binding chaperone, ORF2 endonuclease and reverse transcriptase). L1 is the only currently active autonomous retrotransposon in the human genome. ~80–100 L1 copies are still mobilisation-competent in any individual genome.
- SINEs (short interspersed nuclear elements) — ~13% of the genome. Non-autonomous retrotransposons of ~100–300 bp that mobilise in trans using L1 machinery. The dominant primate SINE is Alu, a ~300 bp element derived from the 7SL signal-recognition-particle RNA. There are ~1.1 million Alu copies in the human genome.
- LTR-containing endogenous retroviruses (ERVs) — ~8% of the genome. Remnants of past germline retroviral integrations; most have lost the capacity to mobilise but a subset retain individual functional ORFs (e.g. syncytin, derived from an HERV-W envelope, with placental fusogenic function).
- DNA transposons — ~3% of the genome. All extinct in the human lineage (the last activity was tens of millions of years ago) but a major source of co-opted regulatory elements and exonic sequence.
- Simple sequence repeats and segmental duplications — ~3–5% of the genome (microsatellites, minisatellites, satellite arrays at centromeres and acrocentric short arms, and segmental duplications >1 kb at >90% identity).
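As a quick arithmetic check, the class-level percentages listed above can be summed to see how they relate to the "at least ~50%" headline figure. This is a minimal sketch: the dictionary name and the function are illustrative, and the figures are the approximate values quoted in the list, not a fresh annotation.

```python
# Approximate genome fractions (percent) for each repeat class, as quoted in
# the list above. The last class is given as a range, so (low, high) bounds
# are carried for every class.
REPEAT_CLASSES = {
    "LINEs": (21.0, 21.0),
    "SINEs": (13.0, 13.0),
    "LTR-containing ERVs": (8.0, 8.0),
    "DNA transposons": (3.0, 3.0),
    "SSRs and segmental duplications": (3.0, 5.0),
}


def total_repeat_fraction(classes):
    """Return (low, high) bounds on the summed percentage of all classes."""
    low = sum(lo for lo, _ in classes.values())
    high = sum(hi for _, hi in classes.values())
    return low, high


low, high = total_repeat_fraction(REPEAT_CLASSES)
print(f"Listed classes sum to roughly {low:.0f}-{high:.0f}% of the genome")
```

The sum lands just under or at 50%, consistent with the text: the higher estimates come from sequence (centromeric satellites and similar) that the class-level figures above do not fully capture.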
Transposable element biology is reviewed by Cordaux and Batzer 2009; Lander 2011 provides a post-Human-Genome-Project synthesis of the genome's structure.
Transposable element biology and retrotransposition
Retrotransposition proceeds through an RNA intermediate. For L1: the L1 RNA is transcribed by RNA polymerase II, exported, and translated; ORF1 and ORF2 proteins associate with their own encoding RNA in cis to form a ribonucleoprotein particle that re-enters the nucleus. ORF2 endonuclease cleaves the genomic target at a TTAAAA-like consensus, and reverse transcription proceeds directly from the nicked 3' end (target-primed reverse transcription). The result is a new L1 insertion flanked by short target-site duplications, frequently 5'-truncated, and often carrying a 3' poly-A tail.
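The sequence signature that target-primed reverse transcription leaves behind can be illustrated with a toy string model. This is a sketch under simplifying assumptions: strand orientation and the staggered second-strand nick are ignored, and the function name, the default target-site duplication length, and the poly-A length are illustrative choices, not measured values.

```python
def insert_with_tsd(genome, pos, element, tsd_len=15, truncate_5p=0, polya_len=20):
    """Toy model of a retrotransposon insertion.

    The target site genome[pos:pos + tsd_len] ends up on both sides of the
    new copy (the target-site duplication). The element may be 5'-truncated
    and gains a 3' poly-A tail.
    """
    if pos < 0 or pos + tsd_len > len(genome):
        raise ValueError("target site must lie within the genome")
    tsd = genome[pos:pos + tsd_len]
    payload = element[truncate_5p:] + "A" * polya_len
    # The left flank ends with the target site; the right flank starts with it again.
    new_genome = genome[:pos + tsd_len] + payload + genome[pos:]
    return new_genome, tsd


new_genome, tsd = insert_with_tsd(
    "CCGGTTAAAAGCGT", pos=4, element="GGGCCC",
    tsd_len=6, truncate_5p=2, polya_len=3,
)
print(tsd, new_genome)
```

In the result, the inserted (truncated) element plus its poly-A tail sits between two identical copies of the target site, which is the pattern used to recognise retrotransposition events in sequence data.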
Alu and SVA elements piggy-back on the L1 machinery in trans. Alu activity is sustained by a small number of recently emerged source elements (AluY and the AluYa5 / AluYb8 lineages); polymorphic Alu insertions still segregate in modern human populations. De novo Alu and L1 insertions arise at estimated rates of approximately 1 per 20 births and 1 per 100 births respectively, and a fraction of these can disrupt gene function and cause Mendelian disease (e.g. an L1 insertion into the F8 gene was the first reported instance of human disease caused by retrotransposition). Transposable elements have also extensively reshaped gene regulation: the dominant view since ENCODE and a series of follow-up studies is that a substantial proportion of mammalian transcription-factor binding sites and tissue-specific enhancers are TE-derived.
Copy number variation
Copy-number variation (CNV) refers to deletions or duplications of genomic regions >50 bp (the conventional lower size cutoff for "structural variant"; CNVs in the older array-based literature are often >1 kb). Two parallel papers in 2004 (Sebat et al.; Iafrate et al.) showed that ostensibly normal human genomes carry hundreds of large copy-number polymorphisms relative to the reference. Conrad et al. 2010 used high-resolution array CGH to map ~11,700 CNVs and genotyped a subset of them in 450 individuals. Subsequent whole-genome-sequencing surveys (Sudmant et al. 2015) refined the picture: typical individuals carry ~2,500–3,000 CNVs collectively spanning ~20 megabases, with a long tail of rarer variants.
Mechanisms of copy-number variation
Three principal mutational mechanisms generate CNVs.
Non-allelic homologous recombination (NAHR). When flanking segmental duplications or repeat elements with high sequence identity misalign and recombine with one another (between homologous chromosomes or between sister chromatids), the unequal crossover generates reciprocal duplications and deletions; recombination between two directly oriented copies on the same chromatid yields a deletion only. NAHR-mediated CNVs are recurrent: the breakpoints cluster within the flanking homologous repeats, and unrelated affected individuals carry near-identical rearrangements. Most of the canonical microdeletion / microduplication syndromes (DiGeorge / 22q11.2 deletion, Williams-Beuren / 7q11.23 deletion, Smith-Magenis / 17p11.2 deletion, Charcot-Marie-Tooth / 17p12 duplication) arise by NAHR between flanking segmental duplications.
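The reciprocal outcome of NAHR can be sketched with chromosomes modelled as lists of labelled segments. This is a toy illustration, not a simulation of real recombination: the segment labels, the function name, and the choice to cross over between the first repeat copy on one chromosome and the second copy on the other are all assumptions made for the example.

```python
def nahr_products(chrom_a, chrom_b, repeat="SD"):
    """Toy unequal crossover between misaligned repeat copies.

    The first repeat copy on chrom_a pairs with the second copy on chrom_b.
    Returns the two reciprocal products: one lacking the intervening
    segment (deletion) and one carrying it twice (duplication).
    """
    copies_a = [i for i, seg in enumerate(chrom_a) if seg == repeat]
    copies_b = [i for i, seg in enumerate(chrom_b) if seg == repeat]
    if len(copies_a) < 2 or len(copies_b) < 2:
        raise ValueError("need two repeat copies on each chromosome")
    i = copies_a[0]  # proximal copy on chromosome A
    j = copies_b[1]  # distal copy on chromosome B
    deletion = chrom_a[:i + 1] + chrom_b[j + 1:]
    duplication = chrom_b[:j + 1] + chrom_a[i + 1:]
    return deletion, duplication


chromosome = ["A", "SD", "B", "SD", "C"]
deletion, duplication = nahr_products(chromosome, chromosome)
print(deletion)
print(duplication)
```

Because the crossover always falls inside the repeat copies, every event of this kind removes or duplicates the same intervening segment (here `B`), which is why NAHR-mediated rearrangements recur with near-identical breakpoints in unrelated individuals.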
Non-homologous end joining (NHEJ) and microhomology-mediated end joining (MMEJ). When a double-strand break is repaired by joining unrelated sequences, the result is a non-recurrent CNV with breakpoints distributed effectively at random across the genome. Microhomology of 1–15 bp is often found at the junction.
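Junction microhomology can be measured directly from the reference sequence once a deletion's coordinates are known. The sketch below assumes a simple deletion described by half-open reference coordinates; the function name and the toy sequences are hypothetical. It counts how far the breakpoint could slide left or right without changing the resulting sequence, which equals the length of sequence shared by the two breakpoint flanks.

```python
def junction_microhomology(ref, del_start, del_end):
    """Microhomology length (bp) for a deletion removing ref[del_start:del_end]."""
    if not (0 <= del_start < del_end <= len(ref)):
        raise ValueError("need 0 <= del_start < del_end <= len(ref)")
    span = del_end - del_start
    # Slide right: bases at the start of the deleted segment that match
    # the bases immediately after it.
    fwd = 0
    while (fwd < span and del_end + fwd < len(ref)
           and ref[del_start + fwd] == ref[del_end + fwd]):
        fwd += 1
    # Slide left: bases just before the deletion that match the bases at
    # the end of the deleted segment.
    back = 0
    while (back < span and del_start - back - 1 >= 0
           and ref[del_start - back - 1] == ref[del_end - back - 1]):
        back += 1
    return fwd + back


# Toy reference: GGG ACG TTTTT ACG CCC. Deleting one ACG copy plus the
# spacer leaves GGGACGCCC, with 3 bp (ACG) shared at the junction.
toy_ref = "GGGACGTTTTTACGCCC"
print(junction_microhomology(toy_ref, 3, 11))
```

The same deletion can be written with several equivalent coordinate pairs, and the function returns the same value for each of them. In tandem repeats the count can exceed what would usually be called microhomology, so real pipelines tend to treat those regions separately.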
Replication-based mechanisms: FoSTeS and MMBIR. Fork stalling and template switching (FoSTeS) was proposed by Lee et al. 2007 to account for non-recurrent rearrangements with templated insertions and complex breakpoint structure that NHEJ cannot explain. Microhomology-mediated break-induced replication (MMBIR) is a related model in which a collapsed fork is restarted at a heterologous template using short microhomology. Both mechanisms generate complex CNVs with templated segments, sometimes with multiple breakpoints and inversions in the same event. They are now recognised as a major source of non-recurrent CNVs.
Microdeletion and microduplication syndromes
The genomic-disorder concept (Lupski 1998) refers to clinical phenotypes caused by recurrent CNVs at loci flanked by segmental duplications. Established examples (each with detailed entries in OMIM, GeneReviews, and ClinVar) include 22q11.2 deletion (DiGeorge / velocardiofacial), 7q11.23 deletion (Williams-Beuren), 17p11.2 deletion (Smith-Magenis) and reciprocal duplication (Potocki-Lupski), 16p11.2 deletion / duplication (autism / developmental phenotypes), 1q21.1 deletion / duplication, 15q11.2-q13 deletion (Prader-Willi / Angelman, with parent-of-origin imprinting effects), 17q12 deletion (renal cysts and diabetes), and 17p12 duplication (Charcot-Marie-Tooth 1A) / deletion (HNPP).
The Database of Genomic Variants (DGV) was the first widely used reference for population-level CNV frequency from microarray studies. ClinGen Dosage Sensitivity, ClinVar, DECIPHER, and OMIM are the standard references for clinically annotated CNVs and their associated phenotypes; gnomAD-SV (below) provides allele-frequency estimates for the general population from short-read sequencing.
Structural variant detection from short reads
Short-read whole-genome sequencing (typically 150 bp paired-end Illumina at 30× coverage) detects structural variants through four complementary signal types:
- Read-depth. Deletions reduce, and duplications increase, the local read coverage relative to the genome-wide median. Read-depth resolves CNVs reliably down to a few kb but loses sensitivity at smaller sizes and offers no breakpoint precision.
- Read-pair (discordant pairs). Paired reads whose insert size, orientation, or chromosomal placement deviates from expectation indicate structural variation: large insert sizes for deletions, small for insertions, inverted orientations for inversions, and trans-chromosomal pairs for translocations.
- Split-read. A read that aligns partly to one location and partly to another places a structural-variant breakpoint at single-base precision, and is essential for resolving NAHR breakpoints within segmental duplications and small (<1 kb) variants.
- Assembly-based. Local de novo assembly of unmapped reads or reads in a candidate region produces a contig that can be aligned back to the reference to recover insertion sequences and complex junctions invisible to alignment-only methods.
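Two of the signals in the list above, read depth and read-pair discordance, are simple enough to sketch in a few lines. The code below is a minimal illustration rather than how any particular caller works: the function names, the window depths, the library insert-size parameters (mean 400 bp, standard deviation 50 bp), and the three-standard-deviation cutoff are all assumptions chosen for the example.

```python
from statistics import median


def copy_number_from_depth(window_depths, ploidy=2):
    """Estimate integer copy number per window, relative to the median depth."""
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return [round(ploidy * depth / baseline) for depth in window_depths]


def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  mean_insert=400, sd_insert=50, n_sd=3):
    """Classify one read pair by placement, orientation, and apparent insert size."""
    if chrom1 != chrom2:
        return "translocation"
    if pos2 < pos1:  # order the two reads by position
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    if strand1 == strand2:
        return "inversion"
    if (strand1, strand2) != ("+", "-"):
        # Outward-facing pairs are not covered by the list above.
        return "other_discordant"
    span = pos2 - pos1
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"


depths = [30, 31, 29, 15, 16, 30, 45, 44, 30]
print(copy_number_from_depth(depths))
print(classify_pair("chr1", 1000, "+", "chr1", 5000, "-"))
```

A single discordant pair or one low-coverage window is weak evidence on its own; in practice, support is accumulated across many reads and windows before a call is made.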
Modern SV callers (Manta, Delly, GRIDSS, LUMPY) integrate two or more of these signals; population-scale callers add genotype-likelihood and joint-calling layers. Even with integration, short-read SV calling has fundamental blind spots in repetitive regions, large insertions exceeding the insert size, and complex multi-breakpoint events.
Long-read sequencing and the SV revolution
Long-read platforms — Pacific Biosciences (PacBio HiFi, ~20 kb reads at >99% accuracy via circular consensus sequencing) and Oxford Nanopore Technologies (ONT, ~10–100 kb reads, with R10.4.1 chemistries reaching modal accuracies >99%) — transformed structural-variant detection. Chaisson et al. 2019 reported a multi-platform comparison showing that long reads detect 20,000–25,000 SVs per genome compared to ~5,000–10,000 by short-read methods, with the gain concentrated in insertion-class SVs (which short reads chronically miss), SVs in repetitive regions, and complex breakpoint reconstructions.
Long reads also enable haplotype-resolved de novo assembly of human genomes. Tools such as hifiasm and HiCanu produce phased contigs at chromosome scale, and the resulting assemblies expose structural variation that no short-read alignment to a single reference can recover. The Human Pangenome Reference Consortium (HPRC) has now released phased diploid assemblies for several dozen individuals; the implication, increasingly accepted, is that no single linear reference can adequately represent the diversity of human structural variation, and graph- or pangenome-based representations are needed for high-fidelity downstream analysis.
gnomAD-SV and population catalogues
Collins et al. 2020 reported gnomAD-SV v2: 433,371 structural variants >50 bp called from short-read whole-genome sequencing of 14,891 individuals, providing population allele frequencies stratified by ancestry. The dataset is the standard reference for structural-variant rarity in research and educational interpretation. Subsequent gnomAD releases (v3 / v4) expanded the cohort substantially and incorporated additional variant-class refinements; gnomAD continues to integrate long-read data as it scales.
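An ancestry-stratified allele frequency is the alternate-allele count divided by the number of chromosomes genotyped within each group. The sketch below assumes diploid autosomal genotypes coded as 0, 1, or 2 alternate alleles, with `None` for a missing call; the function name, the ancestry labels, and the sample data are hypothetical and do not reproduce the gnomAD-SV pipeline or its file format.

```python
from collections import defaultdict


def allele_frequencies(genotypes):
    """Per-ancestry allele frequency from (ancestry, alt_allele_count) pairs.

    alt_allele_count is 0, 1, or 2 for a diploid autosomal site, or None
    when the genotype is missing (missing calls are excluded from the
    denominator).
    """
    allele_count = defaultdict(int)
    allele_number = defaultdict(int)
    for ancestry, alt in genotypes:
        if alt is None:
            continue
        allele_count[ancestry] += alt
        allele_number[ancestry] += 2
    return {pop: allele_count[pop] / allele_number[pop] for pop in allele_number}


calls = [("AFR", 1), ("AFR", 0), ("EUR", 0), ("EUR", 0), ("EUR", 2), ("EUR", None)]
freqs = allele_frequencies(calls)
print(freqs)
```

With only a handful of samples per group the estimates are very coarse, which is why large cohorts are needed before a variant can be confidently described as rare.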
Other major catalogues: the 1000 Genomes Project structural-variation phase 3 release (Sudmant et al. 2015); the Human Pangenome Reference Consortium graph-based releases; DECIPHER for clinically annotated CNVs and associated phenotypes; DGV for the historical population CNV reference; ClinVar and ClinGen for variant-level clinical annotation. dbSNP covers SNPs and short indels (up to ~50 bp); larger structural variants are archived in its NCBI sister database, dbVar.
Why this matters for pedigree-modelling teaching
Structural variation is a substantial source of inherited and de novo disease that short-read panels can systematically miss. A negative gene-panel result on a family with a phenotype consistent with a Mendelian disorder may reflect a structural variant invisible to the panel rather than absence of a causative variant. The educational pedigree-modelling pages on this site (hereditary cancer risk assessment, Mendelian inheritance calculator, germline mosaicism calculator) are intended for research and teaching only; outputs from any of the 20 implemented risk models are illustrative, not a clinical recommendation. The Tyrer-Cuzick implementation is an IBIS-style approximation of the published Tyrer/Duffy/Cuzick 2004 algorithm, not the official IBIS Breast Cancer Risk Evaluator binary. BOADICEA is licensed by the University of Cambridge and is not bundled in Evagene; the platform exports a `##CanRisk 2.0` pedigree file for upload at canrisk.org when BOADICEA computation is wanted.
Key references
- International Human Genome Sequencing Consortium (Lander ES et al.). Initial sequencing and analysis of the human genome. Nature 409:860–921 (2001). PMID 11237011.
- Lander ES. Initial impact of the sequencing of the human genome. Nature 470:187–197 (2011). PMID 21307931.
- Cordaux R, Batzer MA. The impact of retrotransposons on human genome evolution. Nat Rev Genet 10:691–703 (2009). PMID 19763152.
- Sebat J et al. Large-scale copy number polymorphism in the human genome. Science 305:525–528 (2004). PMID 15273396.
- Conrad DF et al. Origins and functional impact of copy number variation in the human genome. Nature 464:704–712 (2010). PMID 19812545.
- Sudmant PH et al. An integrated map of structural variation in 2,504 human genomes. Nature 526:75–81 (2015). PMID 26432246.
- Collins RL et al. A structural variation reference for medical and population genetics. Nature 581:444–451 (2020). PMID 32461652.
- Chaisson MJP et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun 10:1784 (2019). PMID 30992455.
- Database of Genomic Variants (DGV): dgv.tcag.ca. ClinVar: ncbi.nlm.nih.gov/clinvar. DECIPHER: decipher.sanger.ac.uk. OMIM: omim.org.
Frequently asked questions
What fraction of the human genome is repetitive?
At least ~50%, and with improved annotation closer to two-thirds. The largest classes are LINEs, SINEs (including ~1.1 million Alu elements), LTR-containing endogenous retroviruses, and DNA transposons.
What is an Alu element?
A primate-specific SINE of ~300 bp derived from the 7SL signal-recognition-particle RNA. There are ~1.1 million copies in the human genome, mobilising in trans on L1 machinery; recently active subfamilies still produce polymorphic insertions in human populations.
What mechanisms produce copy number variants?
Non-allelic homologous recombination (NAHR) between flanking segmental duplications generates recurrent CNVs; NHEJ / MMEJ generates non-recurrent CNVs at random locations; replication-based mechanisms (FoSTeS, MMBIR) generate complex non-recurrent CNVs with templated segments.
What is gnomAD-SV?
The structural-variant subset of gnomAD. The reference v2 release (Collins et al. 2020, Nature 581:444) catalogued 433,371 SVs across 14,891 short-read genomes, with subsequent releases scaling further and incorporating long-read data.
How does long-read sequencing change SV detection?
Long-read platforms (PacBio HiFi, Oxford Nanopore) detect 20,000–25,000 SVs per genome compared to ~5,000–10,000 by short reads, and enable haplotype-resolved de novo assembly that recovers structural variation invisible to short-read alignment against a linear reference.