Molecular architecture of genes: DNA, chromatin, and the genome as a structured molecule
A gene is more than a string of letters. The molecular architecture of genes spans about ten orders of magnitude in length scale, from the ångström-scale hydrogen bonds between Watson–Crick base pairs to the metre-long DNA polymer folded inside a five-micron nucleus. This pillar surveys that architecture: nucleotide chemistry, the double helix, chromatin from nucleosome to topologically associating domain, the basics of replication and repair, and an overview of genome variation. It is written for researchers, educators, and advanced students.
Short version. A eukaryotic gene exists at several physical levels at once. Chemically it is a polymer of deoxyribonucleotides held in an antiparallel double helix. Mechanically it is wound around histone octamers into nucleosomes, then folded through 10 nm and (debated) 30 nm fibres into chromatin loops, topologically associating domains (TADs), A/B compartments, and ultimately into a chromosome. Functionally the higher-order architecture is read out by the same machinery that transcribes, replicates, and repairs the underlying sequence. This pillar links three companion pages: the chemistry and folding (DNA and chromatin organisation), the maintenance machinery (DNA replication and repair), and the variation observed across human genomes (genome structure and variation).
Nucleotide chemistry and the Watson–Crick double helix
Each strand of DNA is a polymer of four deoxyribonucleotides (deoxyadenosine, deoxyguanosine, deoxycytidine, and thymidine monophosphates) linked by phosphodiester bonds between the 3' hydroxyl of one sugar and the 5' phosphate of the next. The chain therefore has chemical polarity: a 5' end and a 3' end. Two strands run in opposite directions (antiparallel) and pair through hydrogen bonds between bases on opposing strands: adenine pairs with thymine through two hydrogen bonds, and guanine pairs with cytosine through three. This complementarity is the molecular basis for templated copying.
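The pairing and polarity rules above can be sketched in a few lines of Python. This is a minimal illustration, not a production sequence library: the function names `reverse_complement` and `hydrogen_bonds` are hypothetical, and the input is assumed to contain only the four unmodified bases.

```python
# Watson-Crick partners and the number of hydrogen bonds each pair forms.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
BONDS_PER_BASE = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5'->3', for a strand given 5'->3'."""
    # The partner runs antiparallel, so read the input backwards
    # while swapping each base for its Watson-Crick partner.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def hydrogen_bonds(strand: str) -> int:
    """Count Watson-Crick hydrogen bonds in the duplex formed by the strand."""
    # A-T pairs contribute two bonds, G-C pairs contribute three.
    return sum(BONDS_PER_BASE[base] for base in strand.upper())


example = "ATGCGT"
partner = reverse_complement(example)
bonds = hydrogen_bonds(example)
print(example, partner, bonds)
```

Applying `reverse_complement` twice returns the original strand, which is the property that lets either strand serve as a template for the other.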
The double helix proposed by Watson and Crick in 1953 is right-handed, with ~10.5 base pairs per turn under physiological conditions, a major and a minor groove of distinct widths and chemistries, and bases stacked roughly perpendicular to the helical axis. The 1953 model rested on Erwin Chargaff's base-composition rules and X-ray diffraction by Rosalind Franklin and Maurice Wilkins; the paper presenting it included the now-famous remark that it had not escaped the authors' notice that the specific pairing they postulated "immediately suggests a possible copying mechanism for the genetic material". Energetics matter: base stacking contributes more to duplex stability than hydrogen bonds, and GC-rich tracts melt at higher temperatures than AT-rich tracts. Alternative geometries (A-form, Z-form, Hoogsteen pairs, G-quadruplexes, i-motifs) appear under specific sequence and solvent conditions and have functional roles in transcription, replication, and telomere biology. The chemistry is laid out in detail in DNA and chromatin organisation.
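The dependence of melting temperature on base composition can be illustrated with the Wallace rule, a rough empirical estimate for short oligonucleotides (about 14 to 20 nucleotides) that is not part of the text above. It assigns 2 °C per A/T and 4 °C per G/C and ignores nearest-neighbour stacking and salt concentration, so treat the sketch below as a teaching approximation only; the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm_celsius(seq: str) -> float:
    """Wallace-rule melting temperature estimate for a short oligonucleotide."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    # Empirical rule of thumb: 2 degrees C per A/T, 4 degrees C per G/C.
    return 2.0 * at + 4.0 * gc


at_rich = "ATATATATATATAT"
gc_rich = "GCGCGCGCGCGCGC"
for oligo in (at_rich, gc_rich):
    print(oligo, gc_fraction(oligo), wallace_tm_celsius(oligo))
```

For two oligonucleotides of equal length, the GC-rich one receives the higher estimate, which matches the qualitative statement about GC-rich tracts. Quantitative work would use a nearest-neighbour thermodynamic model instead.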
Chromatin: from nucleosome to chromosome
The human haploid genome is ~3.05 gigabases, about one metre of B-form DNA; the diploid complement of a somatic cell, about two metres in total, is folded inside a nucleus of roughly five microns in diameter. The compaction problem is solved by chromatin: a hierarchy of folded states first characterised in Roger Kornberg's work in the 1970s and refined ever since.
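As a rough check on these lengths, the sketch below multiplies base-pair counts by the rise per base pair of B-form DNA. The 0.34 nm rise is a textbook value assumed here (it is not stated above), and comparing contour length with nucleus diameter is only a crude one-dimensional measure of compaction.

```python
RISE_NM_PER_BP = 0.34        # assumed textbook rise per base pair, B-form DNA
HAPLOID_BP = 3.05e9          # haploid genome size quoted in the text
NUCLEUS_DIAMETER_M = 5e-6    # nucleus diameter quoted in the text


def contour_length_m(base_pairs: float) -> float:
    """End-to-end length in metres of fully extended B-form DNA."""
    return base_pairs * RISE_NM_PER_BP * 1e-9


haploid_m = contour_length_m(HAPLOID_BP)
diploid_m = contour_length_m(2 * HAPLOID_BP)
# Crude linear ratio: extended diploid DNA length versus nucleus diameter.
linear_ratio = diploid_m / NUCLEUS_DIAMETER_M

print(f"haploid: {haploid_m:.2f} m, diploid: {diploid_m:.2f} m")
print(f"length / nucleus diameter: {linear_ratio:,.0f}")
```

On these assumptions one haploid genome is about a metre long and the diploid complement about two, several hundred thousand times the diameter of the nucleus that contains it.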
- Nucleosome (~10 nm). The fundamental repeat unit. ~147 base pairs of DNA make 1.65 left-handed superhelical turns around an octamer of histones (two each of H2A, H2B, H3, and H4). The 2.8 Å crystal structure was solved by Luger et al. 1997 and remains the canonical reference. Linker DNA (~20–80 bp depending on cell type) connects adjacent nucleosomes, often associated with linker histone H1.
- 10 nm fibre. The "beads on a string" arrangement seen in low-salt EM preparations.
- 30 nm fibre (debated). Two competing models — one-start solenoid and two-start zigzag — have dominated textbooks for decades. In situ evidence from cryo-EM tomography and ChromEMT now suggests that the regular 30 nm fibre may be largely an artefact of dilute conditions, and that interphase chromatin in the nucleus is a more disordered 5–24 nm polymer.
- Loops and topologically associating domains (TADs). Cohesin-mediated loop extrusion forms transient loops anchored at convergent CTCF binding sites. TADs — self-interacting domains hundreds of kilobases to a few megabases in size — were identified by Hi-C in Dixon et al. 2012 and Nora et al. 2012.
- A/B compartments and chromosome territories. At the largest scales, the genome partitions into A (active, gene-dense, early-replicating) and B (inactive, gene-poor, late-replicating) compartments, first described by Lieberman-Aiden et al. 2009. Each chromosome occupies a non-random territory in the interphase nucleus.
The functional consequences are not cosmetic. Transcription, replication, recombination, and repair all read and write across this folded substrate, and disruption of chromatin architecture — for example through CTCF-binding-site mutation or cohesin dysfunction — can alter gene expression at substantial genomic distance. The full hierarchy is treated in DNA and chromatin organisation.
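The idea that contacts are enriched within a TAD and depleted across its boundary can be made concrete with a toy insulation-style score. The sketch below is a simplification for teaching and not the domain-calling procedure of the cited Hi-C papers: the 8-bin contact matrix is invented, and `insulation_profile` is a hypothetical helper.

```python
def insulation_profile(contacts, window):
    """Mean contact frequency between the `window` bins on either side of each position.

    Position i sits between bin i-1 and bin i. Low values mean few contacts
    cross that position, which is the signature of a domain boundary.
    """
    n = len(contacts)
    profile = {}
    for i in range(window, n - window + 1):
        upstream = range(i - window, i)
        downstream = range(i, i + window)
        total = sum(contacts[a][b] for a in upstream for b in downstream)
        profile[i] = total / (window * window)
    return profile


# Invented toy matrix: two 4-bin domains with strong contacts inside each
# domain (10) and weak contacts between them (1).
N_BINS, DOMAIN_SIZE = 8, 4
toy_contacts = [
    [10 if a // DOMAIN_SIZE == b // DOMAIN_SIZE else 1 for b in range(N_BINS)]
    for a in range(N_BINS)
]

profile = insulation_profile(toy_contacts, window=2)
boundary = min(profile, key=profile.get)
print(profile, "boundary before bin", boundary)
```

The score dips at the position separating the two blocks. Real Hi-C data are noisy and distance-dependent, so practical boundary callers normalise the matrix first and look for local minima, not a single global one.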
Replication and repair: keeping the sequence intact
Every human cell division copies ~6 billion base pairs of DNA at fork rates of around 1–2 kilobases per minute, and does so with an error rate after proofreading and mismatch repair of roughly one mistake per 10⁹–10¹⁰ nucleotides. That fidelity is achieved by a layered system: high-fidelity replicative polymerases with 3'→5' exonuclease proofreading, post-replicative mismatch repair to correct slipped or mis-incorporated bases, and a series of damage-specific repair pathways for lesions introduced by metabolism, radiation, and chemicals.
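The figures above imply two useful back-of-envelope numbers: the expected count of replication errors per division, and the minimum number of forks that must be active to finish in time. The sketch below works them out. The 8-hour S phase and the mid-range fork rate of 1.5 kb per minute are assumptions made for illustration, not values from the text.

```python
import math

DIPLOID_BP = 6e9                 # base pairs copied per division (from the text)
FORK_RATE_BP_PER_MIN = 1500      # assumed mid-point of the 1-2 kb/min range
S_PHASE_MIN = 8 * 60             # assumed S-phase duration of 8 hours


def expected_errors(genome_bp: float, errors_per_nt: float) -> float:
    """Expected number of uncorrected errors per genome copy."""
    return genome_bp * errors_per_nt


def min_active_forks(genome_bp: float, fork_rate: float, s_phase_min: float) -> int:
    """Forks needed if every fork runs for the whole of S phase."""
    return math.ceil(genome_bp / (fork_rate * s_phase_min))


errors_high = expected_errors(DIPLOID_BP, 1e-9)    # one error per 10^9 nt
errors_low = expected_errors(DIPLOID_BP, 1e-10)    # one error per 10^10 nt
forks = min_active_forks(DIPLOID_BP, FORK_RATE_BP_PER_MIN, S_PHASE_MIN)
single_fork_years = DIPLOID_BP / FORK_RATE_BP_PER_MIN / (60 * 24 * 365)

print(f"errors per division: {errors_low:.1f} to {errors_high:.1f}")
print(f"forks needed: {forks}, single-fork time: {single_fork_years:.1f} years")
```

Under these assumptions a single fork would need several years to copy the genome, which is why replication has to start from many origins in parallel.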
Replication is licensed at thousands of origins by the origin recognition complex (ORC) and the MCM2-7 helicase, with origin firing strictly limited to once per cell cycle. The replisome carries a leading-strand polymerase (Pol ε in humans), a lagging-strand polymerase (Pol δ) that extends Okazaki fragments primed by Pol α-primase, the CMG helicase, PCNA as a sliding clamp, and several accessory factors. Repair pathways include mismatch repair (MutS/MutL homologues; the pathway underlying Lynch syndrome), nucleotide excision repair (XPA–XPG, TFIIH; xeroderma pigmentosum), base excision repair (DNA glycosylases, AP endonuclease), and double-strand break repair through homologous recombination (BRCA1, BRCA2, RAD51) or non-homologous end joining (KU70/80, DNA-PKcs, LIG4). The integrated DNA damage response is reviewed by Jackson and Bartek 2009. The full pathway-by-pathway treatment is in DNA replication and repair; the educational pedigree-modelling pages for Lynch syndrome and BRCAPRO illustrate how the underlying repair-pathway biology is represented in published family-history models for research and teaching use.
The genome as a sequenced object
The first draft of the human genome was published in February 2001 in two parallel papers: the publicly funded International Human Genome Sequencing Consortium (Lander et al. 2001) and the Celera Genomics whole-genome shotgun assembly (Venter et al. 2001). Both drafts put the protein-coding gene count at roughly 26,000–40,000; the consensus later settled at ~20,000 protein-coding genes, with another ~20,000 non-coding RNA genes catalogued since. The euchromatic reference was declared finished in 2004; the remaining gaps in centromeres, segmental duplications, and acrocentric short arms were filled by the Telomere-to-Telomere (T2T) Consortium in 2022 with long-read sequencing.
Two reframings followed. First, only ~1–2% of the genome encodes protein. The remainder is regulatory DNA, RNA genes of various kinds, repeats, transposable elements, and sequence whose function (if any) is debated. The ENCODE Project's 2012 integrative analysis assigned biochemical activity (transcription, transcription-factor binding, chromatin marks, DNase hypersensitivity) to ~80% of the genome. The interpretation of "function" remains contested, but the catalogue of regulatory elements is now the standard reference. Second, the genome is highly variable between individuals: short variants average one per ~1,000 bases, structural variants (>50 bp) collectively rearrange megabases per individual, and ~50% of the genome is repetitive. The variation landscape is treated in genome structure and variation.
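The percentages in this section are easier to compare once converted to base pairs. The short sketch below does the conversion, assuming the ~3.05 Gb haploid genome size given earlier in this pillar; the fractions are the approximate values quoted above, so the outputs are order-of-magnitude figures only.

```python
HAPLOID_BP = 3.05e9   # haploid genome size quoted earlier in this pillar


def bases(fraction: float) -> float:
    """Number of base pairs covered by a given fraction of the haploid genome."""
    return fraction * HAPLOID_BP


coding_low, coding_high = bases(0.01), bases(0.02)   # ~1-2% protein-coding
encode_active = bases(0.80)                          # ENCODE biochemical activity
repetitive = bases(0.50)                             # repetitive fraction
# One short variant per ~1,000 bases, per haploid genome copy.
short_variants = HAPLOID_BP / 1000

print(f"protein-coding: {coding_low / 1e6:.1f} to {coding_high / 1e6:.1f} Mb")
print(f"ENCODE-annotated: {encode_active / 1e9:.2f} Gb")
print(f"repetitive: {repetitive / 1e9:.2f} Gb")
print(f"short variants per haploid genome: {short_variants / 1e6:.2f} million")
```

On these figures an individual carries on the order of three million short variants per haploid genome copy, while the protein-coding fraction amounts to only tens of megabases.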
Three companion topics
This pillar opens onto three sub-pages, each treating one face of the molecular architecture in depth:
- DNA and chromatin organisation — nucleotide chemistry, base-pairing thermodynamics, supercoiling and topoisomerases, the histone octamer and nucleosome core particle, the 30 nm fibre debate, Hi-C methodology, A/B compartments, TADs, and the polymer-physics view of the interphase chromosome.
- DNA replication and repair — origins of replication and licensing, the replisome, leading and lagging strand mechanics, the four families of DNA polymerase, mismatch repair (Lynch syndrome biology), nucleotide and base excision repair (xeroderma pigmentosum), and double-strand break repair (BRCA1/BRCA2, KU70/80, NHEJ vs HR).
- Genome structure and variation — the repetitive and transposable-element fraction (LINEs, SINEs, LTRs, simple-sequence repeats, Alu elements), copy-number variation and the mechanisms that produce it (NAHR, NHEJ, FoSTeS), microdeletion and microduplication syndromes, the rise of long-read sequencing for structural-variant detection, and the gnomAD-SV catalogue.
Why molecular architecture matters for pedigree-modelling teaching
Family-history pedigree modelling sits one level up from the chemistry, but the chemistry constrains it. Mendelian segregation is a consequence of meiotic chromosome behaviour. Penetrance and expressivity are partly modulated by chromatin-level cis-regulation. De novo variation, mosaicism, and parent-of-origin effects derive directly from the replication and repair machinery. Structural variation that is invisible to short-read panels can underlie an apparently negative test result. A working understanding of molecular architecture is what makes a pedigree interpretation defensible rather than ritualistic.
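The link from meiotic segregation to pedigree probabilities can be shown with the simplest case: one autosomal locus with two alleles, where each parent transmits either allele with equal probability. The sketch below enumerates offspring genotypes under that assumption. It is an illustration only, not the algorithm of any Evagene tool or published risk model, and it ignores de novo mutation, mosaicism, and penetrance; `offspring_genotypes` is a hypothetical helper.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotypes(parent1: str, parent2: str) -> dict:
    """Genotype probabilities for a child, given two-letter parental genotypes.

    Each parent passes on one of its two alleles with probability 1/2,
    independently of the other parent (Mendel's law of segregation).
    """
    # Sort the allele pair so that "aA" and "Aa" count as the same genotype.
    counts = Counter(
        "".join(sorted(allele1 + allele2))
        for allele1, allele2 in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


carrier_cross = offspring_genotypes("Aa", "Aa")
test_cross = offspring_genotypes("Aa", "aa")
print(carrier_cross)
print(test_cross)
```

For two heterozygous parents the enumeration gives the familiar 1:2:1 genotype ratio. Pedigree models build on the same transmission probabilities but condition them on observed phenotypes across many relatives.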
Evagene's pedigree-modelling tools — the pedigree drawing tool, the Mendelian inheritance calculator, the hereditary cancer risk assessment family-history workflow — are intended for research, education, and teaching. Outputs from any of the 20 implemented risk models (Claus 1994, Couch 1997, Frank 2002, Evans 2004, Vasen 1999, Umar 2004, Gail 1989, Tyrer/Duffy/Cuzick 2004, BayesMendel BRCAPRO/MMRpro/PancPRO, family-history scoring) are illustrative, for research and teaching only, and not a recommendation. The Tyrer-Cuzick implementation is an IBIS-style approximation of the published Tyrer/Duffy/Cuzick 2004 algorithm, not the official IBIS Breast Cancer Risk Evaluator binary. BOADICEA is licensed by the University of Cambridge and is not bundled in Evagene; the platform exports a `##CanRisk 2.0` pedigree file that the user uploads at canrisk.org when BOADICEA computation is wanted.
Key references
- Watson JD, Crick FHC. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171:737–738 (1953). PMID 13054692.
- Luger K, Mäder AW, Richmond RK, Sargent DF, Richmond TJ. Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature 389:251–260 (1997). PMID 9305837.
- Lieberman-Aiden E et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326:289–293 (2009). PMID 19815776.
- Dixon JR et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485:376–380 (2012). PMID 22495300.
- Nora EP et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485:381–385 (2012). PMID 22495304.
- ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74 (2012). PMID 22955616.
- International Human Genome Sequencing Consortium (Lander ES et al.). Initial sequencing and analysis of the human genome. Nature 409:860–921 (2001). PMID 11237011.
- Venter JC et al. The sequence of the human genome. Science 291:1304–1351 (2001). PMID 11181995.
- Jackson SP, Bartek J. The DNA-damage response in human biology and disease. Nature 461:1071–1078 (2009). PMID 19847258.
Frequently asked questions
What is meant by the molecular architecture of genes?
The physical and chemical organisation of genetic material across many length scales: nucleotide chemistry, the antiparallel double helix, nucleosomes, chromatin fibres, loops, TADs, A/B compartments, and chromosome territories.
Who discovered the structure of DNA?
Watson and Crick proposed the antiparallel double helix in 1953 (Nature 171:737), drawing on Franklin and Wilkins's X-ray diffraction data and Chargaff's base-composition rules.
What is a nucleosome?
~147 bp of DNA wrapped 1.65 times around a histone octamer (two each of H2A, H2B, H3, H4). Crystal structure: Luger et al. 1997 at 2.8 Å.
What is a topologically associating domain (TAD)?
A self-interacting genomic region (typically hundreds of kb to a few Mb) within which contacts are enriched and across whose boundaries contacts are depleted. Identified in 2012 by Hi-C (Dixon et al.; Nora et al.).
How big is the human genome?
~3.05 gigabases haploid. Public draft 2001 (Lander et al., Nature 409:860); Celera draft (Venter et al., Science 291:1304); telomere-to-telomere completion 2022.
Is Evagene a clinical decision tool?
No. Evagene is an academic, research, and educational pedigree modelling platform. It is not a medical device and outputs are illustrative for research and teaching only.