Pedigree OCR: scanning a pedigree chart into structured clinical data
What pedigree image OCR does, when it is the right tool, how accurate it is, and how Evagene's image import pipeline helps recover pedigrees from hand-drawn sketches, PDFs, scans, and faxes — always with clinician review at the end.
Short version. A great deal of clinical pedigree data still lives as images: hand-drawn sketches in paper notes, scanned PDFs of older clinical letters, PDF exports from legacy pedigree software that no longer runs, and faxes. Pedigree OCR is the data-recovery feature that turns these images into structured data that a modern clinical platform can analyse, annotate, and integrate with risk models. It is not a replacement for clinician review, because AI vision models make errors that only a careful reading against the source will catch, but it removes most of the manual re-entry work and lets the clinician focus on verification rather than transcription. Evagene's implementation treats it as one import pathway among several (GEDCOM, JSON, 23andMe, XEG, image), with a consistent review step.
Why pedigree images still matter
Clinical genetics has a long history in written notes and hand-drawn diagrams. Many pedigrees that a modern service inherits are not in a structured data format: they are ink on paper, or scans of ink on paper, or screenshots from older software. Examples include:
- Hand-drawn pedigrees in paper notes. Ten or twenty years of notes from a service that predates digital pedigree tools.
- Scanned pedigrees in referral letters. A cardiology clinic letter attaches a sketched pedigree; the clinical geneticist receives a PDF.
- Legacy software exports. A service that used to run an older pedigree program has PDF printouts but no way to read the native file format in 2026.
- Pedigrees in publications. A family description in a paper has a pedigree figure that a researcher wants to bring into a modern workspace.
- Patient-provided pedigrees. A patient brings a pedigree they drew themselves on a piece of paper.
For each of these, manually redrawing the pedigree in a modern tool is the default, but it is slow, error-prone, and a source of avoidable variation. Pedigree OCR aims to get the clinician past the transcription step and into the review step directly.
What "OCR" means in this context
Traditional OCR (optical character recognition) reads text. Pedigree OCR is better thought of as pedigree recognition: recovering both text labels and graphical structure from an image. Modern implementations combine two kinds of model:
- Computer vision for shapes and relationships. Detecting circles and squares (and diamonds, triangles, lines through shapes), identifying partner lines and child bars, segmenting affected status from shading, detecting twin lines.
- AI vision and text models for annotations. Reading names, ages, diagnoses, notes, and free text associated with individuals.
The overall output is a structured pedigree with nodes, edges, and annotations that matches — as closely as possible — the pedigree depicted in the image. The closer the input matches standard NSGC notation, the better the output.
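To make "nodes, edges, and annotations" concrete, here is a minimal sketch of what such a structured pedigree could look like in Python. The class and field names (Individual, Partnership, Pedigree, parents_of) are illustrative assumptions, not Evagene's data model or any standard schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Individual:
    id: str
    sex: str                      # "male", "female", or "unknown"
    affected: bool = False        # read from shading, per the image legend
    label: Optional[str] = None   # free text found next to the symbol


@dataclass
class Partnership:
    partners: Tuple[str, str]
    children: List[str] = field(default_factory=list)
    consanguineous: bool = False  # hypothetical flag for a related couple


@dataclass
class Pedigree:
    individuals: Dict[str, Individual] = field(default_factory=dict)
    partnerships: List[Partnership] = field(default_factory=list)

    def add(self, person: Individual) -> None:
        self.individuals[person.id] = person

    def parents_of(self, child_id: str) -> Optional[Tuple[str, str]]:
        # Return the partner pair whose child list contains this individual.
        for p in self.partnerships:
            if child_id in p.children:
                return p.partners
        return None


# A three-person example: two parents and one daughter.
ped = Pedigree()
ped.add(Individual("I-1", "male"))
ped.add(Individual("I-2", "female", affected=True, label="BrCa 45y"))
ped.add(Individual("II-1", "female"))
ped.partnerships.append(Partnership(("I-1", "I-2"), children=["II-1"]))
print(ped.parents_of("II-1"))
```

Keeping partnerships as their own records, rather than storing parent pointers on each child, mirrors how a drawn pedigree hangs children from the partner line.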
Accuracy and its limits
Pedigree OCR works well when the input is clear and standard; it works less well when the input is idiosyncratic, low-resolution, or heavily annotated with free-text labels. It is worth being explicit about where errors concentrate.
Typical good results. Basic family structure — number of individuals per generation, parent-child relationships, sex of each individual — is typically recovered with high accuracy when the pedigree uses standard shapes. Standard legend annotations (filled = affected for one condition, hatched = carrier, and so on) transfer well when the legend is present on the image.
Where errors concentrate. Twins are often misread, because the difference between monozygotic and dizygotic twins rests on small details of how the descent lines are drawn. Complex consanguineous loops can confuse the structure detector. Multiple shading patterns for multiple conditions can be conflated. Handwritten labels in small or unusual scripts are less reliably read. Ages at diagnosis written in abbreviations (e.g. "Dx 42" or "BrCa 45y") can be misinterpreted.
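One way to handle abbreviated notes such as "Dx 42" or "BrCa 45y" is to parse only what is unambiguous and flag everything else for the reviewer. The sketch below assumes a small, hypothetical abbreviation table and a deliberately narrow pattern; a real service would maintain its own vocabulary.

```python
import re
from typing import NamedTuple, Optional


class DiagnosisNote(NamedTuple):
    raw: str
    condition: Optional[str]
    age: Optional[int]
    needs_review: bool


# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "brca": "breast cancer",
    "ovca": "ovarian cancer",
    "crc": "colorectal cancer",
}

# A short alphabetic tag, then an age, then an optional year suffix.
PATTERN = re.compile(
    r"^\s*(?P<tag>[A-Za-z]+)\.?\s*(?P<age>\d{1,3})\s*(?:y|yrs?|yo)?\s*$"
)


def parse_note(raw: str) -> DiagnosisNote:
    m = PATTERN.match(raw)
    if m is None:
        # Anything that does not fit the pattern goes to the reviewer as-is.
        return DiagnosisNote(raw, None, None, True)
    age = int(m.group("age"))
    condition = ABBREVIATIONS.get(m.group("tag").lower())
    # "Dx 42" yields an age but no condition, so it is still flagged.
    needs_review = condition is None or age > 120
    return DiagnosisNote(raw, condition, age, needs_review)


for text in ("BrCa 45y", "Dx 42", "see letter"):
    print(parse_note(text))
```

The original text is kept in every result so the reviewer can always compare the parsed value with what was written on the image.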
Implications. A good pedigree OCR result is not a finished pedigree. It is a starting point that a clinician reviews against the source image and corrects as needed. The time saving is in the 70-90% of the pedigree that transfers cleanly, not in eliminating the review step.
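A service that wants to know how much of its own material transfers cleanly could compare each extracted pedigree with the version the clinician finally saved. The sketch below is one possible field-level measure, assuming both versions are plain dictionaries keyed by individual ID; it is not how any figure quoted in this article was derived.

```python
from typing import Dict


def transfer_rate(extracted: Dict[str, dict], reviewed: Dict[str, dict]) -> float:
    """Fraction of reviewed fields that the extraction already had right."""
    total = 0
    matched = 0
    for person_id, reviewed_fields in reviewed.items():
        # An individual missed by the extraction counts as all fields wrong.
        extracted_fields = extracted.get(person_id, {})
        for key, value in reviewed_fields.items():
            total += 1
            if extracted_fields.get(key) == value:
                matched += 1
    return matched / total if total else 0.0


extracted = {
    "I-1": {"sex": "male", "affected": False},
    "I-2": {"sex": "female", "affected": True},
    "II-1": {"sex": "female", "affected": False},
}
reviewed = {
    "I-1": {"sex": "male", "affected": False},
    "I-2": {"sex": "female", "affected": True},
    "II-1": {"sex": "female", "affected": True},   # shading was missed
    "II-2": {"sex": "male", "affected": False},    # individual was missed
}
rate = transfer_rate(extracted, reviewed)
print(f"{rate:.1%} of reviewed fields transferred unchanged")
```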
A practical workflow
A disciplined workflow turns pedigree OCR from a novelty into a reliable clinical tool.
- Capture a clean image. A sharp photograph taken in good light, a high-resolution scan, or a PDF export works best. Avoid shadows, glare, and low-resolution phone screenshots where possible.
- Upload and extract. The software runs the image through its vision pipeline and returns a candidate structured pedigree.
- Compare side-by-side. A good implementation shows the extracted pedigree next to the original image so the clinician can scan for discrepancies without switching tabs.
- Correct methodically. Work through the extracted pedigree one generation at a time, checking structure first, then sex and affected status, then annotations. Use the original image to resolve any ambiguity.
- Save as a structured pedigree. Once reviewed, save the pedigree in the normal data model; downstream analysis (risk models, AI interpretation, exports) applies as to any other pedigree.
- Keep the source image. Attach the original image to the pedigree for provenance. A future clinician should be able to trace any entry back to its source.
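The steps above can be sketched as a draft object that cannot be saved until a clinician has signed it off, and that carries a hash of the source image for provenance. All names here (ImportDraft, correct, sign_off, save) are hypothetical and only illustrate the discipline; they do not describe any particular product's API.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ImportDraft:
    image_bytes: bytes                 # the uploaded source image
    candidate: Dict[str, dict]         # extracted pedigree, keyed by individual ID
    reviewed_by: str = ""
    corrections: List[str] = field(default_factory=list)

    def correct(self, person_id: str, key: str, value, note: str) -> None:
        # Apply a reviewer correction and keep a short audit note.
        self.candidate.setdefault(person_id, {})[key] = value
        self.corrections.append(f"{person_id}.{key}: {note}")

    def sign_off(self, clinician: str) -> None:
        self.reviewed_by = clinician

    def save(self) -> dict:
        # Refuse to commit anything that has not been reviewed.
        if not self.reviewed_by:
            raise RuntimeError("draft has not been reviewed")
        return {
            "pedigree": self.candidate,
            "provenance": {
                "source_sha256": hashlib.sha256(self.image_bytes).hexdigest(),
                "reviewed_by": self.reviewed_by,
                "corrections": list(self.corrections),
            },
        }


draft = ImportDraft(b"placeholder image bytes", {"I-1": {"sex": "male"}})
draft.correct("I-1", "sex", "female", "circle misread as square")
draft.sign_off("Dr Example")
record = draft.save()
print(record["provenance"]["corrections"])
```

Storing a hash alongside the attached image lets a later reader confirm that the image on file is the one the extraction was based on.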
When pedigree OCR is not the right tool
For routine clinical work on current cases, OCR is overkill. Drawing a new pedigree directly in a modern tool using gesture drawing or keyboard shortcuts is usually faster and more accurate than drawing, photographing, and then extracting. Pedigree OCR is best reserved for two scenarios.
Data recovery. Where the pedigree exists only as an image — paper, scan, PDF, fax — and no structured export is available. Here OCR is genuinely time-saving and often the only practical path.
Patient-supplied pedigrees. Where a patient brings a pedigree they drew themselves. Scanning it in, reviewing, and correcting is usually faster than asking them to recite the family structure while the clinician draws.
For new pedigrees being constructed from consultation, live drawing in a modern tool is the better default.
Data privacy considerations
A pedigree image may contain identifying information about the proband and their relatives. Treat it with the same privacy discipline as any other clinical document. Check where the image is uploaded, how the extraction pipeline handles it, whether the image is retained, and for how long. A good platform will retain the image only as long as you need it and will encrypt it at rest.
For particularly sensitive documents — research pedigrees, forensic or medico-legal material — consider redacting identifiers before upload and re-adding them in the reviewed pedigree.
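Re-adding identifiers after review is easiest when the redacted copy carries neutral placeholders and the key never leaves the local machine. The sketch below assumes names were replaced by labels such as "P01" before upload; the function names and the dictionary layout are illustrative assumptions.

```python
from typing import Dict, List


def pseudonymise(names: List[str]) -> Dict[str, str]:
    """Build a placeholder-to-name key; the key itself stays local."""
    return {f"P{i:02d}": name for i, name in enumerate(names, start=1)}


def restore(pedigree: Dict[str, dict], key: Dict[str, str]) -> Dict[str, dict]:
    """Return a copy of the reviewed pedigree with real names re-added."""
    restored = {}
    for person_id, fields in pedigree.items():
        updated = dict(fields)
        label = updated.get("label")
        if label in key:
            updated["label"] = key[label]
        restored[person_id] = updated
    return restored


key = pseudonymise(["Jane Example", "John Example"])
reviewed = {
    "I-1": {"sex": "female", "label": "P01"},
    "I-2": {"sex": "male", "label": "P02"},
    "II-1": {"sex": "female", "label": None},
}
restored = restore(reviewed, key)
print(restored["I-1"]["label"])
```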
How Evagene supports pedigree OCR
Evagene accepts pedigree images (PNG, JPEG, TIFF, PDF) through the same import interface that handles GEDCOM, JSON, 23andMe raw data, and XEG. The image is processed through a vision pipeline that extracts candidate structure, affected status, and annotations, and the result is presented alongside the uploaded image for review. Once the clinician is satisfied, the pedigree is saved in the same structured data model as any other Evagene pedigree and is available to all downstream analysis: BayesMendel risk models (BRCAPRO, MMRpro, PancPRO), the Mendelian inheritance calculator, AI interpretation, and batch risk screening.
The original image is attached to the saved pedigree for provenance, so any clinician revisiting the record can see what the extraction was based on. The data is encrypted at rest and in transit, and follows the same access controls as any other pedigree in Evagene. OCR, like every other import pathway, ends with clinician review; Evagene does not commit extracted data to a pedigree without explicit save by the user.
For services inheriting a backlog of paper or scanned pedigrees, this can translate into a substantial time saving across a data-recovery project, without compromising the discipline of clinician verification.
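When preparing a backlog, it can help to confirm that each file really is one of the four accepted formats before uploading, since scanned archives often carry wrong extensions. The sketch below checks the well-known leading bytes of PNG, JPEG, TIFF, and PDF files; the function name is hypothetical and this is a local pre-check, not part of Evagene's interface.

```python
from typing import Optional

# Leading byte signatures for the four image formats named above.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"II*\x00": "TIFF",   # little-endian TIFF
    b"MM\x00*": "TIFF",   # big-endian TIFF
    b"%PDF-": "PDF",
}


def sniff_format(header: bytes) -> Optional[str]:
    """Return the format name if the header matches a known signature."""
    for signature, name in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


print(sniff_format(b"%PDF-1.7\n"))
print(sniff_format(b"GIF89a"))
```

In practice the header would come from reading the first few bytes of each file, for example with open(path, "rb").read(8).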
Frequently asked questions
What is pedigree OCR?
Extraction of structured pedigree data (individuals, relationships, diagnoses) from an image — hand-drawn paper, scan, PDF, fax. Combines computer vision for shape and relationship detection with AI vision models for text labels.
Why would I need pedigree OCR?
To recover pedigrees that exist only as images, without re-entering them by hand. It is a data-recovery feature rather than a routine input method.
How accurate is it?
Varies with image quality and drawing style. Structure is typically recovered well; twins, consanguineous loops, multiple-condition shading, and handwritten notes are where errors concentrate. Clinician review is always required.
What is the workflow?
Upload image, AI extracts candidate pedigree, clinician reviews alongside the source image, corrects errors, saves as a structured pedigree. The original image is retained for provenance.
Is OCR a substitute for clinical review?
No. OCR is a time-saver for data entry, not a replacement for clinician judgement. Any extracted pedigree must be reviewed before clinical use.
What formats does Evagene accept?
PNG, JPEG, TIFF, and PDF. Higher resolution produces better results. The extracted pedigree enters the same data model as a pedigree drawn from scratch.