LLM pedigree analysis: AI clinical genetics interpretation with BYOK, templates, and audit

A practical guide for clinicians and platform owners on using large language models to assist clinical genetics interpretation: where they help, where they fail, and how Bring Your Own Key, Analysis Templates, and audit trails make AI use defensible at the service level.

| 14 min read

Short version. Modern LLMs are useful for drafting pedigree-based clinical narratives: summarising family history, flagging inheritance patterns consistent with the data, identifying data gaps, and proposing testing considerations. They are not useful as clinical decision-makers; hallucination risk, model drift, and clinical liability are real. Evagene treats AI as a drafting aid and designs around the limits. Bring Your Own Key (BYOK) keeps LLM traffic inside your organisation's existing model-provider agreements — Anthropic Claude and OpenAI GPT are supported, with keys encrypted at rest using Fernet. Analysis Templates with variable injection ({{pedigree_description}}, {{proband_name}}, {{disease_list}}, {{risk_summary}}) make interpretation reproducible across a service. Every AI call is logged with timestamp, template, pedigree, and model identifier so clinical governance can audit outputs when model versions change.

This page is honest about what LLMs do well in clinical genetics and what they do badly, and explains the platform-level features that help teams use them responsibly.

What LLMs are genuinely good at

With a well-structured pedigree description as input, a current-generation LLM (Claude 4, Claude Sonnet, GPT-4o class or better) produces draft content that is useful for clinicians in specific ways:

  • Narrative summarisation. Turning a structured description of a 60-person family into a short, readable paragraph that highlights the clinically important features for a report.
  • Inheritance-pattern flagging. Noting that affected individuals in two generations are consistent with autosomal dominant inheritance with likely variable expressivity, or that a male-only affected pattern in maternal uncles is suggestive of X-linked recessive.
  • Data-gap identification. Spotting that age of onset is missing for three key affected relatives, or that a second-degree relative's cancer type is noted but no age at diagnosis is given.
  • Drafting structured sections. Producing a first-cut "family history summary", "key findings", "data limitations", and "screening considerations" that a clinician then edits.
  • Translation and simplification. Generating patient-friendly explanations of a finding at an appropriate reading level.

These are drafting tasks. The clinician's job is review and judgement; the model's job is reducing the typing.

What LLMs are bad at (and why it matters clinically)

The failure modes that matter in clinical genetics:

  • Hallucination. The model generates a confident, plausible statement that is not grounded in the input. In clinical interpretation, a hallucinated "Lynch syndrome pattern" or a fabricated quantitative risk figure is dangerous. Mitigation: ground the prompt in the structured pedigree description that Evagene generates deterministically; instruct the model to restrict statements to that input; require clinician review.
  • Model drift. Upgrading from one model version to another changes outputs for the same input, sometimes subtly. Consistency across a service — a key property of good clinical practice — is affected. Mitigation: pin model versions in Analysis Templates; record the model identifier in the audit log; re-validate templates when model versions change.
  • Clinical liability. No general-purpose LLM is a certified medical device. Its output is not a clinical decision; it is a draft. Mitigation: make this explicit in the interpretation document, require clinician sign-off, and do not present AI output directly to patients without review.
  • Confident silence. Models tend to produce output even when they should refuse. If a template is poorly phrased, it can produce an interpretation of a pedigree with insufficient data. Mitigation: prompt templates should include explicit "if data is insufficient, say so" instructions.
  • Prompt injection. Free-text fields on individuals can in principle contain content that nudges the model. Mitigation: sanitise and structure the data before it reaches the prompt; keep free-text fields clearly quoted and labelled.

None of these limits rule out LLM use in clinical genetics. They shape how it should be done.

BYOK: your keys, your provider

Evagene's AI interpretation engine is designed around Bring Your Own Key. Rather than Evagene holding a shared model-provider account and proxying your clinical data through it, each organisation configures its own Anthropic and/or OpenAI API key. The implications:

  • Traffic path. LLM calls go from Evagene directly to your chosen provider using your key. No intermediate vendor sits in between seeing clinical text.
  • Data-processing agreements. Your organisation has (or procures) a DPA with Anthropic or OpenAI under your terms. The LLM use is covered by agreements your governance has already reviewed.
  • Model choice and cost. You choose the model, the rate limits, and the spend. BYOK lets you pick Claude Opus for highest quality when it matters, or GPT-4o-mini for a research cohort where cost dominates.
  • Key handling. Keys are encrypted at rest in Evagene using Fernet (AES-128-CBC + HMAC-SHA256); the plaintext is never written to disk.
  • Quota bypass. BYOK bypasses Evagene's standard daily interpretation quota; your own provider quota applies instead.

For most clinical services, this is the model that makes AI interpretation procurable. If your organisation cannot risk-assess a new sub-processor just to use AI interpretation, the right question is whether it can re-use an existing Anthropic or OpenAI relationship. With BYOK, it can.

Analysis Templates for reproducible interpretation

Evagene's Analysis Templates are reusable prompt templates with two halves — a system prompt (the model's role and guardrails) and a user prompt (the task, with variable injection) — plus model configuration. Their purpose is to codify a service's interpretation style so every clinician gets consistent drafts.

The injected variables Evagene exposes include:

  • {{pedigree_description}} — deterministic natural-language rendering of the pedigree.
  • {{proband_name}}, {{proband_sex}}, {{proband_dob}}
  • {{disease_list}} — structured list of diseases annotated in the pedigree.
  • {{risk_summary}} — output of risk models (BRCAPRO, MMRpro, PancPRO, Mendelian) if previously computed.

A sketch of a template's shape (not a copy-paste prompt):

[system]
You are drafting a clinical genetics family history summary for
a UK genetics service. Structure output as: Family history summary;
Inheritance pattern assessment; Data limitations; Screening
considerations. If data is insufficient for any section, say so
explicitly. Do not invent quantitative risks.

[user]
Proband: {{proband_name}} (sex: {{proband_sex}}, DOB: {{proband_dob}})
Diseases annotated: {{disease_list}}
Risk models computed: {{risk_summary}}

Pedigree description:
{{pedigree_description}}

Draft the summary now.

Templates are versioned, can be shared across a service, and can be pinned to a specific model version. Running the same template against the same pedigree with the same model produces reproducible output — a material step up from ad-hoc prompting.

Audit trails

Every AI interpretation call made through Evagene is logged with the API key, timestamp, template ID and version, pedigree ID, model identifier (including version), token counts, and the generated output. This lets a clinical governance review:

  • Reconstruct exactly what input the model saw and which model version produced the output.
  • Identify and review all AI-assisted interpretations produced under a given template version.
  • Detect whether a model upgrade has shifted outputs in ways clinicians did not expect.
  • Revoke a compromised API key and enumerate which pedigrees were interpreted under it.

In a clinical-governance sense, AI use becomes a thing a service can defend: "this letter was drafted by model X version Y on date Z from template T, and reviewed by clinician C."

How this works in Evagene

The moving parts on the Evagene platform:

  • Analysis Templates let you define system and user prompts with variable injection and pin a model.
  • BYOK routes LLM calls through your Anthropic or OpenAI key; Fernet-encrypted at rest.
  • Pedigree description generation produces the deterministic natural-language rendering that {{pedigree_description}} substitutes into prompts.
  • Risk analysis outputs (BRCAPRO, MMRpro, PancPRO, Mendelian) feed into {{risk_summary}}, so AI interpretation can reference computed risks rather than the model guessing at them.
  • REST API exposes AI interpretation as a first-class endpoint; responses can be consumed synchronously for short interpretations or subscribed via the analysis.completed webhook.
  • MCP server does not call LLMs itself — it exposes pedigree tools to your AI client (Claude Desktop, Claude Code), whose own model is doing the reasoning. BYOK applies to Evagene's server-side templates, not to MCP traffic.
  • Audit log records every interpretation call for governance review.

Sensible use in a clinical service

A starter policy for a service bringing AI interpretation online:

  • Treat AI output as a draft. Every output gets clinician review before reaching the clinical record.
  • Use templates, not ad-hoc prompts. Review and version-control templates at service level.
  • Pin model versions. Upgrade deliberately and re-validate templates against the new version.
  • Use BYOK so AI traffic sits inside your existing provider agreements.
  • Log everything. Make the audit trail part of the governance process.
  • Flag AI involvement in reports that reach patients or referring clinicians.

This is similar in spirit to how services adopted speech-to-text transcription in clinical settings: useful tool, known failure modes, workflow redesigned to account for them.

Frequently asked questions

Can an LLM interpret a pedigree?

Modern LLMs draft structured narratives, flag inheritance patterns, and identify data gaps. They are drafting aids, not decision-makers.

What are the limits?

Hallucination, model drift, and clinical liability. Mitigated by grounded prompts, pinned model versions, clinician review, and audit.

What is BYOK?

Your Anthropic or OpenAI key handles all LLM traffic. Keys are Fernet-encrypted at rest. No intermediate vendor.

How do Analysis Templates help?

They capture the service's interpretation style in reusable, versioned prompt templates with variable injection. Reproducibility and consistency follow.

What about audit?

Every AI call is logged with timestamp, key, template, pedigree, model identifier, and output. Governance can reconstruct any interpretation.

Which models are supported?

Anthropic Claude and OpenAI GPT via BYOK. Model choice is per-request; model IDs are recorded in the audit log.

Related reading

Evaluate Evagene for your service

Join the Alpha waiting list. No credit card, no enterprise sales cycle — free access during Alpha for clinicians and research teams.

Join the Alpha Waiting List