Finding the undiagnosed: surfacing rare disease patients in EHR data

May 6, 2026 · Robert Smith

A recurring theme in rare disease is that the answer was hiding in plain sight. By the time a patient is finally diagnosed, the trail of clues (a pattern of labs, a cluster of symptoms, a sequence of specialist visits) is usually already in their electronic health record. It was simply never connected.

That observation is the foundation of computational patient finding: instead of waiting for a clinician to recognize a rare pattern one chart at a time, you search the whole population’s existing data for the patients who match.

What “computational phenotyping” actually means

A phenotype is a precise, structured definition of who counts as a case: the codes, lab thresholds, medications, and temporal patterns that characterize a disease, plus the inclusions and exclusions that keep it specific. Running that definition across a health system’s data is computational phenotyping, and it’s an established, peer-reviewed approach. Researchers have used EHR-based phenotyping to estimate rare disease prevalence and identify candidate patients at scale (see “Estimating prevalence of rare genetic disease diagnoses using electronic health records” under Sources).
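To make the idea of a structured definition concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the class names, the placeholder codes (DX-SYMPTOM-A, LAB-X, DX-COMMON-EXPLANATION), the threshold of 200, and the one-year window do not come from any real disease definition or terminology.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Event:
    """One coded fact from a patient's record."""
    kind: str                      # "dx" (diagnosis) or "lab"
    code: str                      # hypothetical placeholder code
    when: date
    value: Optional[float] = None  # numeric result, labs only


@dataclass(frozen=True)
class PhenotypeDefinition:
    """Hypothetical definition: inclusion, lab threshold, timing, exclusion."""
    name: str
    inclusion_dx: FrozenSet[str]   # at least one of these must be present
    exclusion_dx: FrozenSet[str]   # any of these rules the patient out
    lab_code: str
    lab_min: float                 # a result at or above this is abnormal
    min_abnormal_labs: int         # assumed to be >= 1
    window: timedelta              # abnormal results must fall within this span


def matches(defn: PhenotypeDefinition, events: List[Event]) -> bool:
    dx_codes = {e.code for e in events if e.kind == "dx"}
    if dx_codes & defn.exclusion_dx:
        return False
    if not dx_codes & defn.inclusion_dx:
        return False
    abnormal = sorted(
        e.when for e in events
        if e.kind == "lab" and e.code == defn.lab_code
        and e.value is not None and e.value >= defn.lab_min
    )
    # Temporal pattern: N abnormal results inside one sliding window.
    n = defn.min_abnormal_labs
    return any(
        abnormal[i + n - 1] - abnormal[i] <= defn.window
        for i in range(len(abnormal) - n + 1)
    )


defn = PhenotypeDefinition(
    name="example-disorder",
    inclusion_dx=frozenset({"DX-SYMPTOM-A"}),
    exclusion_dx=frozenset({"DX-COMMON-EXPLANATION"}),
    lab_code="LAB-X",
    lab_min=200.0,
    min_abnormal_labs=2,
    window=timedelta(days=365),
)

patient_events = [
    Event("dx", "DX-SYMPTOM-A", date(2023, 1, 10)),
    Event("lab", "LAB-X", date(2023, 2, 1), 240.0),
    Event("lab", "LAB-X", date(2023, 9, 15), 310.0),
]

print(matches(defn, patient_events))
```

The point of the sketch is that the definition is data, separate from the matching logic, so the same criteria can be reviewed, compared, and rerun without touching code.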

Done well, it produces a ranked, explainable list of candidates, each with the specific evidence that flagged them, rather than an opaque score.

Why real-world data, in place

Rare disease signals are subtle and longitudinal, so the method depends on three things:

Breadth of data. Real-world clinical data (diagnoses, labs, medications, procedures, notes) captured over years, not a single visit.
A standard format. HL7 FHIR gives that data a common shape, so a definition written once can run against the records a health system already produces.
Analysis where the data lives. Running the analytics inside the health system’s own environment means no protected health information has to be copied out, which matters for HIPAA-aligned work.
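To show what a common shape buys you, here is a small sketch that reads one lab result from a FHIR Observation resource using only the Python standard library. The resource is a hand-written example, not real patient data. The LOINC code is an illustrative placeholder, and the helper name lab_result is an assumption of this sketch. Real Observations vary more than this (for example, they may carry a period instead of a single timestamp), so treat it as a starting point.

```python
import json
from typing import Optional

# Hand-written example resource for illustration only.
observation_json = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {"coding": [{"system": "http://loinc.org", "code": "2160-0"}]},
  "subject": {"reference": "Patient/example-123"},
  "effectiveDateTime": "2024-03-05T09:30:00+00:00",
  "valueQuantity": {"value": 1.4, "unit": "mg/dL"}
}
"""


def lab_result(resource: dict) -> Optional[dict]:
    """Flatten a numeric Observation into one row, or return None."""
    if resource.get("resourceType") != "Observation":
        return None
    quantity = resource.get("valueQuantity")
    if quantity is None or "value" not in quantity:
        return None  # not a numeric result
    codings = resource.get("code", {}).get("coding", [])
    return {
        "patient": resource.get("subject", {}).get("reference"),
        "codes": [(c.get("system"), c.get("code")) for c in codings],
        "value": quantity["value"],
        "unit": quantity.get("unit"),
        # Kept as a string: FHIR allows partial dates such as "2024-03".
        "when": resource.get("effectiveDateTime"),
    }


row = lab_result(json.loads(observation_json))
print(row)
```

Because the same field names apply wherever the data is produced in this format, a function like this can run inside the health system’s environment against the records it already has.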

The definition matters more than the model

The machine learning that ranks and surfaces candidates matters, but the part that determines whether the output is trustworthy is the definition itself. If the criteria for “who counts” can drift between runs, the cohort isn’t reproducible. And in rare disease, where each candidate is a real person who may have waited years, that’s unacceptable.

That’s why we treat disease definitions as governed content, validated by key opinion leaders and versioned like code. More on that in “Why rare disease definitions belong under governance.”
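One way to make drift visible is to fingerprint the definition and store the fingerprint with every run. This is a minimal sketch of that idea, not a description of how any particular product does it. The dictionary layout, the field names, and the function definition_fingerprint are assumptions for the example.

```python
import hashlib
import json


def definition_fingerprint(definition: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical definition content.
v1 = {
    "name": "example-disorder",
    "version": "1.0.0",
    "lab_min": 200,
    "inclusion_dx": ["DX-SYMPTOM-A"],
}

# Same content, keys in a different order: the fingerprint is unchanged.
same_reordered = {
    "inclusion_dx": ["DX-SYMPTOM-A"],
    "lab_min": 200,
    "version": "1.0.0",
    "name": "example-disorder",
}

# A silently edited threshold: the fingerprint changes even though the
# version label was not bumped.
drifted = dict(v1, lab_min=180)

# Stored alongside the cohort so the run can be tied to an exact definition.
run_record = {
    "definition_version": v1["version"],
    "definition_sha256": definition_fingerprint(v1),
}

print(definition_fingerprint(v1) == definition_fingerprint(same_reordered))
print(definition_fingerprint(v1) == definition_fingerprint(drifted))
```

If two runs report different fingerprints, their cohorts were built from different criteria, whatever the version label says.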

Want to see this applied to a cohort in your data? Request a demo.

Sources

Estimating prevalence of rare genetic disease diagnoses using electronic health records. PMC.
HL7 International, Fast Healthcare Interoperability Resources (FHIR).

This article is for general educational purposes and is not medical advice.

Robert Smith, Co-founder & CTO

Sagacity Diagnostics, rare disease clinical decision support. Published May 6, 2026.
