Around 80% of clinical trials miss their enrollment timelines, and chart reading is the most-cited bottleneck. This writeup is a record of how I built a local-only screener that surfaces eligible patients hidden in unstructured EHR data, and what it would have meant for a real heart-failure trial.
Coordinator chart review compresses by roughly 25x, from around 50 minutes of reading a chart from scratch down to around 2 minutes of verifying cited evidence. Under that compression, PARADIGM-HF's three-year enrollment timeline collapses toward a single quarter.
The trial worth grounding this work in is PARADIGM-HF, the trial that established sacubitril/valsartan as the standard of care for heart failure with reduced ejection fraction. Enrollment ran from 2009 to 2012: 8,442 patients across 1,043 sites in 47 countries, three years to assemble the cohort. The drug is now prescribed to millions of patients globally, which means every year of enrollment delay was a year that the prior standard of care stayed on label.
Trials usually take this long because eligibility screening is slow. Around 80% of clinical trials miss their enrollment timelines, and screening is the most-cited reason. The published Phase III benchmark puts coordinator chart review at roughly 50 minutes per chart, with about 88% of the patients reviewed turning out not to qualify, which adds up to roughly 7.5 hours of reading time per enrolled patient. A coordinator at a busy HF clinic reads eight or nine charts, nearly a full working day of accumulated reading, before landing a single candidate worth approaching.
I built a screener that reads every chart locally and hands the coordinator a short queue with citations already attached. For each of the six note-dependent eligibility criteria, the model emits a verdict (PASS, FAIL, or UNKNOWN) along with a verbatim quote from the chart that justifies the verdict. Instead of reading the chart cold, the coordinator checks whether the cited quote actually means what the model says it means in the surrounding context, which works out to roughly a 2-minute exercise per chart.
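Concretely, the screener's per-note output is a small structured record. A minimal sketch of its shape, with illustrative field names rather than the exact schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(str, Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"

@dataclass
class CriterionResult:
    criterion_id: str        # e.g. "Incl3b" (HFrEF, EF <= 40%)
    verdict: Verdict
    citation: Optional[str]  # verbatim span copied from the note; None when UNKNOWN
    rationale: str           # one sentence for the reviewer to check against the chart

@dataclass
class NoteScreenResult:
    note_id: str
    criteria: list[CriterionResult]  # one entry per note-dependent criterion (six here)
```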
That 25x compression applies to the part of the trial calendar that was binding. The incompressible parts (the run-in period, the consent visit, the regulatory ramp at each site) all stay where they are, since none of them can be replaced by software. Once chart reading stops being the constraint, though, the three-year enrollment window collapses toward the consent-and-run-in floor, which lands closer to a single quarter than to three years.
Faster trials reach patients faster. The drug PARADIGM-HF tested is now the recommended first-line therapy for HFrEF, and finishing enrollment two years earlier would have meant two more years where doctors could have prescribed the better drug to the people in front of them. That is the Good worth building toward.
The rest of this writeup walks through how the screener is built and what happened when I ran it across the cohort.
The privacy posture for this project doubles as the product wedge for the version of it that ships into hospitals.
MIMIC-IV is credentialed-access data governed by PhysioNet's Credentialed Health Data Use Agreement. The agreement allows reading the data on credentialed user machines, but it forbids posting the text anywhere public, shipping derived datasets that contain notes, or handing the text to multi-tenant LLM APIs. That last clause puts the public OpenAI and Anthropic endpoints off the table for any inference call that includes note text.
That constraint shapes everything downstream. Inference happens inside the same controller boundary as the data, which means locally on Apple Silicon for development and on single-tenant H100 instances inside my own GCP project for batch runs. The H100 is treated as an extension of the local environment for the duration of the run, then torn down afterwards. No MIMIC text ever crosses a public surface in either direction.
The same constraint is also the product wedge for the version of this tool that ships into hospitals. A hospital IT team cannot sign a SaaS agreement that ships chart text to a third-party LLM provider, because they don't have the contractual posture to authorize that flow. A screener that runs locally on the hardware footprint they already manage is the only version of the tool that gets through procurement.
The structured EHR tables come from MIMIC-IV, and the ~331,000 discharge summaries come from MIMIC-IV-Note, both under the per-user credentialed access that PhysioNet grants. Every BigQuery call against either dataset goes through a single gated query interface, which statically analyses the query and restricts text-consuming patterns to bounded predicates like lengths, regex matches, and aggregate counts. Any query that would return raw note rows gets refused at the gate, so no note text leaves the project even by accident.
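A crude sketch of what that gate does, assuming a deny-list over the raw-text column and a small allow-list of bounded functions (the real interface does proper static analysis; the column and function names here are assumptions):

```python
import re

NOTE_TEXT_COLUMNS = {"text", "note_text"}  # assumed names of the raw note-text columns
BOUNDED_FUNCTIONS = ("length(", "regexp_contains(", "count(")  # text use allowed only inside these

def gate_query(sql: str) -> str:
    """Refuse any query that could return raw note rows; pass through bounded predicates."""
    lowered = sql.lower()
    for col in NOTE_TEXT_COLUMNS:
        for match in re.finditer(rf"\b{col}\b", lowered):
            # The text column may only appear wrapped by an allowed bounded function.
            preceding = lowered[max(0, match.start() - 40):match.start()]
            if not any(fn in preceding for fn in BOUNDED_FUNCTIONS):
                raise PermissionError(f"refused: query touches raw note text via '{col}'")
    return sql  # safe to hand to the BigQuery client
```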
The code, SQL, prompts, and schemas are all open-source, and the DUA actually requires that openness for any associated publication. The data those artifacts touch is what stays inside the boundary.
PARADIGM-HF enrolled through outpatient cardiology clinics, but MIMIC-IV is built from inpatient hospital data, so the first task was bridging that setting gap.
PARADIGM-HF enrolled chronic stable HFrEF patients on optimal medical therapy through outpatient cardiology clinics. MIMIC-IV records the other side of cardiac care, namely admissions, discharges, and ICU stays. I bridge the gap by screening at discharge, because the discharge summary documents the patient's chronic baseline before they go home. A flagged candidate would then get referred for outpatient enrollment after stabilization, rather than enrolled directly from the inpatient stay.
PARADIGM-HF enrolled between 2009 and 2012. MIMIC-IV exposes an anchor-year-group field at the patient level, and I accept the two groups that overlap the trial window. Of 56,676 adult heart-failure admissions across all eras, 41,641 (74%) sit inside the trial window, covering 14,517 unique patients. After applying a structured cohort gate that excludes documented angioedema, 14,430 patients remain as candidates for the LLM screen.
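As a sketch, the era filter is a join against the patient-level anchor-year-group field. The group labels and BigQuery table paths below are my assumptions about the dataset layout, and the HF diagnosis filter is omitted for brevity:

```python
from google.cloud import bigquery

# Anchor-year groups assumed to overlap the 2009-2012 trial window.
TRIAL_ERA_GROUPS = ["2008 - 2010", "2011 - 2013"]

ERA_COHORT_SQL = """
SELECT a.subject_id, a.hadm_id
FROM `physionet-data.mimiciv_hosp.admissions` AS a
JOIN `physionet-data.mimiciv_hosp.patients`  AS p USING (subject_id)
WHERE p.anchor_year_group IN UNNEST(@era_groups)
"""

client = bigquery.Client()
job = client.query(
    ERA_COHORT_SQL,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("era_groups", "STRING", TRIAL_ERA_GROUPS)
        ]
    ),
)
era_admissions = {(row.subject_id, row.hadm_id) for row in job}
```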
ARNI exposure during this window is around 1%, which collapses the contamination question. The patients being surfaced by the screener were not already on the drug PARADIGM-HF was testing.
PARADIGM-HF specifies 14 eligibility criteria, and most of them are not the LLM's job. The trial team handles current-state facts during the screening visit (fresh labs, decompensation assessment, hypotension), along with regimen reconciliation, allergy interview, and prior-intolerance history. Structured SQL handles age, recorded NT-proBNP values, and a hard gate on documented angioedema. That leaves the LLM with the six criteria that require reading notes: NYHA class, LVEF, NT-proBNP cited in note text, ACEi/ARB tolerance, β-blocker tolerance, and angioedema history mentioned in notes.
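The same division of labour, written down as a structure (the grouping follows the description above; this is an illustration, not the project's configuration file):

```python
# How the eligibility criteria are triaged across handlers (illustrative grouping).
CRITERIA_BY_HANDLER = {
    "screening_visit": [   # current-state facts the trial team checks in person
        "fresh labs", "decompensation assessment", "hypotension",
        "regimen reconciliation", "allergy interview", "prior-intolerance history",
    ],
    "structured_sql": [    # answered from structured tables, no note text needed
        "age", "recorded NT-proBNP values", "documented angioedema (hard gate)",
    ],
    "llm_note_read": [     # the six criteria that require reading discharge notes
        "NYHA class", "LVEF", "NT-proBNP cited in note text",
        "ACEi/ARB tolerance", "beta-blocker tolerance", "angioedema history in notes",
    ],
}
```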
NYHA class is the headline criterion. Explicit NYHA labels are rare in HF discharge notes, but the functional content (dyspnea on exertion, orthopnea, paroxysmal nocturnal dyspnea, exercise tolerance) is everywhere. Deriving the class from narrative is exactly the work this tool exists to do.
Cohort-construction SQL never touches note text, since text-consuming patterns are refused at the gate described in the previous chapter. Materialized cohort tables live inside the GCP project, and the only artifact that ever ships is aggregate metrics.
The eval set is 400 synthetic discharge summaries with verdicts known by construction, generated to look like the real corpus they are modelled on.
Evaluating an eligibility screener requires labeled examples, namely notes paired with the verdict each criterion should produce. Hand-labeling real MIMIC notes against six criteria is slow, burns clinical-reviewer time, and produces labels that have to stay inside the controller boundary along with the source notes. The labels cannot appear in any public artifact, which makes them awkward to share across collaborators and impossible to publish.
Synthetic notes get around both problems, provided they are realistic enough that a screener tuned on them generalises to real cohort notes. The build is a two-pass pipeline. The first pass analyses the real corpus and summarises it into reference documents, and the second pass uses those references to generate synthetic notes whose verdicts are known by construction.
The corpus-analysis pass runs a map-reduce job over 200 cohort discharge notes using Gemma 4 31B. Each note gets one call in the map step, producing structured observations across sections, quirks, and per-criterion conveyance. The reduce step folds those observations into 8 reference documents: a note profile, a quirk catalog, and one conveyance document for each LLM-handled criterion. The outputs are aggregate descriptive prose, with no quoted spans or patient details. The full pass takes around 45 minutes on a single H100, which works out to roughly $3 on a spot VM.
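A minimal sketch of the map-reduce shape, assuming the model is served behind a local OpenAI-compatible vLLM endpoint; the prompts and the model id are placeholders, not the project's actual artifacts:

```python
from openai import OpenAI

# Local vLLM server exposing an OpenAI-compatible endpoint; model id is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "gemma-31b-it"

MAP_PROMPT = (
    "Describe this discharge summary's section order, common abbreviations, and how each "
    "of the six eligibility criteria is conveyed. Do not quote any spans or patient details."
)

def map_note(note_text: str) -> str:
    """Map step: one call per cohort note, producing aggregate observations."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": MAP_PROMPT + "\n\n" + note_text}],
    )
    return resp.choices[0].message.content

def reduce_to_reference(observations: list[str], doc_name: str) -> str:
    """Reduce step: fold per-note observations into one reference document."""
    prompt = f"Synthesise these observations into the '{doc_name}' reference document:\n\n"
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt + "\n\n---\n\n".join(observations)}],
    )
    return resp.choices[0].message.content
```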
The reference documents give the generator a lot to work with. They describe how sections are typically ordered, which abbreviations are common, how clinicians convey NYHA class when they don't write the literal label, and which idioms signal "tolerated for years" versus "intolerant after one dose."
The initial run was 35 cases, each with a verdict tuple (six criteria, each one of PASS, FAIL, or UNKNOWN) and a target note count. For each case, a Sonnet subagent generates synthetic discharge summaries and an Opus subagent validates them against the spec. A case is considered done only when the validator returns case_pass: true.
Validation catches subtle drift that the generator introduces. A clinician's pen slip ("no angioedema" written into a drug-tolerance summary) flips the safety-exclusion verdict, and a post-2014 term like "HFmrEF" breaks the era. Failed cases regenerate from scratch, since whole-case retry preserves the multi-turn cache benefit and avoids stitching together notes from different generation runs.
Some specs turned out to be infeasible. Standard discharge templates carry inherent verdict signals, in the sense that a complete physical exam will mention edema and a complete review of systems will document or deny shortness of breath. Writing a non-cardiac admission with NYHA=UNKNOWN means asking the generator to omit standard documentation in a way that itself looks abnormal, which is worse than letting the note imply a verdict that diverges from the original spec.
The rule I settled on is that if the validator reads the note and produces a verdict that diverges from the spec, the validator wins and the note's verdict becomes the gold label. The insight that ground truth should follow the note rather than the design intent was what got four stalled cases across the line.
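Structurally, the case loop looks roughly like the sketch below. `generate_notes` and `validate_case` stand in for the Sonnet and Opus subagents, and the `spec_infeasible` / `observed_verdicts` fields are my shorthand for the validator-wins rule, not the actual report format:

```python
from typing import Callable

def build_case(spec: dict,
               generate_notes: Callable[[dict], list[str]],
               validate_case: Callable[[list[str], dict], dict],
               max_attempts: int = 3) -> dict:
    """Generate a synthetic case, validate it, and let the validator set the gold labels."""
    for _ in range(max_attempts):
        notes = generate_notes(spec)          # generator subagent writes the discharge summaries
        report = validate_case(notes, spec)   # validator subagent re-reads them against the spec
        if report["case_pass"]:
            return {"notes": notes, "gold": spec["verdicts"]}
        if report.get("spec_infeasible"):
            # The notes are coherent but imply verdicts that diverge from the spec:
            # ground truth follows the note, so adopt the validator's verdicts as gold.
            return {"notes": notes, "gold": report["observed_verdicts"]}
        # Otherwise the drift is a real defect (era break, pen slip): regenerate from scratch.
    raise RuntimeError("case failed validation after retries")
```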
The final silver set is 38 cases and 400 notes, with every case passing validation. The label distribution matches real-cohort prevalence: NT-proBNP citations and angioedema mentions are dominated by UNKNOWN (both are rare in real notes), and the other criteria spread across PASS, FAIL, and UNKNOWN. All 18 cells in the criterion-by-verdict matrix are populated.
The frozen test eval lands at 97% verdict accuracy and 100% citation faithfulness, both measured on a held-out split of 200 notes that the prompt iteration never touched.
The held-out test split is 200 notes, kept out of every prompt-tuning loop. Each note crossed with the six criteria produces 1,200 verdicts, and each verdict gets scored on two axes: whether it matches the gold label, and whether every emitted citation appears verbatim in the source note.
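Scoring one (note, criterion) cell is mechanical; a minimal sketch of the two checks, with illustrative names:

```python
def score_cell(predicted: str, gold: str, citations: list[str], note_text: str) -> dict:
    """Score one (note, criterion) verdict on the two eval axes."""
    return {
        # Axis 1: does the emitted verdict match the gold label from the silver set?
        "verdict_correct": predicted == gold,
        # Axis 2: does every emitted citation appear verbatim in the source note?
        "citations_faithful": all(quote in note_text for quote in citations),
    }
```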
Out of the box, a strong model with simple prompts hits roughly 98% verdict accuracy on this set. The bake-off in an earlier session selected Gemma 4 E4B-it served by vLLM with MTP speculative decoding, running 8.3 times faster than the 31B baseline at the same accuracy. Per-label v_acc was 97.9%, and the all-six-criteria-in-one-call mode landed at 98.5%. The task itself is tractable from a verdict-accuracy standpoint.
The catch is that E4B is about 16 GB. Apple Silicon laptops with 16 GB of unified memory cannot run it alongside an operating system and a browser, and the product story I want (local-only inference on the same Mac the coordinator already uses) needs the model to fit in roughly half that footprint. The eval work therefore shifted to E2B, which weighs in around 8 GB.
Naive prompts that produced 98% on E4B produced 94.75% on E2B at the per-label level, with two catastrophic failures appearing in all-six mode. The worse of the two was Excl4 (angioedema history), which collapsed to 41% v_acc because the model treated "no angioedema mentioned" as PASS rather than UNKNOWN. The "silence is not a verdict" rule was buried mid-paragraph, and the smaller model lost track of it when six criteria competed for attention inside the same prompt.
The fix was a "bulletproof" rewrite applied per criterion. Each criterion gets mechanical decision rules with explicit "either path alone is sufficient" framing, step-by-step decision flows for compound criteria, and loud anti-patterns for the rules that the small model breaks. Per-label v_acc rose from 94.75% to 98.34%, and Excl4 in all-six mode jumped from 41% to 98%. After the rewrite, E2B with bulletproof prompts matches E4B with naive ones, which means the small-model penalty closes at the prompt layer rather than at the weights.
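For a feel of what "bulletproof" means in practice, here is the shape the Excl4 block moved toward. This is an illustration in the spirit of the rewrite, not the actual prompt text:

```python
EXCL4_RULES = """\
CRITERION Excl4 - history of angioedema

DECISION FLOW (follow in order, stop at the first match):
1. The note states the patient has had angioedema        -> PASS (exclusion met)
2. The note explicitly denies any history of angioedema  -> FAIL (exclusion cleared)
3. The note does not mention angioedema at all           -> UNKNOWN

ANTI-PATTERN:
- Silence is NOT a verdict. "No mention of angioedema" is UNKNOWN, never PASS and never FAIL.
"""
```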
A counterintuitive finding came out of the same phase. Stripping the worked examples from the criterion files actually improved both accuracy and citation faithfulness. The model had been reciting example phrases verbatim as fabricated citations on test notes that did not contain them, so the few-shot benefit washed out on the small model while the recitation cost was real.
In the production loop, every PASS or FAIL flows to a human reviewer who reads the cited substring against the criterion. UNKNOWN verdicts route to a full-note reviewer instead. The two failure modes carry asymmetric downstream costs.
When the model produces a wrong verdict but cites a real verbatim quote, the reviewer reads the cited string, sees that it does not actually meet the criterion, corrects the verdict, and moves on. There is no clinical harm, only a small amount of reviewer time. The dangerous case is when the model produces a confident PASS but invents the supporting quote. The reviewer goes to the chart looking for evidence that does not exist, and either has to redo the entire review from scratch or, worse, approves the patient on the strength of a plausible-looking fabrication.
That asymmetry shifts the production target. Wrong verdicts get caught when a reviewer reads the cited quote and sees that it does not actually meet the criterion. Invented citations do not get caught the same way, since the reviewer is reading what the model wrote rather than what the chart says. The production posture therefore accepts some verdict noise and refuses all citation fabrication.
I inspected the unfaithful citations on E2B+MTP all-six and found two categories. Around 70% of the failures were cosmetic drift, with deidentification placeholders normalized, first-letter capitalization applied to quotes that begin mid-sentence, internal whitespace collapsed, and smart-quote substitution. The other 30% was structural fabrication, including invented suffixes on real prefixes, cross-region concatenation, and drug-name swaps.
The cosmetic class is fully recoverable with a deterministic post-process. For each emitted quote, the post-process tries three strategies: a lowercase substring match, a length-anchored slice from the closest contiguous match, and the longest verbatim sub-match. Any quote that can be mapped back to a note span with similarity above 0.85 gets rewritten to its verbatim form, and anything below the threshold gets refused. Raw citation faithfulness on the test set rose from 92.25% to 98.92% after this pass.
The structural-fabrication remainder is the model floor, and abstention is the right answer for it. If the post-process refuses to rewrite at least one quote on a row, the verdict gets demoted to UNKNOWN with no quotes attached and a "dropped for unverifiable citation" flag. The row routes to the full-note reviewer queue rather than the citation-anchored review path, which converts the dangerous failure mode (a fabricated PASS) into the safe one (an UNKNOWN that costs reviewer time but never misenrolls).
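A sketch of the rewrite-or-refuse pass plus the demotion policy, using difflib for the similarity score. The strategy details are simplified relative to the real post-process, and the field names are illustrative:

```python
from difflib import SequenceMatcher
from typing import Optional

SIM_THRESHOLD = 0.85

def best_verbatim_match(quote: str, note_text: str) -> Optional[str]:
    """Try to map an emitted quote back to a verbatim span of the note."""
    # Strategy 1: case-insensitive exact substring.
    idx = note_text.lower().find(quote.lower())
    if idx >= 0:
        return note_text[idx:idx + len(quote)]
    # Strategies 2 and 3 (length-anchored slice, longest sub-match) both start from the
    # closest contiguous block found by SequenceMatcher.
    sm = SequenceMatcher(None, note_text, quote, autojunk=False)
    block = sm.find_longest_match(0, len(note_text), 0, len(quote))
    if block.size == 0:
        return None
    start = max(0, block.a - block.b)                  # length-anchored slice
    candidate = note_text[start:start + len(quote)]
    if SequenceMatcher(None, candidate, quote).ratio() >= SIM_THRESHOLD:
        return candidate
    return None                                        # refuse: structural fabrication

def apply_citation_policy(row: dict, note_text: str) -> dict:
    """Rewrite cosmetic drift to verbatim text; demote anything unverifiable to UNKNOWN."""
    rewritten = [best_verbatim_match(q, note_text) for q in row["citations"]]
    if any(m is None for m in rewritten):
        return {**row, "verdict": "UNKNOWN", "citations": [],
                "flag": "dropped for unverifiable citation"}
    return {**row, "citations": rewritten}
```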
The frozen test eval with the full pipeline applied:
| criterion | n   | v_acc | s_faith | macroF1 |
|-----------|-----|-------|---------|---------|
| Incl3a    | 200 | 96.5% | 100.0%  | 0.87    |
| Incl3b    | 200 | 98.5% | 100.0%  | 0.97    |
| Incl3c    | 200 | 96.0% | 100.0%  | 0.92    |
| Incl4     | 200 | 94.5% | 100.0%  | 0.87    |
| Incl5     | 200 | 99.0% | 100.0%  | 0.98    |
| Excl4     | 200 | 97.5% | 100.0%  | 0.95    |
| avg       |     | 97.0% | 100.0%  | 0.93    |
Verdict accuracy averages 97.0%, citation faithfulness is 100% by construction, and macro-F1 is 0.93. The configuration runs in roughly half the memory of the strong-model baseline while delivering a citation guarantee that the strong-model baseline never offered. This is the screener that runs on the cohort.
Every adult heart-failure admission in MIMIC-IV's trial window goes through the screener in a single 6h 35m run on one H100, at $23 of spot compute.
The eval establishes the screener's behaviour on synthetic notes. The cohort run extends that to real ones, processing every adult HF admission in MIMIC-IV's trial window under the same locked configuration.
The pre-note cohort holds 14,517 unique subjects, contributing 41,208 discharge summaries across their era admissions. The screener configuration is the one I froze at the end of the eval chapter: Gemma 4 E2B-it served by vLLM with MTP speculative decoding, all six criteria emitted in a single call, with the citation post-process and the drop-to-UNKNOWN abstention applied at write time.
Steady-state throughput was 1.6 calls per second on a single H100 with 16-way concurrency. The full run took 6h 35m of wallclock time and cost roughly $23 in spot compute. Parse failures landed at 21 out of 41,208 calls (0.05%), and I did not retry them, so those cells are absent from the output. The model consumed 322 M prompt tokens and emitted 61 M completion tokens.
Note-grain counts after the citation post-process and the drop-to-UNKNOWN policy:
| criterion | n      | PASS   | FAIL   | UNK    | rewritten | dropped |
|-----------|--------|--------|--------|--------|-----------|---------|
| Incl3a    | 38,027 | 21,805 | 515    | 15,707 | 7,071     | 1,314   |
| Incl3b    | 38,027 | 7,520  | 17,674 | 12,833 | 3,118     | 1,582   |
| Incl3c    | 38,027 | 7,621  | 1,061  | 29,345 | 189       | 688     |
| Incl4     | 38,027 | 14,066 | 5,983  | 17,978 | 3,494     | 586     |
| Incl5     | 38,027 | 22,851 | 3,754  | 11,422 | 7,663     | 476     |
| Excl4     | 38,025 | 202    | 178    | 37,645 | 78        | 6       |
A few patterns are worth flagging. Incl3b (HFrEF, EF ≤ 40%) is the largest source of explicit FAIL, which is the expected pattern, because a real-world HF population contains a lot of HFpEF and the model correctly separates them. Incl3c (NT-proBNP elevated) is dominated by UNK, also as expected, because discharge summaries rarely print BNP values in the narrative even when the lab is in the structured record. Excl4 (history of angioedema) is 99% UNK, which is structural and worth a section of its own.
I roll up note-grain verdicts to per-patient verdicts by taking any-PASS over the patient's notes, then any-FAIL, then UNKNOWN. Applying the trial's logical structure (all five inclusions PASS and the exclusion FAIL) gives the headline:
|                 | patients |
|-----------------|----------|
| pre-note cohort | 14,517   |
| screened        | 14,429   |
| ELIGIBLE        | 23       |
| INELIGIBLE      | 9,146    |
| INDETERMINATE   | 5,260    |
The 88 unscreened patients reflect cases where every note parse-failed or fell outside the materialized cohort vintage. That gap sits at 0.6% at the patient level.
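The rollup and eligibility logic described above, as a sketch. The precedence between a definitive fail and an UNKNOWN is my assumption about the implementation, not something the writeup pins down:

```python
INCLUSIONS = ("Incl3a", "Incl3b", "Incl3c", "Incl4", "Incl5")
EXCLUSION = "Excl4"

def rollup_criterion(verdicts: list[str]) -> str:
    """Per-patient verdict for one criterion: any-PASS, then any-FAIL, else UNKNOWN."""
    if "PASS" in verdicts:
        return "PASS"
    if "FAIL" in verdicts:
        return "FAIL"
    return "UNKNOWN"

def classify_patient(per_criterion: dict[str, list[str]]) -> str:
    """Apply the trial's logical structure to the rolled-up criterion verdicts."""
    rolled = {c: rollup_criterion(v) for c, v in per_criterion.items()}
    # Assumed precedence: a definitive fail beats a lingering UNKNOWN.
    if any(rolled[c] == "FAIL" for c in INCLUSIONS) or rolled[EXCLUSION] == "PASS":
        return "INELIGIBLE"
    if any(rolled[c] == "UNKNOWN" for c in (*INCLUSIONS, EXCLUSION)):
        return "INDETERMINATE"
    return "ELIGIBLE"  # all five inclusions PASS and the exclusion FAIL
```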
5,260 patients land in INDETERMINATE because at least one criterion is UNKNOWN. The most useful drill into that bucket is the 997 patients with a single UNKNOWN, who sit one chart review away from a verdict. Of those 997, 990 are blocked solely on Excl4, and the remaining 7 are blocked on Incl3c.
The Excl4 result is structural rather than a screener weakness. The pre-note cohort SQL already excludes patients with ICD-coded angioedema, so every patient reaching the LLM has already cleared the structured angioedema check. The LLM's Excl4 evaluation is therefore a verification pass over discharge text that, by base rate, does not restate the absence of a rare prior reaction.
Recognising that overlap and letting "structured filter passed, Excl4 UNK in note" count as exclusion-cleared shifts the rollup:
|                 | patients | detail                                          |
|-----------------|----------|-------------------------------------------------|
| ELIGIBLE        | 1,013    | 23 strict + 990 single-UNK on Excl4             |
| BNP review only | 7        | single UNK on Incl3c; one chart check resolves  |
| INDETERMINATE   | 4,263    |                                                 |
The strict 23 is the conservative headline. The relaxed 1,013 is the realistic operational picture once the structural redundancy between the SQL filter and the LLM is acknowledged.
The screener reads discharge summary text. Clinic notes, echo reports, current-state lab values, medication lists, and allergy intake forms are not inputs to this run. PARADIGM-HF's screening visit covers allergy interview, regimen reconciliation, fresh labs, and decompensation assessment, and the trial's later exclusions live there. The verdicts in this chapter cover the note-screenable subset of the eligibility criteria rather than the complete determination.
The cohort run is an existence proof at one academic centre, reading one note type, taking 6 hours of GPU time. Scaled across PARADIGM-HF's actual deployment footprint, the three-year enrollment timeline collapses toward a single quarter.
Chapter 00 makes the headline claim that chart review compresses by roughly 25x and the trial calendar collapses with it. This chapter walks the math behind that claim and lays out the assumptions it depends on.
The Phase III chart-screening benchmark comes from Penberthy 2012, Table 3, with three figures that matter here: roughly 50 minutes per chart, about 8 charts screened per enrolled patient, and a total of roughly 7.5 hours of coordinator chart-review time per enrolled patient. Most of that burden is driven by the 88% non-qualify rate, since coordinators are spending most of their reading time on patients who turn out not to qualify.
The screener takes over the unguided 50-minute read. The coordinator no longer searches a chart for evidence, because the model has already produced a verdict for each criterion along with a verbatim quote from the chart that supports it. The coordinator's work shifts to verifying that the cited quote means what the model says it means in the context of the rest of the chart, which is roughly a 2-minute exercise per candidate. That ratio is the 25x compression on the dominant cost.
The 88% reject rate also disappears from the coordinator's day, because rejected charts never reach her queue. The model surfaces only candidates with citable evidence, so a coordinator who previously spent 7.5 hours of chart review to land one enrollment now spends roughly 20 minutes of verification time per enrolled patient.
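The arithmetic behind those figures, spelled out (the 2-minute verification time is an estimate, not a measured figure):

```python
MANUAL_MIN_PER_CHART  = 50    # Penberthy 2012: unguided chart read
VERIFY_MIN_PER_CHART  = 2     # estimated time to check cited quotes in context
MANUAL_HRS_PER_ENROLL = 7.5   # Penberthy 2012: chart-review time per enrolled patient

per_chart_compression = MANUAL_MIN_PER_CHART / VERIFY_MIN_PER_CHART        # 25x
charts_per_enroll     = MANUAL_HRS_PER_ENROLL * 60 / MANUAL_MIN_PER_CHART  # ~9 charts
verify_min_per_enroll = charts_per_enroll * VERIFY_MIN_PER_CHART           # ~18 min, "roughly 20"
```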
The cohort run from the previous chapter is the existence proof at one academic centre. The screener evaluated 14,517 patients across 41,208 discharge summaries in 6 hours of GPU for $23, and surfaced 23 strict and 1,013 relaxed candidates with verbatim citations attached.
PARADIGM-HF's per-country enrollment target was roughly 180 patients. The single-site MIMIC slice already surfaces five times that target as a pre-screened candidate list, from one academic centre reading one note type in one cohort window. The screener can afford to look at every patient in the eligible pool, because reading is no longer the expensive step. Most of the 14,517 patients are not candidates, and that is acceptable in this regime, since the cost of evaluating a non-candidate is negligible.
Discharge summaries are one note type among many in a hospital's EHR. The same record will also carry admission histories, daily progress notes, cardiology consult notes, echo reports, allergy intake forms, and outpatient clinic notes. The criteria that performed worst in this run (Excl4 at 99% UNK and Incl3c at 77% UNK) are exactly the criteria that those other note types are written to document. Echo reports record ejection fraction. Allergy intake records prior drug reactions and the reasons for discontinuation. Cardiology clinic notes carry longitudinal NYHA assessments more reliably than any single discharge summary.
Adding those channels removes the need for the structural-overlap relaxation that gets us to 1,013. Excl4 gets answered directly from allergy intake, BNP gets answered from the lab-aware cardiology note, and NYHA gets corroborated across visits. Once those data sources feed the screener, the strict and the relaxed numbers converge upward rather than staying split.
PARADIGM-HF's three-year enrollment timeline reflects coordinator throughput that could not outpace patient flow at the sites. Once the chart-reading bottleneck is stripped, what remains is the incompressible floor: a 4-8 week run-in period per patient, a 1-2 week consent-and-screening visit, and the per-site regulatory ramp that takes its own time. None of those can be replaced by software.
With pre-screened candidate queues sitting in the hundreds at each site at trial kickoff, and the screener re-reading new admissions on a weekly cadence afterwards, the funnel fills faster than coordinators can drain it. Per-site enrollment becomes a parallel-coordinator problem rather than a chart-reading problem. An enrollment campaign sized like PARADIGM-HF finishes in single-digit months under that regime, with a single quarter sitting at the aggressive end of the defensible range.
The Penberthy figures come from oncology trials, and transferring them to cardiovascular trials is a benchmark assumption rather than a measured fact. The reading work itself is similar enough across therapeutic areas that the cross-application is reasonable, but the exact numbers may shift.
Candidate-list size and enrolled-patient count are also two different things. Candidates decline, fail the screening visit, cannot tolerate the run-in, or fail to meet a criterion that the discharge note did not actually surface. Standard trial attrition is large, and this work removes one source of friction without removing all of them.
MIMIC-IV is a tertiary academic centre with a sick patient base. HFrEF prevalence per admission is higher there than at most community hospitals, and the 1,013 candidate figure scales down at smaller sites.
The 25x per-chart compression is an estimate of coordinator verification time on cited evidence, not a production-measured figure.
PARADIGM-HF showed a 20% reduction in cardiovascular death or HF hospitalization versus the prior standard of care. The drug it tested is now prescribed to millions of patients globally and is the recommended first-line therapy for HFrEF.
The trial finished enrolling in 2012, the results were published in 2014, and practice shifted shortly after. Three years of enrollment is three years that patients waited for the new standard of care. Compressing that to a quarter does not change what the trial finds, but it changes how quickly better treatment reaches the patients who need it.
The cost story matters too. Per-enrolled-patient screening cost in the published benchmark works out to roughly $290 (Penberthy 2012, Table 3), which puts PARADIGM-HF's total screening cost at roughly $2.4M across the trial. The screener's marginal cost is the GPU time, which came to $23 for a complete cohort run at one site, with the rest of the savings accruing to the coordinator workforce.
Faster trials at lower cost mean more trials get run, more drugs get tested, and the trials that move the standard of care reach patients sooner. That is the work this project is built around, applied at scale.