Ripple Labs
Evaluation harness.

A continuous evaluation system that turned a brittle medical-coding model into something a hospital would actually deploy.

Ripple Labs · Evals · Infra

✺ — The problem

Ripple's medical-coding model was 91% accurate on the benchmark and 73% accurate in a hospital. The team kept shipping retrains that improved one and regressed the other. They needed evals that reflected real clinical distributions — not whatever Kaggle had lying around.

Sector

Healthcare AI

Year

2025

Duration

11 weeks

Team

1 Principal · 1 ML Engineer · 1 Data Engineer

Stack

Python · Weights & Biases · BigQuery · dbt · GitHub Actions

✺ — Approach

The same arc as every engagement — tuned to this problem.

01

Define · Real distribution, not benchmark

Over six weeks we built a labeling pipeline staffed by two coders and one auditor. Every test case mirrored real-world prevalence: rare codes stayed rare and common codes stayed common. The benchmark stopped being the number that mattered.
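One way to picture "mirroring real-world prevalence" is a proportional draw from production records. The sketch below is illustrative only and is not the pipeline described here: the record shape (a dict with a `code` field), the function name `prevalence_matched_sample`, and the largest-remainder rounding are all assumptions made for the example.

```python
import random
from collections import defaultdict


def prevalence_matched_sample(records, n, seed=0):
    """Draw n records so each code's share of the sample matches its share of `records`.

    `records` is assumed to be a list of dicts with a "code" key (hypothetical shape).
    """
    if n > len(records):
        raise ValueError("cannot sample more records than are available")

    rng = random.Random(seed)

    # Group records by their code.
    by_code = defaultdict(list)
    for rec in records:
        by_code[rec["code"]].append(rec)

    # Ideal fractional quota per code, proportional to observed prevalence.
    total = len(records)
    quotas = {code: n * len(items) / total for code, items in by_code.items()}

    # Round down, then give the leftover slots to the largest remainders.
    alloc = {code: int(q) for code, q in quotas.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for code in by_remainder[:leftover]:
        alloc[code] += 1

    # Sample without replacement inside each code, then shuffle the result.
    sample = []
    for code, k in alloc.items():
        sample.extend(rng.sample(by_code[code], k))
    rng.shuffle(sample)
    return sample
```

With a small `n`, very rare codes can round down to zero examples, so in practice the sample size has to be large enough for the rarest slice anyone wants to report on.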

02

Build · A harness, not a script

Continuous evaluation runs on every model checkpoint. Slice-level metrics — by specialty, by hospital, by code rarity — make regressions impossible to hide behind an aggregate number.
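To make the slice idea concrete, here is a minimal sketch of per-slice accuracy. It assumes each evaluated example is a dict carrying `label`, `predicted`, and the three slice fields named above; those field names and the `slice_accuracy` helper are hypothetical, not taken from the actual harness.

```python
from collections import defaultdict

# Slice dimensions named in the text; the field names are assumptions.
SLICE_KEYS = ("specialty", "hospital", "rarity")


def slice_accuracy(examples, slice_keys=SLICE_KEYS):
    """Return accuracy overall and for every (dimension, value) slice."""
    counts = defaultdict(lambda: [0, 0])  # (dimension, value) -> [correct, total]

    for ex in examples:
        hit = int(ex["predicted"] == ex["label"])

        # The aggregate number, kept alongside the slices for comparison.
        counts[("overall", "all")][0] += hit
        counts[("overall", "all")][1] += 1

        # One counter per slice value in each dimension.
        for key in slice_keys:
            cell = counts[(key, ex[key])]
            cell[0] += hit
            cell[1] += 1

    return {name: correct / total for name, (correct, total) in counts.items()}
```

A checkpoint that gets nine common-code cases right and one rare-code case wrong scores 0.9 overall while its `("rarity", "rare")` slice scores 0.0, which is exactly the kind of gap an aggregate hides.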

03

Operate · Production canary on every release

A 1% shadow-traffic canary runs against every new model for 72 hours. If any clinical slice regresses by more than a configurable threshold, the rollout halts automatically.
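The halting rule can be sketched as a comparison of per-slice accuracy between the current model and the candidate. This is a simplified illustration under stated assumptions: the `max_drop` default of 0.02 is a placeholder rather than the threshold used in this engagement, the function names are hypothetical, and a slice missing from the candidate's results is treated as a failure by choice.

```python
def regressed_slices(baseline, candidate, max_drop=0.02):
    """Return the slices whose accuracy fell by more than `max_drop`.

    `baseline` and `candidate` map slice name -> accuracy in [0, 1].
    `max_drop` is an assumed placeholder for the configurable threshold.
    """
    failures = {}
    for name, base_acc in baseline.items():
        cand_acc = candidate.get(name)
        if cand_acc is None:
            # No canary data for this slice: fail closed instead of passing silently.
            failures[name] = None
            continue
        drop = base_acc - cand_acc
        if drop > max_drop:
            failures[name] = drop
    return failures


def should_halt(baseline, candidate, max_drop=0.02):
    """True if any slice regressed past the threshold, so the rollout should stop."""
    return bool(regressed_slices(baseline, candidate, max_drop))
```

Checking every slice separately means an improvement in one specialty cannot offset a regression in another, which is the point of gating on slices instead of on the aggregate.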

✺ — Outcome

Three numbers we’d defend in public.

+11pp

improvement in real-world slice accuracy

100%

of regressing releases caught before prod

2 weeks

the team's release cycle — down from 6

They didn't touch our model for the first four weeks. They built the thing that let us see what our model was actually doing — and after that, the wins were obvious.

Head of ML, Ripple Labs