Ripple Labs
Evaluation harness.
A continuous evaluation system that turned a brittle medical-coding model into something a hospital would actually deploy.
✺ — The problem
Ripple's medical-coding model was 91% accurate on the benchmark and 73% accurate in a hospital. The team kept shipping retrains that improved one and regressed the other. They needed evals that reflected real clinical distributions — not whatever Kaggle had lying around.
Sector: Healthcare AI
Year: 2025
Duration: 11 weeks
Team: 1 Principal · 1 ML Engineer · 1 Data Engineer
✺ — Approach
The same arc as every engagement — tuned to this problem.
Define · Real distribution, not benchmark
Over six weeks we built a labeling pipeline staffed by two coders and one auditor. The resulting test set mirrored real-world prevalence: rare codes were rare, common codes were common. The benchmark score stopped being the number that mattered.
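As a rough illustration of what "mirroring real-world prevalence" can mean in practice, the sketch below draws a test set in which each code's share matches an observed frequency table. The function name `sample_by_prevalence`, the record fields, and the placeholder codes and shares are all assumptions made for this example; they are not taken from Ripple's pipeline.

```python
import random
from collections import Counter

def sample_by_prevalence(cases, prevalence, n, seed=0):
    """Draw roughly n labeled cases so each code's share matches `prevalence`.

    cases: list of dicts with a "code" key (hypothetical record shape).
    prevalence: dict mapping code -> share of real-world volume (sums to ~1).
    """
    rng = random.Random(seed)
    by_code = {}
    for case in cases:
        by_code.setdefault(case["code"], []).append(case)

    sample = []
    for code, share in prevalence.items():
        pool = by_code.get(code, [])
        # Never ask for more cases than were actually labeled for this code.
        k = min(len(pool), round(share * n))
        sample.extend(rng.sample(pool, k))
    return sample

# Illustrative data only: 200 labeled cases per placeholder code.
labeled = [
    {"id": f"{code}-{i}", "code": code}
    for code in ("code_common", "code_mid", "code_rare")
    for i in range(200)
]
observed = {"code_common": 0.80, "code_mid": 0.15, "code_rare": 0.05}

test_set = sample_by_prevalence(labeled, observed, n=100)
counts = Counter(case["code"] for case in test_set)
```

Sampling per code rather than uniformly over all labeled cases keeps a rare code rare in the test set even when labelers happened to collect many examples of it.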
Build · A harness, not a script
Continuous evaluation runs on every model checkpoint. Slice-level metrics — by specialty, by hospital, by code rarity — make regressions impossible to hide behind an aggregate number.
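A minimal sketch of slice-level scoring, assuming each evaluation record carries the predicted code, the expected code, and the slice attributes named above. The function `slice_accuracy` and the field names are hypothetical, chosen for this example only.

```python
from collections import defaultdict

def slice_accuracy(records, slice_keys):
    """Return {(key, value): accuracy} for every slice, plus ("all", "all")."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        correct = int(record["predicted"] == record["expected"])
        # Each record counts once toward the aggregate and once per slice.
        slices = [("all", "all")] + [(key, record[key]) for key in slice_keys]
        for s in slices:
            hits[s] += correct
            totals[s] += 1
    return {s: hits[s] / totals[s] for s in totals}

# Illustrative records: nine correct cardiology cases, one wrong oncology case.
records = [
    {"specialty": "cardiology", "predicted": "X", "expected": "X"}
    for _ in range(9)
] + [{"specialty": "oncology", "predicted": "X", "expected": "Y"}]

scores = slice_accuracy(records, slice_keys=["specialty"])
```

In this toy data the aggregate is 0.9 while the oncology slice is 0.0, which is the kind of gap an aggregate-only report would hide.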
Operate · Production canary on every release
A 1% shadow traffic canary runs against every new model for 72 hours. If any clinical slice regresses by more than a configurable threshold, the rollout halts automatically.
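The halt rule can be sketched as a comparison of per-slice accuracy between the current model and the candidate. This is a simplified illustration: the names `regressed_slices` and `should_halt`, the default `max_drop` of 0.02, and the sample scores are assumptions, and the 1% traffic split and 72-hour window are not modeled here.

```python
def regressed_slices(baseline, candidate, max_drop=0.02):
    """Names of slices whose accuracy fell by more than max_drop (absolute).

    baseline, candidate: dicts mapping slice name -> accuracy in [0, 1].
    Slices missing from `candidate` are skipped in this sketch; a real gate
    would probably treat a missing slice as a failure.
    """
    return sorted(
        name
        for name, base in baseline.items()
        if name in candidate and base - candidate[name] > max_drop
    )

def should_halt(baseline, candidate, max_drop=0.02):
    """True if any slice regressed beyond the configured threshold."""
    return bool(regressed_slices(baseline, candidate, max_drop))

# Illustrative scores only.
current = {"cardiology": 0.90, "oncology": 0.88}
new_model = {"cardiology": 0.91, "oncology": 0.80}

flagged = regressed_slices(current, new_model)
halt = should_halt(current, new_model)
```

Here the candidate improves cardiology but drops oncology by eight points, so the gate flags oncology and halts the rollout even though one slice got better.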
✺ — Outcome
Three numbers we’d defend in public.
+11pp
improvement in real-world slice accuracy
100%
of regressing releases caught before prod
2 weeks
the team's release cycle, down from 6 weeks
“They didn't touch our model for the first four weeks. They built the thing that let us see what our model was actually doing — and after that, the wins were obvious.”
Head of ML, Ripple Labs