The Production Gap
At some point you build an evaluation set for your document tool. You collect representative documents, label the correct outputs, run the tool, measure accuracy. You improve the tool until the numbers look good. Then you ship it, and your users start feeding it their actual documents — and the experience they report does not match the numbers you measured.
This is the production gap, and it is not a calibration error. You didn’t measure wrong. The tool performed on your eval set exactly as well as you thought. The problem is that the eval set and the production pile are two different populations of documents, and the difference between them is systematic, not random.
The eval set reflects what you thought to include when you built it. You included the document types you knew about, the variants you had seen, the failure cases you had already encountered and fixed. It skews toward the documents that were easy to label correctly, because hard-to-label documents get excluded or corrected until they’re clean. It reflects your mental model of the problem space at the time you built it — which is the mental model you had before your users showed you what the actual problem space looks like.
Production documents are different in a specific way: they include everything your eval set excluded. They include the vendors you hadn’t heard of, the templates that predate standardization, the scanned copies of already-scanned copies, the documents someone assembled from multiple sources and called a single file. They include the failure modes you haven’t seen yet, which means they include the failure modes you didn’t know to test for. The eval set is the world as you imagined it; the production pile is the world as it is.
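One rough way to make this difference visible is to compare which kinds of documents appear in production but never in the eval set. The sketch below assumes each document can be given a coarse category label, such as a vendor or template name. The function name coverage_gap and the sample labels are hypothetical, chosen only for illustration. It cannot surface failure modes nobody has a label for yet, so treat it as a starting point rather than a full answer.

```python
from collections import Counter


def coverage_gap(eval_labels, production_labels):
    """Return production categories absent from the eval set.

    Each missing category maps to its share of the production sample.
    Assumes every document already carries a coarse category label.
    """
    eval_seen = set(eval_labels)
    prod_counts = Counter(production_labels)
    total = sum(prod_counts.values())
    return {
        category: count / total
        for category, count in prod_counts.items()
        if category not in eval_seen
    }


# Hypothetical labels, for illustration only.
eval_labels = ["invoice_vendor_a", "invoice_vendor_b", "receipt"]
production_labels = [
    "invoice_vendor_a",
    "invoice_vendor_z",
    "rescanned_receipt",
    "invoice_vendor_z",
]

gap = coverage_gap(eval_labels, production_labels)
for category, share in sorted(gap.items()):
    print(f"{category}: {share:.0%} of production, 0 eval cases")
```

A category that makes up a large share of production and has no eval cases is a place where the measured accuracy says nothing about what users experience.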
This gap doesn’t close by improving your eval score. It closes by improving your eval set — which means letting production failures teach you what to add. Every time a user hits a document the tool handles badly, that document is pointing at a gap in your evaluation coverage. The right response isn’t just to fix the immediate failure; it’s to add that document type (or the class of documents like it) to the eval set, so the fix holds and the coverage grows. Fixing without adding to the eval means you patch the symptom but your measurement still doesn’t see the thing that caught you.
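The loop described above can be sketched as a small data structure: each production failure becomes a labeled eval case, tagged with its document class, and accuracy is reported per class so a newly added class shows up in the numbers. This is a minimal sketch under stated assumptions, not a prescribed design. EvalCase, EvalSet, add_production_failure, and toy_tool are all hypothetical names, and the tool is modeled as a plain function from a document ID to an output string.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    doc_id: str
    doc_class: str
    expected: str
    source: str = "curated"  # or "production_failure"


@dataclass
class EvalSet:
    cases: list = field(default_factory=list)

    def add_production_failure(self, doc_id, doc_class, expected):
        """Add a failing production document as an eval case.

        Returns False if the document is already in the set.
        """
        if any(case.doc_id == doc_id for case in self.cases):
            return False
        self.cases.append(
            EvalCase(doc_id, doc_class, expected, source="production_failure")
        )
        return True

    def accuracy_by_class(self, tool):
        """Run the tool on every case and return accuracy per document class."""
        totals, correct = {}, {}
        for case in self.cases:
            totals[case.doc_class] = totals.get(case.doc_class, 0) + 1
            if tool(case.doc_id) == case.expected:
                correct[case.doc_class] = correct.get(case.doc_class, 0) + 1
        return {cls: correct.get(cls, 0) / totals[cls] for cls in totals}


def toy_tool(doc_id):
    # Stand-in for the real document tool (hypothetical outputs).
    outputs = {"doc-1": "total=100"}
    return outputs.get(doc_id, "")


evals = EvalSet([EvalCase("doc-1", "standard_invoice", "total=100")])
before = evals.accuracy_by_class(toy_tool)

# A user reports a badly handled document; record it before fixing the tool.
added = evals.add_production_failure("doc-9", "rescanned_invoice", "total=75")
after = evals.accuracy_by_class(toy_tool)
```

In this toy run the eval looks perfect before the failure is recorded, and the new class appears with zero accuracy afterward. That is the point of adding the case first: the measurement now sees what caught you, and the fix can be verified against it.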
The production gap is permanent if you treat evals as a one-time investment and production as merely the place you deploy to. It narrows, slowly and continuously, if you treat every production failure as a contribution to your evaluation coverage. The tool that closes the gap fastest isn’t the one with the best initial eval; it’s the one with the tightest connection between what fails in production and what gets added to the eval set next.