Closing the Loop
Put the two problems together and the shape of the underlying issue comes into focus. There’s a gap between how the tool performs on your evaluation set and how it performs on your users’ actual documents — and that gap persists because production failures aren’t making it back to the evaluation set. There’s a trickle of valuable feedback, from users who fed in documents the tool couldn’t handle, that crosses none of the thresholds required to actually inform what gets built next. The production gap stays open because the feedback loop is open.
Closing the loop means making the path from production failure to evaluation coverage short, deliberate, and routine. Not automatic: the loop requires human judgment at several points, above all at the step where someone decides what class of document a particular failure represents and what that class implies for what to handle next. But the infrastructure around that judgment should be as frictionless as possible: it should be easy to surface a failed document, easy to preserve it, easy to flag it as a coverage gap, and easy to act on the flag. Most teams have none of this. The production failure happens, the user works around it, and the gap remains, not because anyone decided to ignore it, but because the path from failure to action was never built.
What a closed loop looks like in practice: every time the tool produces output below a confidence threshold, or explicitly declines, that event is recorded with enough context to route it to someone who can decide whether it represents a new coverage gap. The documents that triggered the event are preserved. Periodic review turns the collection of failures into a prioritized list of what to handle next: the document types that appear frequently enough, and whose handling is tractable enough, to be worth the effort. Each item that makes it to the top of that list becomes a document type the tool now handles that it didn’t before, and the eval set grows to cover it, so the handling holds.
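To make the record-and-review steps concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `FailureEvent`, `FailureLog`, and `CONFIDENCE_THRESHOLD`, the 0.8 cutoff, and the scoring rule (frequency times a hand-assigned tractability score) are hypothetical, not a description of any particular tool. It also assumes the triggering document has already been preserved somewhere and can be found by its `doc_id`.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; a real tool would tune this


@dataclass
class FailureEvent:
    doc_id: str                      # key of the preserved document
    confidence: Optional[float]      # None when the tool declined outright
    declined: bool
    doc_class: Optional[str] = None  # filled in later by a human reviewer


@dataclass
class FailureLog:
    events: list = field(default_factory=list)

    def record(self, doc_id, confidence, declined=False):
        """Record the event only if it is a decline or a low-confidence output."""
        low = confidence is not None and confidence < CONFIDENCE_THRESHOLD
        if declined or low:
            event = FailureEvent(doc_id, confidence, declined)
            self.events.append(event)
            return event
        return None

    def prioritize(self, tractability):
        """Rank reviewer-labelled classes by frequency times tractability (0 to 1)."""
        counts = Counter(e.doc_class for e in self.events if e.doc_class)
        scored = {c: n * tractability.get(c, 0.0) for c, n in counts.items()}
        return sorted(scored, key=scored.get, reverse=True)


log = FailureLog()
log.record("doc-1", 0.95)                       # confident output: not recorded
e1 = log.record("doc-2", 0.40)
e2 = log.record("doc-3", None, declined=True)
e3 = log.record("doc-4", 0.55)

# The human judgment step: a reviewer decides what class each failure represents.
e1.doc_class = "scanned invoice"
e2.doc_class = "handwritten form"
e3.doc_class = "scanned invoice"

print(log.prioritize({"scanned invoice": 0.9, "handwritten form": 0.3}))
# prints ['scanned invoice', 'handwritten form']
```

The labelling is deliberately left outside the code. The log only makes failures cheap to capture and rank; deciding what a failure represents stays with a person.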
The reason this is worth building deliberately is the same reason the tail is the moat. A team with a closed feedback loop learns faster than a team without one. Every real-world failure that makes it through the loop and into the eval set is one less blind spot, one more coverage expansion, one step further down the tail. The compounding is real: the coverage you add this month is still there next month, and the month after. The team without the loop is also improving, but slowly: encountering the same failures again, building the same context from scratch, occasionally fixing a failure without adding it to the eval set and watching the regression reappear. The loop is what makes improvement durable.
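The durability point can be sketched the same way. The snippet below assumes a hypothetical eval set held as a list of `EvalCase` records and a hypothetical `run_tool` callable that returns the tool's output for a preserved document. Exact string matching stands in for whatever scoring a real evaluation would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    doc_id: str     # key of the preserved production document
    doc_class: str  # class assigned during review
    expected: str   # output the tool should produce once the gap is fixed


def promote(eval_set, doc_id, doc_class, expected):
    """Add a fixed failure to the eval set, skipping documents already covered."""
    if any(case.doc_id == doc_id for case in eval_set):
        return False
    eval_set.append(EvalCase(doc_id, doc_class, expected))
    return True


def regressions(eval_set, run_tool):
    """Return the doc_ids whose current output no longer matches the expected one."""
    return [case.doc_id for case in eval_set if run_tool(case.doc_id) != case.expected]
```

A fix that ships without a matching `promote` call is the case the paragraph above warns about: nothing in the eval set will notice when that handling breaks again.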
So the three-part synthesis: the production gap is structural, not a bug (the eval set will always trail the real document pile until you close the loop); the failures that could close the gap are being lost (they don’t survive the three thresholds without deliberate infrastructure); and the tool that builds that infrastructure improves faster, compounds its coverage, and extends its lead down the tail while competitors are still trying to hold the head. The loop is not a quality program or an operational nicety. It’s the mechanism by which a document tool gets better at the thing that actually matters.