The Verification Budget
A user reviewing the output of an extraction tool has a finite verification budget: a limited amount of attention they’re willing to spend checking whether the extracted values are right. This budget is small and it’s spent fast. The first time a tool wastes it — sends the user to double-check a field that was obviously fine — they spend a little less next time. The first time it under-spends — lets a wrong value through unflagged — they learn they can’t rely on the tool’s judgment and start re-checking everything, which means the tool saved them nothing. The design problem isn’t accuracy in the abstract. It’s directing a scarce verification budget to the fields that actually need it.
Most tools spend the budget badly in one of two ways. The flat-trust tool presents every field with equal weight: a hundred values, no signal about which ones are solid and which are shaky. The user either trusts all of them (and gets burned by the few that are wrong) or checks all of them (and the tool has saved them nothing but typing). The flat-distrust tool flags everything as needing review — every field gets a warning icon, a “please verify” — which is the same as flagging nothing, because the user can’t tell the real risks from the noise. Both fail for the same reason: they don’t differentiate, and verification budget is only well spent when it’s differentiated.
Spending the budget well means concentrating the user’s attention where the risk actually is. A date in a standard position, in a clean machine-readable document, extracted with a clear citation — that should be presented as solid, requiring no review, because it is. A value pulled from a low-quality scan, or from a region of the document where the layout was ambiguous, or a field that required reconciling two conflicting mentions — that should be surfaced for review, prominently, with the specific reason it’s uncertain. The user’s attention goes to the five fields that might be wrong instead of being smeared evenly across the hundred that are mostly fine.
The reason this is hard is that it requires the tool to have a calibrated internal sense of its own risk — not a fabricated confidence number, but a real signal grounded in things it can actually observe: source quality, layout ambiguity, whether a value required inference, whether two parts of the document disagreed. These are inspectable, defensible reasons. “Review this field because it came from a low-resolution region and the surrounding text was garbled” is something a user can act on. “Review this field, confidence 0.6” is not.
The payoff is compounding trust. A tool that reliably tells the user where to look earns the right to be trusted everywhere else. Once a user learns that the unflagged fields are genuinely safe — because the few times something was risky, the tool said so, and was right — they stop re-checking the safe ones. That’s the entire value proposition of an extraction tool: not that it extracts, but that it lets the user stop verifying everything. You only get there by spending their verification budget as carefully as if it were your own.