Not All Errors Cost the Same

A single accuracy number flattens something that isn’t flat. “97% accurate” treats every extracted field as one unit, each error counting the same as any other. But the user doesn’t experience errors that way. Getting a property’s square footage off by a digit is a different kind of failure than misreading a non-essential descriptive field, and getting a financial obligation wrong — a payment amount, a deadline, a liability — can be catastrophic in a way that a hundred trivial errors aren’t. Errors have wildly different costs depending on which field they land in, and an aggregate accuracy number is blind to exactly the distinction the user cares about most.

This matters because it changes where reliability should be invested. If you optimize for the aggregate number, you treat all fields as equally worth improving, and you’ll happily give up a small gain on high-stakes fields for a larger gain on trivial ones, because the aggregate rises more that way. That’s backwards from what the user needs. The user would gladly accept more errors on the fields that don’t matter in exchange for near-certainty on the few that do. The reliability budget should be spent in proportion to the cost of being wrong, not in proportion to how often a field appears. A tool that’s 99% accurate overall but unreliable on the three fields that carry real consequences is, for a serious user, an unreliable tool.
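To make the gap between the aggregate number and the cost of errors concrete, here is a minimal sketch in Python. The field names, the cost weights, and the error counts are all made-up assumptions for illustration; real weights would have to come from knowing the domain.

```python
# Hypothetical per-field cost of a single error, in arbitrary units.
# These weights are illustrative assumptions, not measured values.
FIELD_COST = {
    "payment_amount": 100.0,
    "due_date": 100.0,
    "liability_cap": 100.0,
    "description": 1.0,
    "notes": 1.0,
}


def accuracy(errors: dict[str, int], totals: dict[str, int]) -> float:
    """Plain aggregate accuracy: every extracted field counts the same."""
    total = sum(totals.values())
    wrong = sum(errors.get(field, 0) for field in totals)
    return (total - wrong) / total


def error_cost(errors: dict[str, int], costs: dict[str, float]) -> float:
    """Total cost of errors, weighting each one by the field it landed in."""
    return sum(costs[field] * count for field, count in errors.items())


# 200 extractions per field, 1000 in total (made-up numbers).
totals = {field: 200 for field in FIELD_COST}

tool_a_errors = {"payment_amount": 10}            # few errors, all high-stakes
tool_b_errors = {"description": 25, "notes": 25}  # more errors, all trivial

for name, errors in [("A", tool_a_errors), ("B", tool_b_errors)]:
    print(name, accuracy(errors, totals), error_cost(errors, FIELD_COST))
```

Under these assumed numbers, tool A scores 99% with an error cost of 1000 units, while tool B scores 95% with an error cost of 50. The aggregate ranks them one way and the cost-weighted view ranks them the other.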

Identifying the high-cost fields is a domain problem, not a technical one. It requires knowing what the output is used for and what happens downstream when each field is wrong. In one domain the critical fields are dates and dollar amounts; in another they’re identifiers that key into other systems; in another they’re the legal terms that change the meaning of an agreement. You can’t derive this from the document structure alone — you have to understand the user’s workflow well enough to know which errors trigger expensive consequences and which ones are shrugged off. That domain understanding is what separates a tool built by someone who knows the field from a generic extractor pointed at it.

The practical move is to treat high-cost fields as a different reliability class. They deserve more: more careful extraction, more conservative behavior, a stronger bias toward flagging uncertainty rather than guessing, more visible provenance so the user verifies them by reflex. It’s entirely reasonable to handle a critical field with extra caution, surfacing it for review more readily even when the extractor is fairly confident, while letting low-stakes fields flow through with lighter handling. Uniform treatment is the mistake. The fields don’t carry uniform consequences, so they shouldn’t get uniform care.
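One way to sketch the idea of reliability classes is a per-class review threshold applied to the extractor's confidence. Everything named here is hypothetical: the two classes, the thresholds, the field assignments, and the assumption that the extractor reports a confidence score between 0 and 1.

```python
from dataclasses import dataclass
from enum import Enum


class ReliabilityClass(Enum):
    CRITICAL = "critical"
    ROUTINE = "routine"


# Assumed thresholds: below this confidence, the value goes to human review.
REVIEW_THRESHOLD = {
    ReliabilityClass.CRITICAL: 0.98,
    ReliabilityClass.ROUTINE: 0.70,
}

# Assumed mapping from field to class; in practice this is a domain decision.
FIELD_CLASS = {
    "payment_amount": ReliabilityClass.CRITICAL,
    "due_date": ReliabilityClass.CRITICAL,
    "description": ReliabilityClass.ROUTINE,
}


@dataclass
class Extraction:
    field: str
    value: str
    confidence: float  # assumed to be in [0, 1]


def needs_review(extraction: Extraction) -> bool:
    """Flag a value for review using the threshold of its field's class."""
    # Design choice in this sketch: unknown fields are treated as critical.
    cls = FIELD_CLASS.get(extraction.field, ReliabilityClass.CRITICAL)
    return extraction.confidence < REVIEW_THRESHOLD[cls]


amount = Extraction("payment_amount", "1200.00", confidence=0.95)
blurb = Extraction("description", "Two-story brick house", confidence=0.95)

print(needs_review(amount))  # flagged: 0.95 is below the critical threshold
print(needs_review(blurb))   # not flagged: 0.95 clears the routine threshold
```

The same confidence produces different handling depending on the field. Defaulting unmapped fields to the critical class is a conservative choice for this sketch; a real system might prefer to fail loudly on an unmapped field instead.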

The deeper point is that an accuracy metric is a proxy, and optimizing the proxy can diverge from serving the user. What the user wants isn’t a high average — it’s confidence that the errors that would hurt them aren’t there. A tool that understands which errors cost the most, and concentrates its reliability there, can feel more trustworthy at 95% than a competitor at 99%, because it’s right where being right matters and honest where it isn’t sure. Measure what the errors cost, not just how many there are. Then spend your reliability accordingly.