The Plausible Wrong Answer
If users skim the output rather than auditing it, then the error that matters isn’t the one that looks wrong — it’s the one that looks right. An obviously broken value is almost harmless: a date where a dollar amount should be, a garbled string, a number with too many digits. The skimming user’s eye snags on it precisely because it’s incongruous, and they fix it. The tool made a mistake, but the mistake announced itself. The truly costly error is the plausible one: a value that’s the right type, in the right range, formatted correctly, and simply wrong. Nothing about it interrupts the skim, so it passes straight through into the user’s real work wearing the disguise of a correct answer.
This inverts the intuition about which errors to fear. A tool’s worst failures, ranked by damage, aren’t its most visibly broken outputs — they’re its most confident-looking incorrect ones. An extracted figure that’s off by a transposition but still looks like a normal figure. A field pulled from the wrong row of a table that happens to contain a value of the right shape. A number that’s correct except for a unit the tool silently assumed. Each of these is plausible by construction, which is exactly what makes it dangerous: plausibility is the property that lets an error survive the only review it’s going to get.
The design implication is that a tool can’t treat all of its outputs as equally trustworthy just because they all look equally finished. The output that’s most likely to be a plausible wrong answer (pulled from an ambiguous layout, extracted from a low-quality scan, or sitting at the edge of what the model is sure about) needs to be marked as such even though, on its face, it looks as clean as every other field. The tool knows things about its own uncertainty that the rendered value doesn’t show. Throwing that knowledge away and presenting a shaky extraction identically to a rock-solid one is how a plausible error gets laundered into a trusted fact.
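One way to keep that knowledge attached to the value is to carry it through to rendering. The following is a minimal sketch, not a prescribed design: the `ExtractedField` type, the `confidence` score in [0, 1], the `reasons` list, and the 0.85 threshold are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field

# Assumed cutoff for illustration; a real tool would tune this on its own data.
REVIEW_THRESHOLD = 0.85


@dataclass
class ExtractedField:
    """An extracted value together with what the tool knows about its reliability."""
    name: str
    value: str
    confidence: float  # hypothetical score in [0, 1]
    reasons: list[str] = field(default_factory=list)  # e.g. "low-quality scan"


def render(f: ExtractedField) -> str:
    """Render a field, marking it when the extraction rests on weak evidence.

    The marker depends only on the tool's own signals, never on whether
    the value itself looks well-formed.
    """
    if f.confidence >= REVIEW_THRESHOLD and not f.reasons:
        return f"{f.name}: {f.value}"
    why = "; ".join(f.reasons) or "low confidence"
    return f"{f.name}: {f.value}  [CHECK: {why}]"
```

With this shape, two values that look equally finished render differently: a solid `1,482.50` appears bare, while a shaky `1,428.50` from a poor scan appears with a `[CHECK: low-quality scan]` marker that interrupts the skim.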
There’s a quieter discipline here too: the tool should be more suspicious of its own clean-looking results, not less. It’s tempting to flag only the outputs that look messy, because those are the obvious candidates for error. But the messy ones are self-policing: the user catches them. The effort is better spent detecting the outputs that look fine but rest on a weak foundation, because those are the ones human review won’t catch. A tool that surfaces uncertainty only when the result also happens to look uncertain is covering the case that didn’t need covering.
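That discipline can be sketched as a review rule that ignores surface appearance when deciding whether to flag, and then ranks clean-looking weak results above messy ones. Everything here is an assumption made for illustration: the `Extraction` type, the `candidates` count (how many source locations held a value of the right shape), the amount pattern, and the 0.85 threshold.

```python
import re
from dataclasses import dataclass

# Assumed notion of a "clean-looking" amount, e.g. 1,428.50
AMOUNT = re.compile(r"\d{1,3}(,\d{3})*\.\d{2}")


@dataclass
class Extraction:
    value: str
    confidence: float  # hypothetical score in [0, 1]
    candidates: int    # competing source locations with a right-shaped value


def looks_clean(e: Extraction) -> bool:
    """True when the value is well-formed on its face."""
    return AMOUNT.fullmatch(e.value) is not None


def needs_review(e: Extraction, min_confidence: float = 0.85) -> bool:
    """Flag on weak evidence only; deliberately independent of looks_clean."""
    return e.confidence < min_confidence or e.candidates > 1


def review_priority(e: Extraction) -> int:
    """0 = no flag, 1 = weak and visibly messy, 2 = weak but clean-looking.

    The clean-looking case ranks highest because nothing else will stop it;
    a skimming user tends to catch the messy one unaided.
    """
    if not needs_review(e):
        return 0
    return 2 if looks_clean(e) else 1
```

The design choice worth noting is that `looks_clean` never suppresses a flag. It only raises the priority of a weak result, which is the reverse of flagging messy output first.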
The general lesson is that error visibility and error cost are inversely related when a human skims. The bugs that show are cheap; the bugs that hide are expensive. A tool that wants to be genuinely reliable in real use has to spend its effort on the hidden ones — on catching and surfacing the plausible wrong answer before it slips through the glance — rather than on polishing away the obvious mistakes the user was always going to fix anyway. Make the dangerous errors visible, because the visible errors were never the danger.