The Confidence Score Trap

There’s an obvious-seeming improvement you can make to an extraction system: attach a confidence score to every field. The system extracts a value, and alongside it reports a number — 0.92, 0.47 — that’s supposed to tell the user how much to trust the extraction. It looks like transparency. It looks like exactly the honesty that builds trust. In practice, an uncalibrated confidence score is worse than no score at all, because it converts an honest “I don’t know how sure I am” into a precise-looking number that the user will take literally.

The problem is calibration. A confidence score is only useful if it means something consistent: if fields scored 0.9 are correct about 90% of the time and fields scored 0.5 are correct about half the time, the number is doing real work. The user can set a threshold, review everything below it, and accept everything above it knowing roughly what error rate they are signing up for. But most confidence scores that come out of a language model are not calibrated this way. They’re a model’s internal sense of fluency or plausibility, which correlates with correctness loosely and unpredictably. A model can be highly confident and wrong, for example by extracting a number that reads cleanly but came from the wrong row of a table. The 0.92 doesn’t mean 92% likely correct. It may mean little more than that the model found the output linguistically comfortable.

This is a trap specifically because the score is actionable-looking. A user who sees a 0.92 will skip verification. A user who sees a 0.6 will spend time checking. If those numbers don’t track actual accuracy, you’ve directed the user’s attention to exactly the wrong places: they’re re-checking low-scored fields that were correct all along and trusting high-scored fields that are wrong. You’ve spent their attention budget and made the output less safe, all while making the system feel more transparent. The score has laundered uncertainty into false precision.
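
To make the misdirection concrete, here is a minimal sketch with invented numbers. The six fields, their scores, and the 0.8 review threshold are all made up for illustration, and `review_outcome` is a hypothetical helper rather than part of any library. It assumes the user reviews every field scored below the threshold and trusts the rest.

```python
def review_outcome(fields, threshold):
    """Summarize what threshold-based review catches and misses.

    fields: list of (reported_score, is_actually_correct) pairs.
    Fields scored below `threshold` get reviewed; the rest are trusted.
    """
    reviewed = [(score, ok) for score, ok in fields if score < threshold]
    trusted = [(score, ok) for score, ok in fields if score >= threshold]
    return {
        "reviewed": len(reviewed),
        # Reviews that found a real error.
        "errors_caught": sum(1 for _, ok in reviewed if not ok),
        # Reviews spent on fields that were already correct.
        "wasted_reviews": sum(1 for _, ok in reviewed if ok),
        # Wrong fields the user never looked at.
        "errors_missed": sum(1 for _, ok in trusted if not ok),
    }


# Invented toy sample: scores that do not track correctness.
miscalibrated_fields = [
    (0.92, False),  # reads cleanly, but came from the wrong table row
    (0.95, True),
    (0.91, False),
    (0.60, True),
    (0.55, True),
    (0.58, True),
]

outcome = review_outcome(miscalibrated_fields, threshold=0.8)
print(outcome)
```

On this toy sample, all three reviews land on fields that were already correct, and both errors pass through unchecked. Real data will rarely be this lopsided, but the mechanism is the same whenever the score and the accuracy come apart.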

The honest alternatives are less elegant but more useful. The first is to not show a number at all, and instead surface the evidence — the citation, the source location, the surrounding context — and let the user calibrate their own trust from material they can actually inspect. A citation is a verifiable claim; a confidence score is an unverifiable one. The second is to show confidence only where it’s genuinely calibrated, which usually means a small number of well-understood field types where you’ve measured accuracy against ground truth and the number means what it says. Partial, earned confidence scoring beats universal, fabricated confidence scoring.
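
One way both alternatives might look in practice is sketched below. Everything here is an assumption made for illustration: the `Evidence` and `ExtractedField` shapes, the `present` function, and the field names in `CALIBRATED_FIELD_TYPES` are hypothetical. The idea is that evidence is always shown, while a score is shown only for field types whose scores have been checked against ground truth.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    """Where the value came from, so the user can inspect it directly."""
    document_id: str
    page: int
    quoted_text: str


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    evidence: Evidence
    raw_score: Optional[float] = None  # model-reported, possibly uncalibrated


# Hypothetical: field types whose scores were measured against labeled data.
CALIBRATED_FIELD_TYPES = frozenset({"invoice_date", "total_amount"})


def present(field: ExtractedField) -> dict:
    """Build what the user sees: evidence always, a score only if earned."""
    shown = {
        "name": field.name,
        "value": field.value,
        "source": f"{field.evidence.document_id}, page {field.evidence.page}",
        "quote": field.evidence.quoted_text,
    }
    if field.name in CALIBRATED_FIELD_TYPES and field.raw_score is not None:
        shown["confidence"] = field.raw_score
    return shown


total = ExtractedField(
    name="total_amount",
    value="1,240.00",
    evidence=Evidence("invoice-001.pdf", 2, "Total due: 1,240.00"),
    raw_score=0.92,
)
vendor = ExtractedField(
    name="vendor_name",
    value="Acme Ltd",
    evidence=Evidence("invoice-001.pdf", 1, "Acme Ltd, 12 High Street"),
    raw_score=0.92,
)

print(present(total))   # includes a confidence entry
print(present(vendor))  # evidence only; the unvalidated score is withheld
```

The design choice is that the score is dropped rather than greyed out or relabeled: if a number hasn’t been validated for a field type, the user never sees it and so cannot over-read it.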

If you do show scores, measure them. Take a labeled sample, bucket the predictions by reported confidence, and check whether the accuracy in each bucket matches the score. If the 0.9 bucket is 70% accurate, the score is lying, and you either recalibrate it or remove it. A confidence score you haven’t validated against real outcomes is decoration — and decoration that users mistake for information is a liability. The goal isn’t to look transparent. It’s to give the user something they can act on correctly.
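
The bucketing check described above can be sketched in a few lines of Python. This is one possible version under stated assumptions, not a definitive implementation: `calibration_table` and `expected_calibration_error` are names chosen here, ten equal-width buckets is an arbitrary default, and the sample data is invented. The summary number is the standard expected calibration error, the count-weighted average gap between reported confidence and observed accuracy.

```python
from collections import defaultdict


def calibration_table(predictions, n_buckets=10):
    """Group (confidence, is_correct) pairs into equal-width confidence
    buckets and compare mean reported confidence with observed accuracy."""
    buckets = defaultdict(list)
    for confidence, is_correct in predictions:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        # A confidence of exactly 1.0 goes into the top bucket.
        index = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[index].append((confidence, is_correct))

    rows = []
    for index in sorted(buckets):
        items = buckets[index]
        count = len(items)
        mean_confidence = sum(c for c, _ in items) / count
        accuracy = sum(1 for _, ok in items if ok) / count
        rows.append({
            "lower": index / n_buckets,
            "upper": (index + 1) / n_buckets,
            "count": count,
            "mean_confidence": mean_confidence,
            "accuracy": accuracy,
            # Negative gap: the score overstates how often it is right.
            "gap": accuracy - mean_confidence,
        })
    return rows


def expected_calibration_error(rows):
    """Count-weighted average of the absolute gap across buckets."""
    total = sum(row["count"] for row in rows)
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return sum(row["count"] * abs(row["gap"]) for row in rows) / total


# Invented labeled sample: the high-score bucket is right only 7 times in 10.
sample = (
    [(0.92, True)] * 7 + [(0.92, False)] * 3
    + [(0.55, True)] * 5 + [(0.55, False)] * 5
)

table = calibration_table(sample)
for row in table:
    print(
        f"[{row['lower']:.1f}, {row['upper']:.1f})  n={row['count']:3d}  "
        f"reported={row['mean_confidence']:.2f}  actual={row['accuracy']:.2f}"
    )
print(f"ECE: {expected_calibration_error(table):.3f}")
```

In this made-up sample the top bucket reports about 0.92 and delivers 0.70, which is the “score is lying” case. Buckets with only a handful of labeled examples give noisy accuracy estimates, so the per-bucket count is worth reading alongside the gap before deciding whether to recalibrate or remove the score.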