Absent vs. Unknown | Latent Logs

When an extraction system returns an empty value for a field, it can mean one of two things: the field is absent from the document (the information is genuinely not there), or the extraction failed to find the field (it might be there, the system just couldn’t locate it). These are different situations with different implications. Most extraction systems collapse them into a single null or empty output and provide no way to distinguish between them.

This is a quiet design mistake that compounds over time. A user who sees an empty “early termination clause” field has two possible interpretations: the lease has no early termination clause (absent), or the extraction system missed it (unknown). These interpretations lead to very different actions. If the field is absent, the user can note that absence as a material fact and proceed. If the field might just have been missed, the user has to read the lease manually to check, which is exactly the work they were trying to avoid. An empty output that can mean either thing can’t be trusted to mean absent, because any given empty value might be the unknown case instead.

The output schema for an extraction system should make this distinction explicit. Three states, not two: present (value found, here is the value and its citation), absent (the system has determined the field is not in the document, and here is the evidence: the surrounding context that establishes it isn’t there), and unknown (extraction could not determine whether the field is present or absent). Unknown is the failure state: it means the system ran into a limitation and couldn’t give a confident answer either way.
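One way to picture the three-state schema is as a small tagged result type. The sketch below is illustrative only: the names `FieldState`, `FieldResult`, `present`, `absent`, and `unknown`, and the choice of plain strings for citations and evidence, are assumptions made for this example and not a prescribed format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FieldState(Enum):
    PRESENT = "present"   # value found, with a citation
    ABSENT = "absent"     # field determined not to be in the document
    UNKNOWN = "unknown"   # extraction could not decide either way


@dataclass(frozen=True)
class FieldResult:
    # Hypothetical result record; all field names are assumptions.
    state: FieldState
    value: Optional[str] = None      # the extracted value (PRESENT only)
    citation: Optional[str] = None   # where the value was found (PRESENT only)
    evidence: Optional[str] = None   # context supporting the absence (ABSENT)
    reason: Optional[str] = None     # limitation that was hit (UNKNOWN)

    def __post_init__(self):
        if self.state is FieldState.PRESENT:
            if self.value is None or self.citation is None:
                raise ValueError("present requires a value and a citation")
        elif self.value is not None:
            raise ValueError("only a present result may carry a value")


def present(value: str, citation: str) -> FieldResult:
    return FieldResult(FieldState.PRESENT, value=value, citation=citation)


def absent(evidence: str) -> FieldResult:
    return FieldResult(FieldState.ABSENT, evidence=evidence)


def unknown(reason: str) -> FieldResult:
    return FieldResult(FieldState.UNKNOWN, reason=reason)
```

The point of the type is that an empty value never travels alone: a consumer always sees which of the three states produced it.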

Surfacing the absent state with evidence is the harder part. Saying a field is absent is a positive claim that requires justification. The system should be able to point to something in the document that supports the absence — a clause that would typically contain this information but doesn’t, a section header followed by no relevant content, a standard form with the relevant checkbox unchecked. Absent with evidence is a trustworthy output. Absent without evidence is indistinguishable from unknown.
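The rule that absent without evidence is indistinguishable from unknown can be enforced mechanically at the output boundary. The following is a minimal sketch under assumed names: `check_absence_claim` is a hypothetical helper, and representing states as the strings "present", "absent", and "unknown" is a simplification for the example.

```python
from typing import Optional, Tuple


def check_absence_claim(state: str, evidence: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (state, note), downgrading an unsupported 'absent' to 'unknown'.

    Hypothetical helper: an absence claim with no (or blank) evidence is
    treated as unknown, and the note records why it was downgraded.
    """
    if state == "absent" and not (evidence and evidence.strip()):
        return "unknown", "absence claimed without supporting evidence"
    return state, None
```

With this check in place, every absent reading that reaches the user comes with something they can inspect.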

The unknown state should be rare, but when it occurs it should be honest. If a field spans multiple locations in a large document and the extraction isn’t confident it found all relevant instances, unknown is the right answer. The user knows they need to review that field manually. That’s more useful than a fabricated present or a false absent.
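On the consuming side, an explicit unknown state makes the manual-review list trivial to produce. As a rough sketch, assuming results arrive as a mapping from field name to one of the strings "present", "absent", or "unknown" (the function name and the mapping shape are both hypothetical):

```python
from typing import Dict, List


def fields_needing_review(states: Dict[str, str]) -> List[str]:
    """Return the names of fields whose state is 'unknown', sorted by name."""
    return sorted(name for name, state in states.items() if state == "unknown")


# Example with made-up lease fields.
lease = {
    "monthly_rent": "present",
    "early_termination_clause": "absent",
    "renewal_options": "unknown",
}
review_queue = fields_needing_review(lease)
```

Here only `renewal_options` lands in the review queue; the absent early termination clause does not, which is exactly the saving the collapsed-null design gives up.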

Getting this right in v1 changes how users relate to the tool. An extraction system that distinguishes between absent and unknown earns trust for its absent readings. A system that collapses them erodes trust in every empty field: users learn they can’t rely on an empty field to mean anything definitive, and they stop treating the output as a reliable starting point.