The Extraction Boundary

Document processing tools fail in two predictable ways: they try to extract things that require reasoning rather than extraction, and they defer to humans on things that are straightforward to automate. Both errors are expensive. The first produces unreliable output that erodes trust. The second produces a tool that doesn’t justify the integration cost.

The extraction boundary is the line between what can be reliably read from a document and what requires interpretation beyond the text. Most fields in a structured document fall clearly on one side. A date, a dollar amount, a party name: these are in the document or they’re not. Extract them correctly with provenance, or report that they’re absent. A system that mistakes absent data for data it should infer is making a category error: it’s doing reasoning when it should be doing extraction, and the output will be wrong in ways that are hard to detect.
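As a minimal sketch of the "extract with provenance, or report absent" rule, the Python below returns either a value with the character offsets it was read from or an explicit absence marker, and never a guess. The type names, the `extract_amount` helper, and the dollar-amount pattern are all hypothetical, chosen only for illustration; a real system would need far more robust matching.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Extracted:
    field: str
    value: str
    start: int  # provenance: character offsets into the source text
    end: int


@dataclass(frozen=True)
class Absent:
    field: str  # the field was looked for and not found; nothing is inferred


# Illustrative pattern only: matches amounts such as "$4,250" or "$4,250.00".
DOLLAR_AMOUNT = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def extract_amount(field: str, text: str) -> Union[Extracted, Absent]:
    """Return the first dollar amount with its location, or an explicit Absent."""
    match = DOLLAR_AMOUNT.search(text)
    if match is None:
        return Absent(field)
    return Extracted(field, match.group(0), match.start(), match.end())
```

Because the offsets point back into the source, a caller can check that `text[result.start:result.end]` equals `result.value`, which is the property that makes the extraction verifiable.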

The harder cases are fields that appear to be in the document but require domain context to interpret correctly. A “base rent” figure might appear verbatim in a lease, but whether that figure is monthly or annual, gross or net, inclusive of certain costs or exclusive of others — those interpretations depend on surrounding language and standard practices in the relevant domain. A system that extracts the number and presents it without that context has provided a number that can’t be used safely. The user still needs to read the lease to understand what the number means. The automation has not reduced their work; it has added a step.

The right response to this class of problem is not to try to infer the context automatically, at least not in a first version. The right response is to extract the field, extract the surrounding evidence (the sentences and clauses that would allow a human to interpret the figure), and surface both together. The extracted value plus the cited evidence gives the user what they need to make the judgment call. The system has done the work it can do reliably; the user does the work that requires their expertise. That’s a clean division.
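One way to picture "value plus evidence" is the sketch below, which returns the matched figure together with the sentence it appears in and its immediate neighbors. It makes no attempt to decide what the figure means. The sentence splitter is deliberately naive, and the names `FieldWithEvidence`, `extract_with_evidence`, and the `window` parameter are assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FieldWithEvidence:
    value: str
    evidence: List[str]  # sentences a human needs in order to interpret the value


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_with_evidence(
    text: str, value_pattern: str, window: int = 1
) -> Optional[FieldWithEvidence]:
    """Find the first match and return it with `window` sentences on each side."""
    sentences = split_sentences(text)
    for i, sentence in enumerate(sentences):
        match = re.search(value_pattern, sentence)
        if match:
            first = max(0, i - window)
            last = min(len(sentences), i + window + 1)
            return FieldWithEvidence(match.group(0), sentences[first:last])
    return None  # absent: nothing is inferred
```

For a lease that says the base rent is a given amount per month and, in the next sentence, that it excludes utilities, the result carries both sentences alongside the number, so the reader can judge the basis without reopening the document.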

Where the boundary sits affects product design downstream. If you’ve correctly identified what’s automatable, you can build confident UI around those fields — show the extracted value, show the source, let the user verify with minimal friction. If you’ve incorrectly included judgment calls in the automated layer, every one of those fields becomes a liability: users will trust values they shouldn’t, catch errors too late, or stop trusting the tool entirely. An extraction system that knows its boundary earns trust by being reliably right about the things it claims to know. An extraction system that claims to know things it’s guessing at loses trust by being wrong in unpredictable ways.

Define the boundary before you build the system. Then enforce it in the output format: extracted values with citations on one side, fields requiring human judgment flagged as such on the other. The discipline of maintaining that distinction is what makes the tool trustworthy rather than merely impressive.
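A rough sketch of enforcing the boundary in the output format follows. The field names and the two sets that define the boundary are hypothetical; the point is only that the boundary is declared up front, every emitted value must carry a citation, and judgment fields are routed to their own side of the report.

```python
from typing import Dict, Optional, Tuple

# Hypothetical boundary, decided before any extraction code is written.
EXTRACTABLE = frozenset({"commencement_date", "tenant_name", "base_rent_amount"})
NEEDS_JUDGMENT = frozenset({"base_rent_basis"})

Candidate = Optional[Tuple[str, str]]  # (value, citation), or None if not found


def build_report(candidates: Dict[str, Candidate]) -> dict:
    """Sort candidate fields onto the correct side of the boundary."""
    unknown = set(candidates) - EXTRACTABLE - NEEDS_JUDGMENT
    if unknown:
        raise ValueError(f"fields outside the defined boundary: {sorted(unknown)}")

    report: dict = {"extracted": {}, "needs_judgment": {}, "absent": []}
    for name in sorted(EXTRACTABLE | NEEDS_JUDGMENT):
        found = candidates.get(name)
        if found is None:
            report["absent"].append(name)  # reported as absent, never inferred
            continue
        value, citation = found
        if not citation:
            raise ValueError(f"{name}: refusing to emit a value without a citation")
        side = "extracted" if name in EXTRACTABLE else "needs_judgment"
        report[side][name] = {"value": value, "citation": citation}
    return report
```

Raising on a missing citation or an undeclared field is a design choice in this sketch: it makes a boundary violation a loud failure during development instead of a quiet one in front of a user.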