The Large Document Problem

Real documents are longer than demo documents. This is almost always true, and it’s the first thing that breaks a document processing tool when it moves from a controlled test environment to production use.

The failure mode is predictable. You build and test against a handful of representative documents: short enough to fit comfortably within a model’s context window, simple enough to cover your core extraction use cases cleanly. The tool works. You ship. Then a real user uploads a 180-page document and the system fails silently, truncates the input, or produces output from only the first fraction of the document. None of these outcomes is visible until the user notices that fields from the latter half of the document are missing.
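One way to make that failure visible is to check the input size before extraction and refuse loudly instead of truncating. The following is a minimal sketch, not a definitive implementation: `DocumentTooLargeError`, `estimate_tokens`, and `check_fits` are hypothetical names, and the four-characters-per-token ratio is a rough assumption that should be replaced by the real tokenizer for whichever model is in use.

```python
class DocumentTooLargeError(Exception):
    """Raised when a document will not fit in a single model pass."""


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token, rounded up.
    # Replace with the actual tokenizer of the model being called.
    return (len(text) + 3) // 4


def check_fits(text: str, max_input_tokens: int) -> int:
    """Return the estimated token count, or raise if it exceeds the budget."""
    tokens = estimate_tokens(text)
    if tokens > max_input_tokens:
        # Fail loudly so the caller can route to a large-document path
        # instead of silently extracting from a truncated input.
        raise DocumentTooLargeError(
            f"document is ~{tokens} tokens, budget is {max_input_tokens}"
        )
    return tokens
```

A caller can catch `DocumentTooLargeError` and either switch to a chunked pipeline or tell the user the document was not processed, which is a better outcome than missing fields nobody knows are missing.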

The large-document problem is not primarily a technical problem — there are established approaches for chunking, overlap, hierarchical summarization, and retrieval-augmented extraction. The problem is that it’s invisible at demo time and expensive to retrofit. A system designed for short documents has its extraction logic, its output format, its citation mechanism, and its error handling all built around the assumption that the full document fits in one pass. Adapting that system to large documents often requires changes at every layer.

The architecture question is how to handle a field that might be defined anywhere in a long document. Simple chunking works if the relevant content is likely to be localized: a date near the top, a termination clause somewhere in the middle. It fails when a field is defined by the interaction of clauses spread across the document, for example when a value in section 15 modifies a value defined in section 3. Naively chunking that document and extracting independently from each chunk will produce two conflicting answers and no way to reconcile them.
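The reconciliation gap can be illustrated with a small sketch. It assumes the document has already been split into paragraphs and that some extraction step (not shown) returns per-chunk candidates. All names here (`Candidate`, `chunk_with_overlap`, `reconcile`) are hypothetical. The sketch does not resolve conflicts; it only surfaces them so that a later step can.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    field_name: str
    value: str
    chunk_index: int


def chunk_with_overlap(paragraphs, size, overlap):
    """Split a list of paragraphs into windows sharing `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append(paragraphs[start:start + size])
        if start + size >= len(paragraphs):
            break  # the last window already reaches the end
    return chunks


def reconcile(candidates):
    """Group candidates by field and flag fields whose chunks disagree."""
    by_field = {}
    for c in candidates:
        by_field.setdefault(c.field_name, []).append(c)
    result = {}
    for name, found in by_field.items():
        distinct = sorted({c.value for c in found})
        result[name] = {
            "values": distinct,
            "chunks": sorted({c.chunk_index for c in found}),
            # More than one distinct value means the chunks disagree
            # and the field needs a second look across those chunks.
            "conflict": len(distinct) > 1,
        }
    return result
```

A field flagged with `conflict` is a candidate for a second pass that sees the disagreeing passages together, which is one possible way to handle the section 3 versus section 15 case. Whether that pass is another model call or a human review is a design choice this sketch leaves open.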

The citation requirement makes large-doc handling harder. On a short document, citation is straightforward — the extracted value comes from a specific sentence, and that sentence has a location. On a chunked document, the location needs to be preserved through the chunking process and remain meaningful in the final output. A citation that says “from the document” is not useful; a citation that says “from section 12.4, paragraph 2” is.
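Preserving location through chunking can be sketched as follows: tag each paragraph with its section number and position before any chunking happens, so the tag travels with the text. This is an illustration under stated assumptions, not a general parser. It assumes headings sit on their own line in the form "12.4 Title" and that paragraphs are separated by blank lines; a one-line paragraph that happens to start with a number would be misread as a heading. `LocatedParagraph` and `locate_paragraphs` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Assumed heading shape: dotted number, whitespace, then a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")


@dataclass(frozen=True)
class LocatedParagraph:
    section: str    # e.g. "12.4"; empty before the first heading
    paragraph: int  # 1-based position within the section
    text: str

    def citation(self) -> str:
        where = f"section {self.section}" if self.section else "preamble"
        return f"{where}, paragraph {self.paragraph}"


def locate_paragraphs(document: str):
    """Split on blank lines, tagging each paragraph with its section."""
    section, count, out = "", 0, []
    for block in re.split(r"\n\s*\n", document.strip()):
        block = block.strip()
        if not block:
            continue
        match = HEADING.match(block)
        if match and "\n" not in block:
            # A single-line numbered block starts a new section.
            section, count = match.group(1), 0
            continue
        count += 1
        out.append(LocatedParagraph(section, count, block))
    return out
```

Chunks built from `LocatedParagraph` objects rather than raw strings keep their origin, so an extracted value can be reported with `citation()` instead of a bare "from the document".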

The practical implication: if the documents your tool will process in production can exceed a few pages, large-doc support has to be in scope for v1. Deferring it creates a class of users who cannot use the tool at all — not users who get a degraded experience, but users who hit a hard wall. Building in large-doc handling from the start means the architecture accommodates it; bolting it on later means rewriting the parts that assumed it wasn’t needed.