AI & ML

The LLM Context Window Problem: Strategies for Long-Document Processing

Zyptr Admin
17 June 2024
10 min read

128K Tokens Doesn't Mean What You Think

GPT-4o supports 128K tokens of context. Claude 3 supports 200K. Gemini 1.5 Pro supports a million. So long documents are a solved problem, right? Not even close. We've tested all of these on real document processing tasks, and the "just dump it in the context window" approach fails in subtle, dangerous ways.

The core issue is what researchers call "lost in the middle" — LLMs pay more attention to the beginning and end of the context window and less to the middle. We tested this with a 100-page legal contract in Claude's 200K window. We planted a specific clause at different positions and asked the model to find it. Detection rate was 98% when the clause was in the first 10 pages, 95% in the last 10 pages, and 71% in pages 40-60. That 71% is unacceptable for any serious application.

Strategy 1: Hierarchical Summarization

This is our go-to for documents where you need comprehensive understanding but the specific query isn't known upfront (like "summarize this 200-page report" or "identify all risks in this contract"). We break the document into chunks (usually 2,000-3,000 tokens each), summarize each chunk independently, then summarize the summaries. For very long documents, we do this recursively — three or four levels of summarization.

The key insight: each level of summarization should have a different focus. Level 1 summaries capture detail. Level 2 summaries capture themes. Level 3 summaries capture the overall narrative. We encode this in the prompt at each level. The result is a multi-resolution representation of the document that you can query at different levels of detail.
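As a minimal sketch of that recursion, assume a `call_llm(instruction, text)` wrapper around whatever chat-completion client you use; the level prompts and grouping factor here are illustrative stand-ins, not our production values:

```python
# Recursive hierarchical summarization: summarize chunks, group the
# summaries, and recurse with a coarser focus at each level.
from typing import Callable, List

LEVEL_FOCUS = {
    1: "Summarize this text, preserving specific facts, figures, and names.",
    2: "Summarize these section summaries, focusing on recurring themes.",
    3: "Summarize these theme summaries into a single overall narrative.",
}

def hierarchical_summarize(
    chunks: List[str],
    call_llm: Callable[[str, str], str],  # (instruction, text) -> summary
    group_size: int = 5,
    level: int = 1,
) -> str:
    """Recursively summarize until a single summary remains."""
    focus = LEVEL_FOCUS.get(level, LEVEL_FOCUS[3])
    summaries = [call_llm(focus, c) for c in chunks]
    if len(summaries) == 1:
        return summaries[0]
    # Group neighboring summaries and summarize again, one level up.
    groups = [
        "\n\n".join(summaries[i : i + group_size])
        for i in range(0, len(summaries), group_size)
    ]
    return hierarchical_summarize(groups, call_llm, group_size, level + 1)
```

Keeping the intermediate (level 1 and 2) summaries around, rather than discarding them, is what gives you the multi-resolution representation to query later.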

Cost: this approach uses a lot of tokens — total API usage runs roughly 2-3x the document's own token count. But the quality is significantly better than single-pass approaches for documents over 50 pages.

Strategy 2: Retrieval-Augmented Processing

For question-answering over long documents, RAG is usually the right approach. But the standard "chunk, embed, retrieve top-K" pipeline has a hidden failure mode with long documents: it loses context. A chunk about "the termination clause" might not make sense without knowing which party it refers to, which is defined 30 pages earlier.

Our solution: contextual chunking. Before embedding, we prepend each chunk with a brief context header generated by the LLM: "This section is from a services agreement between Company A and Company B, specifically discussing termination conditions for the service provider." This additional context dramatically improves retrieval relevance. We saw a 31% improvement in answer accuracy on our legal QA benchmark after adding contextual headers.
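The mechanics are simple once the header generation is in place. Here's a sketch where `generate_header` is a stand-in for the LLM call (its signature and the prompt wording are assumptions for illustration, not our production setup):

```python
# Contextual chunking: prepend an LLM-generated context header to each
# chunk before embedding, so retrieval sees the document-level context.
from typing import Callable, List

def contextualize_chunks(
    doc_summary: str,
    chunks: List[str],
    generate_header: Callable[[str, str], str],  # (doc_summary, chunk) -> header
) -> List[str]:
    """Return chunks with context headers prepended, ready for embedding."""
    out = []
    for chunk in chunks:
        header = generate_header(doc_summary, chunk)
        out.append(header + "\n\n" + chunk)
    return out
```

Passing a short document summary into every header call keeps each call cheap while still letting the header mention facts (like party names) defined far from the chunk itself.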

We also use parent-child retrieval: we store chunks at two granularities (large ~2,000-token chunks and small ~500-token chunks). The small chunks are used for retrieval (more precise matching), but we return the parent large chunk to the LLM (more context for generation). LlamaIndex supports variants of this pattern natively, e.g. its SentenceWindowNodeParser and auto-merging retriever.
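The bookkeeping behind parent-child retrieval can be sketched as follows, with embedding and scoring stubbed out (a real system would do similarity search in a vector store, and `child_size` here is in characters purely for brevity):

```python
# Parent-child retrieval: small chunks are matched against the query,
# but the enclosing large chunk is what gets sent to the LLM.
from typing import Dict, List, Tuple

def build_parent_child_index(
    parents: List[str], child_size: int = 500
) -> Tuple[List[str], Dict[int, int]]:
    """Split each parent chunk into children; record child -> parent."""
    children: List[str] = []
    child_to_parent: Dict[int, int] = {}
    for p_idx, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            child_to_parent[len(children)] = p_idx
            children.append(parent[start : start + child_size])
    return children, child_to_parent

def retrieve_parents(
    child_scores: List[float],      # similarity score per child chunk
    child_to_parent: Dict[int, int],
    parents: List[str],
    top_k: int = 3,
) -> List[str]:
    """Rank children by score; return their (deduplicated) parent chunks."""
    ranked = sorted(range(len(child_scores)), key=lambda i: -child_scores[i])
    seen, out = set(), []
    for child_idx in ranked[:top_k]:
        p_idx = child_to_parent[child_idx]
        if p_idx not in seen:
            seen.add(p_idx)
            out.append(parents[p_idx])
    return out
```

The deduplication step matters: several top-scoring children often share one parent, and you don't want to feed the LLM the same large chunk twice.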

Strategy 3: Map-Reduce for Specific Extraction

When you need to extract specific information from every part of a long document (like "find all dates and deadlines in this 300-page RFP"), neither RAG nor full-context approaches work well. RAG might miss items that don't match the query embedding well, and full-context processing loses items in the middle.

We use a map-reduce approach: split the document into overlapping chunks, process each chunk independently ("map" phase — extract all dates and deadlines from this chunk), then merge and deduplicate the results ("reduce" phase). Overlap between chunks (we use 200 tokens of overlap) ensures items that fall on chunk boundaries aren't missed.
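The chunk-and-merge loop can be sketched like this, with tokens stubbed as a word list and `extract` standing in for the per-chunk LLM call (a real map phase would run those calls concurrently):

```python
# Map-reduce extraction over overlapping chunks.
from typing import Callable, List

def overlapping_chunks(tokens: List[str], size: int = 2500,
                       overlap: int = 200) -> List[List[str]]:
    """Slice a token list into chunks that share `overlap` tokens, so an
    item straddling a boundary appears whole in at least one chunk."""
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i : i + size])
        if i + size >= len(tokens):
            break
    return chunks

def map_reduce_extract(tokens: List[str],
                       extract: Callable[[str], List[str]],
                       size: int = 2500, overlap: int = 200) -> List[str]:
    """Map: run `extract` on each chunk. Reduce: merge and deduplicate,
    preserving first-seen order."""
    seen, found = set(), []
    for chunk in overlapping_chunks(tokens, size, overlap):
        for item in extract(" ".join(chunk)):
            if item not in seen:
                seen.add(item)
                found.append(item)
    return found
```

Note the dedupe in the reduce phase is doing double duty: it merges the per-chunk results and absorbs the duplicates that the overlap deliberately creates.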

This is embarrassingly parallel, which means it's fast — we process a 300-page document in about 45 seconds by running 50 chunk extractions concurrently. The reduce phase takes another 10 seconds. Total cost: about $1.50 with GPT-4o-mini for the map phase and GPT-4o for the reduce phase.

What We Don't Recommend

Don't just stuff everything into the context window and hope for the best. Don't use fixed-size chunking without overlap. Don't chunk on arbitrary character counts — respect sentence and paragraph boundaries. And don't assume that a bigger context window means better performance. We've consistently found that focused, well-structured smaller contexts outperform large unstructured contexts for accuracy.
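To make the chunking advice concrete, a boundary-respecting chunker packs whole paragraphs up to a size budget instead of cutting at a fixed offset. This sketch measures size in characters for brevity; swap in a tokenizer for token budgets:

```python
# Paragraph-boundary chunking: never split mid-paragraph.
from typing import List

def chunk_by_paragraph(text: str, max_chars: int = 8000) -> List[str]:
    """Greedily pack paragraphs into chunks of at most `max_chars`."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk if adding this paragraph would bust the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than the budget still becomes its own oversized chunk here; in practice you'd fall back to sentence boundaries for that case.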

Tags: llm, context-window, document-processing, rag