Context windows are one of the most misunderstood parts of modern LLM systems. People often hear that a model has a 128k, 200k, or even million-token context window and assume that means it can simply "read everything" with no downside. In practice, context windows are both a capability and a constraint. They determine how much information a model can consider in one interaction, but they also shape cost, latency, prompt design, retrieval strategy, and product architecture.
If you build with LLMs long enough, context becomes an engineering problem rather than a model spec. Long chat threads get expensive. Agent loops accumulate junk. Document pipelines silently exceed limits. Retrieval systems return too much context and make answers worse instead of better. Understanding context windows is how you stop treating long context as a magic feature and start treating it as a budget you manage deliberately.
What a context window is
A context window is the maximum amount of information a model can process in one request-response cycle, measured in tokens.
That last part matters. Context windows are measured in tokens, not words, pages, or characters.
A token is a chunk of text the model's tokenizer uses internally. Sometimes a token is a full word. Sometimes it is part of a word, a punctuation mark, a number fragment, or a chunk of code syntax. This is why "how many pages fit in 128k" is never a stable answer. Legal prose, source code, tables, JSON, and transcripts all tokenize differently.
In practical terms, your total context usage usually includes:
- System instructions
- Developer instructions
- User input
- Retrieved documents
- Tool outputs
- Conversation history you keep in the prompt
- Reserved output tokens for the model's answer
That means the headline context size is never fully available for your source material alone. Part of the window is always consumed by the scaffolding around the task.
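As a rough sketch, you can treat the window as a budget and subtract each component. The window size and component counts below are illustrative assumptions, not vendor figures:

```python
# A minimal context-budget sketch: the headline window is split across
# scaffolding, source material, and reserved output tokens.
# All numbers here are illustrative assumptions.

CONTEXT_WINDOW = 128_000  # hypothetical headline window size

budget = {
    "system_instructions": 1_200,
    "developer_instructions": 800,
    "conversation_history": 6_000,
    "retrieved_documents": 20_000,
    "tool_outputs": 4_000,
    "reserved_output": 8_000,
}

used = sum(budget.values())
available_for_source = CONTEXT_WINDOW - used
print(f"scaffolding + output reserve: {used} tokens")
print(f"left for new source material: {available_for_source} tokens")
```

Even with generous assumptions, a meaningful slice of the window is gone before the first page of source material arrives.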
Tokens, not words
This is the first place teams make mistakes. They estimate prompt size using word count or page count and then wonder why requests fail or bills spike.
A short technical string can tokenize badly. A dense JSON payload may use more tokens than you expect. Code often consumes context differently from prose. Repeated boilerplate across multiple retrieved chunks can quietly eat large amounts of budget.
This is why token counting should be part of production hygiene, especially for:
- Long documents
- Multi-turn assistants
- RAG pipelines
- Agent systems with repeated tool calls
- Structured prompts with examples
Practical example: count tokens with tiktoken
If you are building with OpenAI-compatible tokenization, tiktoken is a simple way to inspect prompt size before a request is sent.
# pip install -U tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "short prose": "Explain context windows in plain English.",
    "json": '{"task":"summarize","max_bullets":5,"tone":"concise"}',
    "code": "def add(a, b):\n    return a + b",
}

for label, text in samples.items():
    token_ids = enc.encode(text)
    print("-" * 60)
    print(label)
    print(f"chars: {len(text)}")
    print(f"tokens: {len(token_ids)}")
    print(token_ids[:20])

The important lesson is not the specific encoding output. It is the habit of measuring prompts in tokens instead of guessing.
How context size affects cost and latency
Longer context is useful, but it is not free.
Every additional token in the input has to be processed by the model. That affects cost directly because most APIs charge by input and output tokens. It also affects latency because larger prompts take longer to move through the model.
This becomes especially important in applications that repeatedly resend history, such as chat systems, copilots, and agent loops. A single message may look cheap in isolation, but a long conversation can become expensive if every turn replays thousands of old tokens.
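A toy calculation makes the compounding visible. The per-message token count below is an assumption, but the shape of the growth is the point:

```python
# A rough sketch of how replayed history compounds cost. Assumes each
# turn adds ~150 tokens and the full history is resent on every request.

turn_tokens = 150  # assumed average tokens per message
turns = 40         # a long conversation

total_billed_input = 0
history = 0
for _ in range(turns):
    history += turn_tokens         # the new message joins the history
    total_billed_input += history  # the whole history is sent again

print(f"final history size: {history} tokens")
print(f"total input tokens billed across the conversation: {total_billed_input}")
```

The final history is 6,000 tokens, but the cumulative billed input is over twenty times that, because every turn re-sends everything before it.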
Why larger windows raise cost even before you "fill" them
The existence of a large context window often changes developer behavior. Teams become less disciplined. Instead of curating context, they stuff more into the prompt because the model allows it.
That leads to three predictable problems:
- Rising per-request cost
- Slower response times
- Worse signal-to-noise ratio
The third problem is often missed. Even if a model can fit a lot of text, that does not mean every extra token helps. Long prompts frequently include stale conversation, irrelevant retrieval results, duplicated instructions, or tool logs the model does not need. You are paying more for context that makes the answer worse.
Latency is a product problem, not just an infra problem
When context grows, users feel it. Assistants become slower. Multi-step workflows feel heavy. Tool-using agents pause longer between actions. In production systems, context growth often shows up first as degraded user experience rather than obvious model failure.
That means context budgeting should be treated like performance engineering:
- Measure token usage by request type
- Measure latency by request type
- Track which prompt components are growing over time
- Cut repeated or low-value context aggressively
If you do not do this, context accumulation becomes a hidden tax on the product.
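A minimal sketch of that kind of measurement, with hypothetical request types and component names:

```python
from collections import defaultdict

# A minimal sketch of per-request-type token tracking, so growth in any
# prompt component shows up in metrics instead of on the bill.
# The request types and component names are illustrative assumptions.

usage = defaultdict(lambda: defaultdict(int))

def record(request_type: str, component: str, tokens: int) -> None:
    usage[request_type][component] += tokens

record("chat", "history", 4_200)
record("chat", "retrieval", 1_800)
record("chat", "history", 5_100)
record("agent_step", "tool_output", 3_600)

for request_type, components in usage.items():
    total = sum(components.values())
    print(request_type, total, dict(components))
```

In a real system you would feed these counts into whatever metrics pipeline you already run; the key is that "history is growing faster than retrieval" becomes a query, not a guess.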
The lost-in-the-middle problem
One of the biggest misconceptions about long context is that once text fits into the window, the model will use it evenly. That is not how real systems behave.
Long-context models can still struggle to use information buried in the middle of a large prompt. This is often described as the lost-in-the-middle problem: content near the beginning and end of the prompt may be used more reliably than content buried deep inside a long block of text.
This does not mean long context is useless. It means long context still needs structure.
Why this happens
At a high level, attention is not the same thing as perfect retrieval. The model can attend over long sequences, but long prompts create competition among instructions, examples, chunks, tool outputs, and intermediate reasoning traces.
If the prompt is messy, the model has to spread attention across too many signals:
- The key instruction may be far from the relevant evidence
- Irrelevant chunks may crowd out the useful ones
- Redundant text may waste budget without improving recall
- Conflicting instructions may dilute the intended behavior
That is why "just give the model the whole document" often underperforms a well-designed retrieval or summarization pipeline.
Signs you are seeing this problem
In practice, lost-in-the-middle issues often look like:
- The answer uses the first pages of a document and ignores the middle
- The model follows formatting instructions at the top but misses evidence later on
- A long conversation assistant remembers the opening instructions and the latest turn but misses important details from the middle
- Retrieval systems degrade after increasing the number of chunks
When that happens, the fix is rarely "increase context again." The better fix is usually prompt structure and context selection.
Practical strategies for long documents
The safest way to work with long documents is to avoid stuffing everything into one prompt by default. Long context should be used deliberately, not lazily.
When full-context prompting actually makes sense
There are cases where using a large chunk of the document in one pass is the right choice.
This is usually true when:
- The task depends on relationships across distant sections
- You need whole-document style or consistency
- The document is small enough to fit comfortably with output budget
- Retrieval would fragment the reasoning too much
Examples include reviewing a contract for internal consistency, checking whether a policy contradicts itself across multiple sections, or rewriting a short report while preserving tone and structure.
The key phrase is "fit comfortably." If the prompt is close to the model limit, you are leaving yourself no room for output quality or repeated iteration. Whole-document prompting works best when the document is large enough to benefit from joint reasoning but still small enough to leave healthy margin.
Map-reduce and staged processing
One of the most reliable patterns for long documents is staged processing.
Instead of asking one model call to do everything, split the work:
- Break the document into chunks
- Extract or summarize each chunk
- Merge the intermediate outputs
- Run a final synthesis step
This is often called a map-reduce pattern, and it works well because it matches the real constraint: the model does not need every raw token from every page at the same time. It needs the right intermediate state at each stage.
Staged processing is especially useful for:
- Research summaries
- Transcript analysis
- Large report generation
- Compliance review
- Multi-document comparisons
The tradeoff is that you must design the intermediate representation carefully. If the chunk summaries are vague, the final synthesis step will be weak. If the chunk summaries preserve key evidence, structure, and unresolved questions, the final pass can be much stronger than one giant stuffed prompt.
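The staged pattern above can be sketched in a few lines. The `summarize` function here is a deliberately naive stand-in for a model call, so the control flow runs without an API key:

```python
# A minimal map-reduce sketch over a long document. The summarize()
# step is a stand-in for an LLM call; here it just keeps the first
# sentence of each chunk so the pipeline is runnable as-is.

def chunk(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(piece: str) -> str:
    # stand-in for a model call: keep the first sentence of the chunk
    return piece.split(".")[0].strip() + "."

def map_reduce(document: str, chunk_size: int = 200) -> str:
    partials = [summarize(c) for c in chunk(document, chunk_size)]  # map
    return " ".join(partials)                                      # reduce

doc = "Section one covers pricing. " * 10 + "Section two covers limits. " * 10
print(map_reduce(doc))
```

In production, the map step is where the intermediate representation is designed: the summarizer should be prompted to preserve evidence and open questions, not just shorten text.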
Chunking
Chunking means splitting large documents into smaller sections that can be processed, retrieved, or summarized independently.
Good chunking improves:
- Retrieval quality
- Citation quality
- Cost control
- Traceability
Bad chunking creates fragmented evidence, broken references, and too much overlap.
In practice, chunking works best when it preserves structure:
- Use headings when possible
- Keep semantically related paragraphs together
- Avoid cutting tables, code blocks, or lists in awkward places
- Use overlap only where it improves continuity
Chunking is not only for RAG systems. It is also useful for preprocessing, summarization pipelines, and map-reduce style long-document workflows.
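A sketch of structure-aware chunking along those lines, splitting on markdown-style headings first and falling back to a size cap (the thresholds are arbitrary assumptions):

```python
# A sketch of structure-aware chunking: split on markdown-style
# headings first, then fall back to a hard size cap, so semantically
# related paragraphs stay together where possible.

def chunk_by_headings(text: str, max_chars: int = 1200) -> list[str]:
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # fall back to a hard size cap for oversized sections
    chunks = []
    for section in sections:
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        chunks.append(section)
    return chunks

doc = "# Intro\nShort overview.\n# Details\n" + "x" * 50
for c in chunk_by_headings(doc, max_chars=40):
    print(repr(c[:30]))
```

Real chunkers add overlap, respect tables and code blocks, and carry section titles into each chunk as metadata, but the priority order is the same: structure first, size cap second.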
Retrieval over stuffing
If you have a large corpus, retrieval is usually better than prompt stuffing.
Instead of sending every possible document chunk, retrieve the most relevant chunks for the current question and only send those. This keeps the context focused and reduces wasted tokens.
Why retrieval usually wins:
- Lower cost
- Lower latency
- Better relevance
- Easier debugging
Stuffing works best when:
- The document is genuinely small enough
- Full-document reasoning is required
- The information structure is linear and highly interdependent
Even then, you should still think carefully about prompt order and output budget.
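A minimal sketch of retrieval over stuffing. Real systems rank with embeddings; the word-overlap scorer here is a deliberately simple stand-in:

```python
# A minimal retrieval sketch: score chunks by word overlap with the
# question and send only the top-k, instead of stuffing every chunk
# into the prompt. The scorer is a stand-in for embedding similarity.

def score(question: str, chunk: str) -> int:
    q_words = set(question.lower().split())
    return len(q_words & set(chunk.lower().split()))

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)
    return ranked[:k]

chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refund requests require an order number.",
]
print(retrieve("how do refunds work and how long do they take", chunks))
```

Whatever the scoring method, the shape is the same: rank, cut, and send a small focused context instead of the whole corpus.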
Summarization
Summarization is a context compression tool.
If an agent loop or assistant has accumulated too much conversation history, you often do not need the full raw transcript. You need a structured summary of:
- Decisions made
- Open questions
- Constraints
- User preferences
- Pending tasks
That kind of summary preserves working state without replaying every message.
Summarization is especially useful for:
- Long support conversations
- Agent handoffs
- Multi-step research workflows
- Internal copilots that stay open all day
The key is to summarize into durable state, not vague prose. "The user asked about architecture" is weak. "The user wants a Redis-backed queue, no managed services, and p95 under 250ms" is useful.
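One way to make that concrete is to summarize into typed fields instead of prose. The field names below mirror the checklist above and are assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

# A sketch of summarizing history into durable, structured state
# rather than vague prose. Field names are illustrative assumptions.

@dataclass
class ConversationState:
    decisions: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    preferences: list[str] = field(default_factory=list)
    pending_tasks: list[str] = field(default_factory=list)

    def as_prompt_block(self) -> str:
        # render only the non-empty fields back into the prompt
        lines = []
        for name, items in vars(self).items():
            if items:
                lines.append(f"{name}: " + "; ".join(items))
        return "\n".join(lines)

state = ConversationState(
    constraints=["Redis-backed queue", "no managed services", "p95 under 250ms"],
    pending_tasks=["draft the architecture doc"],
)
print(state.as_prompt_block())
```

A summarization prompt that must fill these fields produces far more durable state than one that is simply asked to "summarize the conversation."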
Sliding windows
A sliding window keeps only the most relevant recent context while older context is summarized or dropped.
This is often the right pattern for chat and agent systems:
- Keep the latest few turns in raw form
- Keep durable state as a structured summary
- Drop low-value intermediate chatter
Sliding windows are simple but effective because they reflect how many workflows actually behave. The model usually needs the current task state and recent interaction details, not every single token from the last hour.
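A sliding window can be sketched in a few lines. The bracketed summary placeholder stands in for a real summarization call:

```python
# A sliding-window sketch: keep the last N turns raw and fold
# everything older into a summary slot. The placeholder string
# stands in for an actual model-generated summary.

def apply_sliding_window(history: list[str], keep_last: int = 4) -> list[str]:
    if len(history) <= keep_last:
        return history
    dropped = history[:-keep_last]
    summary = f"[summary of {len(dropped)} earlier messages]"
    return [summary] + history[-keep_last:]

history = [f"turn {i}" for i in range(10)]
print(apply_sliding_window(history))
```

In practice you would regenerate the summary from the dropped turns plus the previous summary, so durable state keeps rolling forward instead of being lost at the window edge.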
Put the question near the relevant context
Long-context prompting works better when the task is anchored clearly near the evidence it depends on.
In practice:
- Put the instruction before the context if it governs the entire task
- Put the final question or extraction target near the end if the model should answer from the preceding material
- Use headings and delimiters so the model can distinguish instructions from source text
This matters because prompt layout influences what the model uses effectively. A good long-context prompt is not only shorter. It is easier to navigate.
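A sketch of that layout, with the governing instruction first, delimited sources in the middle, and the question last, next to the evidence it depends on (the delimiters are arbitrary conventions, not a required format):

```python
# A sketch of a long-context prompt layout: governing instruction
# first, clearly delimited source material in the middle, and the
# final question at the end, near the evidence.

def build_prompt(instruction: str, documents: list[str], question: str) -> str:
    parts = [f"## Instruction\n{instruction}"]
    for i, doc in enumerate(documents, 1):
        parts.append(f"## Source {i}\n<document>\n{doc}\n</document>")
    parts.append(f"## Question\n{question}")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Answer only from the sources below. Cite the source number.",
    ["Refunds take 5 business days.", "Orders ship within 24 hours."],
    "How long do refunds take?",
)
print(prompt)
```

The exact delimiters matter less than consistency: the model should never have to guess where instructions end and evidence begins.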
Context window sizes of major models
These figures are time-sensitive: they reflect official vendor documentation as of March 31, 2026.
GPT-5.4
OpenAI's official model page lists GPT-5.4 with a 1,050,000-token context window and 128,000 max output tokens.
That is unusually large and changes what is feasible for long documents, codebases, and multi-step agentic workflows. But it also raises the importance of discipline: once you can fit huge prompts, it becomes easier to overspend on low-value context.
Claude Sonnet 4.6
Anthropic's official documentation for Claude Sonnet 4.6 lists a 1 million token context window.
For practical planning, that puts it in the same general long-context class as frontier million-token systems. But remember that vendor comparisons are never perfect because prompt construction, hidden system behavior, and output budgeting differ across platforms.
Gemini
Google's official Gemini documentation states that Gemini 3 models support a 1 million token input context window and up to 64k tokens of output.
That makes Gemini a serious option for long-context workflows, especially when multimodal inputs or large corpora are involved.
The real lesson from these numbers
Million-token context windows are real. But they do not remove the need for architecture.
They are best understood as expanding the set of viable strategies, not replacing judgment. You can now do more whole-document and whole-session workflows directly. You still should not assume that sending more context is automatically better.
Agent loops and context accumulation
Agent systems make context problems worse because they generate their own history.
A normal chat app mostly stores user and assistant turns. An agent loop may also accumulate:
- Tool call arguments
- Tool outputs
- Plans
- Intermediate reasoning summaries
- Scratchpad text
- Failed attempts
If all of that is replayed each turn, the context window fills surprisingly fast.
Why agents hit limits earlier than chat apps
An agent may perform ten or twenty internal steps for one user-visible answer. If each step appends logs or observations to the next request, context growth becomes multiplicative rather than linear.
This is why many agent systems degrade over time:
- Cost rises sharply
- Latency rises sharply
- The agent becomes distracted by stale traces
- Important state gets buried
The solution is not bigger windows alone. It is state management.
Better patterns for agent memory
A practical agent should separate at least three kinds of memory:
- Durable state: facts, decisions, constraints
- Working memory: the current subtask
- Ephemeral logs: tool traces and low-value intermediate output
Durable state should survive. Working memory should stay compact. Ephemeral logs should often be summarized or dropped once they are no longer needed.
If you treat all memory as equally important, the agent eventually drowns in its own transcript.
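A sketch of those three tiers. The class and method names are illustrative assumptions:

```python
# A sketch of tiered agent memory. Durable state always survives,
# working memory holds only the current subtask, and ephemeral logs
# are dropped once a step completes.

class AgentMemory:
    def __init__(self):
        self.durable: list[str] = []    # facts, decisions, constraints
        self.working: list[str] = []    # the current subtask
        self.ephemeral: list[str] = []  # tool traces, scratch output

    def end_step(self) -> None:
        # ephemeral logs do not carry into the next request
        self.ephemeral.clear()

    def prompt_context(self) -> list[str]:
        return self.durable + self.working + self.ephemeral

memory = AgentMemory()
memory.durable.append("constraint: must use the internal billing API")
memory.working.append("subtask: fetch last month's invoices")
memory.ephemeral.append("tool trace: GET /invoices -> 200, 34KB body")
memory.end_step()
print(memory.prompt_context())
```

The important design decision is the boundary: anything promoted to durable state should earn its place, because it will be paid for on every future request.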
What to do when you hit the limit
At some point, every serious system hits a context limit or gets close enough that performance degrades before a hard failure.
When that happens, do not just trim tokens randomly. Use a structured response.
1. Identify what is consuming the budget
Break prompt size into components:
- System prompt
- Chat history
- Retrieval context
- Tool output
- Examples
- Reserved output budget
Teams often discover that the biggest consumer is not the source document. It is duplicated instructions, repeated examples, or oversized tool outputs.
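A sketch of that audit. The whitespace split below is a crude stand-in for a real tokenizer, used only to keep the example dependency-free:

```python
# A sketch of a per-component prompt audit: measure each component
# and report the biggest consumers first. approx_tokens() is a crude
# stand-in for a real tokenizer such as tiktoken.

def approx_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for real token counting

components = {
    "system_prompt": "You are a helpful assistant. " * 10,
    "chat_history": "user: hi assistant: hello " * 200,
    "retrieval": "chunk text " * 150,
    "tool_output": "log line " * 50,
}

sizes = {name: approx_tokens(text) for name, text in components.items()}
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>13}: {size}")
```

Once the breakdown is visible, the cut list usually writes itself: in this hypothetical example, history dwarfs everything else, so that is where trimming starts.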
2. Drop low-value context first
The first tokens to cut are usually:
- Verbose intermediate logs
- Stale history
- Duplicated instructions
- Low-relevance retrieved chunks
Do not cut the highest-value evidence just to preserve noisy history.
3. Summarize before you truncate blindly
If older context contains important state, summarize it into a structured form before dropping raw text. This preserves continuity without blowing the budget.
4. Re-architect if the pattern is recurring
If you repeatedly hit the limit, the problem is probably not one bad prompt. It is the system design.
Typical fixes:
- Add retrieval instead of stuffing
- Use chunked processing
- Introduce summary state
- Split tasks into multiple calls
- Keep long-term memory outside the live prompt
5. Reserve output tokens deliberately
Many failures happen because teams forget that output also consumes the window. If the prompt already nearly fills the limit, the model has little room left to answer properly.
Always leave explicit output budget, especially for:
- Long-form generation
- Structured JSON
- Tool-using agents
- Multi-step plans
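A sketch of reserving output budget up front. The window size, reserve, and safety margin are illustrative assumptions:

```python
# A sketch of explicit output budgeting: reserve output tokens first,
# then cap the input to what remains. All numbers are illustrative
# assumptions, not vendor figures.

def input_budget(context_window: int, reserved_output: int,
                 safety_margin: int = 500) -> int:
    remaining = context_window - reserved_output - safety_margin
    if remaining <= 0:
        raise ValueError("output reserve exceeds the context window")
    return remaining

print(input_budget(context_window=128_000, reserved_output=8_000))
```

The order of operations is the point: decide how much answer you need first, and only then decide how much context you can afford to send.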
A practical mental model
The best way to think about context windows is not as storage. Think of them as an attention budget.
A larger budget lets the model consider more material in one pass. But you still have to decide what deserves to be in that budget.
That leads to a simple operating rule:
- Measure tokens
- Prefer relevant context over maximum context
- Compress state when possible
- Retrieve instead of stuffing
- Design agents so they do not carry their entire life story forever
What this means for production AI apps
In production, context windows are never just a model spec. They are a product design constraint.
If you manage context well, you get:
- Lower cost
- Lower latency
- Better reliability
- Cleaner agent behavior
- More understandable failure modes
If you manage context badly, you get slow assistants, expensive prompts, brittle long-document workflows, and agents that gradually become less useful as they accumulate junk.
The practical takeaway is simple. Long context is powerful, but good systems still rely on selection, compression, and structure. The teams that win with long-context models are usually not the ones who paste the most text into the prompt. They are the ones who decide, carefully and consistently, what the model actually needs to see right now.
That discipline becomes more important, not less, as context windows grow. Bigger windows expand your options, but they also increase the temptation to avoid architectural decisions. In real production systems, context is still something you manage, not something you outsource to the model spec.