
Token limits and context windows: how to manage them effectively (2026)

What tokens actually are, how context windows behave in production, and the practical patterns teams use to manage long prompts, RAG pipelines, and agent loops.

By Knovo Team · 2026-04-13 · 10 min read · Last verified 2026-04-13

Context windows are one of the most important and most misunderstood constraints in LLM systems. Teams often focus on the headline number: 128k, 200k, 1M. Then they design prompts as if that number means "the model can understand everything I put in." In practice, that is not how good systems are built. Large windows are useful, but they are still budgets. They affect cost, latency, retrieval design, prompt shape, and failure behavior.

This matters because context problems rarely announce themselves clearly. A request might fail because it exceeds the limit. Or it might succeed and still get worse because the useful evidence was buried in noise. Or an agent might become expensive and confused after twenty turns because every old tool result stayed in the prompt. Token management is therefore not a side topic. It is part of production architecture.

What tokens actually are

Tokens are the units a model actually processes. They are not the same thing as words.

That difference matters immediately because most developers estimate prompt size by intuition. They count characters, pages, or rough word totals. The model does none of those things. It consumes tokenized text.

Tokenization and BPE

Most modern language model tokenizers use subword approaches derived from methods like byte pair encoding, often shortened to BPE.

The practical idea is simple:

  1. common whole words may become one token
  2. rare words may become multiple tokens
  3. punctuation, numbers, whitespace patterns, and code fragments may each tokenize differently

This is why token != word.

For example:

  1. a short common English word may be one token
  2. a URL may become many tokens
  3. a JSON object may tokenize differently from the same information in prose
  4. a code snippet may be denser in tokens than it looks

That is why cost estimation, truncation behavior, and context planning should always start with token count, not rough page count.

Why subwords exist at all

Subword tokenization is a compromise between two bad extremes.

  1. word-level tokenization is too rigid for rare words, typos, code, and multilingual text
  2. character-level tokenization is too long and inefficient

BPE-style tokenization gives models a usable middle ground. The cost is that token counts are often unintuitive to humans.
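
To make the middle ground concrete, here is a toy sketch of the core BPE merge loop: repeatedly fuse the most frequent adjacent pair of symbols into one symbol. This is only an illustration of the idea, not any vendor's real tokenizer, which trains merges on a large corpus and works on bytes.

```python
# Toy illustration of the BPE merge idea: start at character level and
# repeatedly replace the most frequent adjacent symbol pair with one
# merged symbol. NOT a production tokenizer -- just the core loop.
from collections import Counter

def bpe_merges(text: str, num_merges: int) -> list[str]:
    symbols = list(text)  # character-level start
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_merges("lowlowlow", 2))  # ['low', 'low', 'low']
```

After two merges the repeated string collapses into whole-word symbols, which is exactly why common words tend to become single tokens while rare strings stay fragmented.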

Why this matters in production

Tokenization affects:

  1. prompt cost
  2. context fit
  3. latency
  4. truncation risk

If you are building assistants, extractors, or agents, token counting is not optional hygiene. It is one of the few ways to understand why a prompt is getting slower, more expensive, or more brittle over time.

Why the same content can tokenize very differently

A paragraph of prose, a JSON blob, a markdown table, and a code snippet can all look similarly sized to a human and still produce very different token counts.

That is why teams get surprised by:

  1. long URLs
  2. SQL queries
  3. stack traces
  4. source code
  5. copied logs

These often consume more context than expected. In production, this is one reason tool outputs and raw documents should almost never be pasted into prompts blindly.

A simple way to measure it

The most useful habit is to count tokens before you trust your intuition.

# pip install -U tiktoken
import tiktoken
 
enc = tiktoken.get_encoding("cl100k_base")
 
samples = [
    "Explain context windows in plain English.",
    '{"task":"summarize","audience":"cto","max_items":5}',
    "SELECT * FROM invoices WHERE customer_id = 42 AND status = 'open';",
]
 
for text in samples:
    tokens = enc.encode(text)
    print("-" * 40)
    print(text)
    print("chars:", len(text))
    print("tokens:", len(tokens))

This is not about exact accounting for every vendor model. It is about making token size visible before prompt cost and prompt failure become surprises.

Token density by content type (approximate tokens per 100 characters):

  1. plain English: ~25 tokens
  2. markdown: ~33 tokens
  3. JSON / structured: ~38 tokens
  4. URLs / code: ~50 tokens

These are approximate values; URLs and code can cost 2× the tokens of equivalent plain prose. Token density varies significantly by content type, which is why counting words or characters instead of tokens leads to prompt budget surprises.

Context window sizes across major models

Context window size is the maximum amount of input and output context a model can handle in a request cycle. The exact accounting differs by provider and feature surface, but the basic idea is stable: all the tokens you send, plus room for the model's answer, have to fit inside a budget.

As of April 13, 2026, official docs indicate:

  1. OpenAI's GPT-5.4 (current flagship as of March 2026) has a 272,000-token standard context window and 128,000 max output tokens. An experimental 1M-token window is available via the Codex API at a 2× input surcharge.
  2. Anthropic's docs state Claude Opus 4.6 and Claude Sonnet 4.6 have a 1M-token context window, while smaller Claude models may use 200k.
  3. Google's Gemini docs state many Gemini models support 1M or more tokens of context, and their long-context guide frames 1M-token workflows as a core capability.

What these numbers mean in practice

The headline numbers are useful, but they are easy to misread.

A large context window does not mean:

  1. every token is equally useful
  2. long prompts are free
  3. retrieval is obsolete
  4. prompt structure no longer matters

What it really means is that you have more room to design the system. Sometimes that room lets you include whole documents or long chat histories directly. Sometimes it just gives you more margin before a bad prompt strategy breaks.

Bigger windows change architecture, not physics

Large windows make some patterns more viable:

  1. whole-document summarization
  2. longer conversation continuity
  3. larger retrieval candidate sets
  4. more many-shot examples

But they do not eliminate tradeoffs. Long context still increases cost, still increases latency, and still creates competition among instructions, examples, retrieved chunks, and prior turns.

Output budget matters too

One common mistake is to think of the context window as pure input capacity.

It is not. The model still needs room to answer. If you fill almost the entire window with input, you create a higher chance of truncated or low-quality output, especially for long-form generation or multi-step reasoning.

A practical prompt budget usually needs space for:

  1. instructions
  2. user task
  3. context
  4. output
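
The budget arithmetic above is simple enough to enforce in code. This is a minimal sketch with illustrative numbers (the window size and output reservation are assumptions, not any specific model's limits):

```python
# A minimal prompt-budget check: reserve output space up front instead
# of filling the whole window with input. Numbers are illustrative.
CONTEXT_WINDOW = 128_000   # hypothetical model window
RESERVED_OUTPUT = 4_000    # room the answer needs

def input_budget(instructions_tok: int, task_tok: int) -> int:
    """Tokens left for retrieved context after fixed parts and output."""
    return CONTEXT_WINDOW - RESERVED_OUTPUT - instructions_tok - task_tok

remaining = input_budget(instructions_tok=1_200, task_tok=300)
print(remaining)  # 122500
```

The point is not the exact numbers but the order of operations: subtract the output reservation first, so long-form answers never get squeezed out by input.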

What happens at the limit

The most obvious failure is hard overflow.

Some APIs fail the request when prompt plus output budget exceeds the context window. OpenAI's Responses API docs explicitly document truncation behavior options and note that with truncation disabled, oversized input results in an error. Anthropic's context-window docs say newer Claude models return a validation error when prompt and output tokens exceed the context window rather than silently truncating.

That is the easy case because you notice it.

The harder case is soft degradation before the hard limit.

Truncation

When truncation is enabled or implemented at the application layer, something gets dropped.

The problem is not only losing tokens. It is losing the wrong tokens:

  1. early instructions
  2. old but still important conversation turns
  3. retrieval evidence that explains edge cases
  4. tool outputs that contain critical state

This is why naive truncation creates weird behavior. The model still answers, but it answers without the pieces that made the answer safe or relevant.
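
One way to avoid dropping the wrong tokens is to truncate by role rather than by position. The sketch below keeps the system message and the newest turns, dropping the oldest non-system turns first; `count_tokens` is an injected function (in practice a real tokenizer, here any callable):

```python
def trim_history(messages, budget, count_tokens):
    """Keep the first (system) message and as many recent turns as fit.
    Oldest non-system turns are dropped first -- the reverse of naive
    tail truncation, which would delete the instructions."""
    system, rest = messages[0], messages[1:]
    kept = []
    used = count_tokens(system)
    for msg in reversed(rest):          # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break                       # older turns no longer fit
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```

For example, `trim_history(["SYS", "aaaa", "bb", "cc"], budget=8, count_tokens=len)` keeps the system message and the two newest turns while dropping the oldest one.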

Lost in the middle

Even when everything fits, long prompts can still perform badly because of lost-in-the-middle behavior. The practical observation is that models often use beginning and end sections of long prompts more reliably than material buried deep in the middle.

This is not a reason to avoid long context altogether. It is a reason to structure it.

If the most important evidence is hidden inside a giant block of text, the model may underuse it even though it technically fit inside the window.

Degraded attention

Long contexts increase competition for attention.

As prompts grow, the model has to balance:

  1. system instructions
  2. user requests
  3. examples
  4. retrieved passages
  5. tool traces
  6. earlier conversation state

That is why larger context can make answers worse when the extra text is low value. More tokens are not automatically more signal.

Why this feels random in production

Context failures often look inconsistent from the outside. One long prompt works. Another similar prompt degrades badly. That usually happens because the system has entered a soft-failure zone where arrangement, relevance, and output budget matter more than the raw fact that the prompt technically fits.

Strategies for managing long contexts

Good long-context design is usually about deciding what not to include.

Chunking

Chunking is the basic pattern for long documents and retrieval pipelines.

Instead of sending one huge block, split the material into smaller semantically coherent units. That helps with:

  1. retrieval precision
  2. summarization quality
  3. partial processing
  4. debugging

This is one reason RAG systems work at all. They do not try to make every prompt carry the whole corpus.
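
The packing step of chunking can be sketched in a few lines. This greedy paragraph packer is a simplification: real pipelines usually add overlap between chunks and split paragraphs that exceed the limit on their own.

```python
def chunk_paragraphs(text, max_tokens, count_tokens):
    """Greedily pack paragraphs into chunks under max_tokens each.
    Real pipelines often add overlap and split oversized paragraphs;
    this sketch shows only the packing step."""
    chunks, current, used = [], [], 0
    for para in text.split("\n\n"):
        cost = count_tokens(para)
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))  # flush the full chunk
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries rather than fixed character offsets is what keeps chunks semantically coherent, which in turn is what makes retrieval over them precise.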

Sliding windows

Sliding windows are useful for conversations and agent loops.

The basic idea is:

  1. keep the most recent turns in raw form
  2. summarize older state
  3. drop low-value history

This gives you continuity without replaying the system's entire life story every turn.
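
The three-step pattern above reduces to a small function. `summarize` is injected because in practice it is usually an LLM call; here it can be any text-compression callable:

```python
def sliding_window(turns, keep_recent, summarize):
    """Keep the newest turns verbatim; compress everything older into
    a single summary entry. `summarize` is any compression function --
    in real systems usually an LLM call, here an injected dependency."""
    if len(turns) <= keep_recent:
        return turns                      # nothing old enough to drop
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old)] + recent
```

Each time the loop runs, the prompt carries one summary plus a fixed number of raw turns, so its size stays roughly constant no matter how long the session gets.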

Summarization

Summarization is a compression layer.

If a conversation or long workflow has produced many tokens, a structured summary can preserve:

  1. decisions made
  2. constraints
  3. open questions
  4. user preferences
  5. pending actions

This is usually better than keeping dozens of old turns verbatim.
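
A structured summary works best when it has an explicit schema rather than free text. The field names below are illustrative, not a standard; the point is that durable state is carried forward as typed slots instead of verbatim turns:

```python
from dataclasses import dataclass, field

@dataclass
class SessionSummary:
    """Durable state carried forward instead of verbatim history.
    Field names are illustrative, not a standard schema."""
    decisions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    preferences: list[str] = field(default_factory=list)
    pending_actions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render only non-empty slots as compact prompt text."""
        parts = []
        for name, items in vars(self).items():
            if items:
                parts.append(name + ": " + "; ".join(items))
        return "\n".join(parts)
```

Rendering only the non-empty slots keeps the summary dense, and a fixed schema makes it easy to update the state after each turn instead of re-summarizing from scratch.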

Selective retrieval

Selective retrieval is often more important than raw context size.

Instead of putting everything in the prompt, retrieve only the most relevant pieces for the current question. This is usually the right approach when:

  1. the corpus is large
  2. the query is specific
  3. cost and latency matter
  4. answer quality depends on relevance more than breadth

This is exactly why retrieval architecture stays important even in a long-context world.

Long-running chats and agent loops

The moment a system becomes multi-turn, context management becomes much harder.

A long-running assistant may accumulate:

  1. old user questions
  2. stale assistant responses
  3. tool traces
  4. plans
  5. intermediate summaries

If all of that stays in the live prompt forever, the system gets slower, more expensive, and usually less reliable. This is why agent systems often need separate layers for current context, summarized durable state, and discarded low-value history.

Measuring context health in production

One reason context problems linger is that teams often do not instrument them.

A good production system should track at least:

  1. average prompt tokens by route
  2. average output tokens by route
  3. truncation or overflow errors
  4. retrieval payload size
  5. latency as prompt size grows

These metrics are useful because they make hidden prompt growth visible. Many teams only notice context problems after answers get worse. By then the system has often been carrying prompt debt for weeks.
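
Even a minimal in-process tracker makes prompt growth visible. This sketch keeps running averages per route; a production system would export the same counters to a real metrics backend instead:

```python
from collections import defaultdict

class ContextMetrics:
    """Minimal in-process tracker of prompt/output token sizes per route.
    A real system would export these counters to a metrics backend."""
    def __init__(self):
        self.totals = defaultdict(lambda: {"n": 0, "prompt": 0, "output": 0})

    def record(self, route, prompt_tokens, output_tokens):
        t = self.totals[route]
        t["n"] += 1
        t["prompt"] += prompt_tokens
        t["output"] += output_tokens

    def avg_prompt(self, route):
        t = self.totals[route]
        return t["prompt"] / t["n"] if t["n"] else 0.0

m = ContextMetrics()
m.record("chat", 1_000, 200)
m.record("chat", 3_000, 400)
print(m.avg_prompt("chat"))  # 2000.0
```

Watching `avg_prompt` per route over time is usually enough to catch hidden prompt growth weeks before it shows up as degraded answers.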

What to watch for

The most common warning signs are:

  1. rising latency with no model change
  2. rising cost with similar user traffic
  3. more inconsistent long-form answers
  4. agents becoming less reliable over long sessions

Those are often context-window problems before they are model-quality problems.

Practical patterns

The most useful production patterns are less glamorous than the model headlines.

Prompt compression

Prompt compression means making the prompt denser without losing what matters.

This can include:

  1. removing redundant instructions
  2. shortening examples
  3. deleting stale tool output
  4. summarizing low-value context

Compression is not about making every prompt tiny. It is about protecting the high-value tokens.

A practical rule for prioritization

If everything is included, nothing is prioritized.

One reliable ordering rule is:

  1. keep stable constraints
  2. keep the current task explicit
  3. keep only the most relevant evidence
  4. summarize older state
  5. remove decorative or repeated text first

Context prioritization

Not all tokens deserve equal status.

A practical prompt often needs a priority order:

  1. system or developer constraints
  2. current task instructions
  3. highly relevant retrieved evidence
  4. recent conversation state
  5. lower-value historical context

If you do not define this hierarchy somewhere in the system, the model will inherit your clutter.
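
Defining that hierarchy in code can be as simple as filling the prompt in priority order and dropping whatever no longer fits. `sections` is ordered highest priority first, and `count_tokens` is any token-counting callable:

```python
def assemble_prompt(sections, budget, count_tokens):
    """Fill the prompt in priority order; lower-priority sections are
    the first to be dropped when the budget runs out. `sections` is
    ordered highest priority first."""
    kept, used = [], 0
    for text in sections:
        cost = count_tokens(text)
        if used + cost > budget:
            continue   # skip what doesn't fit; higher priority already kept
        kept.append(text)
        used += cost
    return "\n\n".join(kept)
```

Because constraints and the current task come first in the list, they are guaranteed a place in the prompt; historical context is what gets sacrificed under pressure, not the instructions.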

When to use RAG vs long context

The real question is not "RAG or long context?" It is "what makes the system most reliable for this workload?"

Use long context directly when:

  1. the document set is small enough
  2. whole-document reasoning matters
  3. the cost is acceptable
  4. the workflow benefits from seeing the full source at once

Use RAG when:

  1. the corpus is large
  2. only a few sections are relevant to each query
  3. latency and cost matter
  4. you want inspectable retrieval and citations

In practice, hybrid designs are common. A system may use RAG to narrow the material, then use a longer context window to reason across the retrieved set. That is often better than either extreme alone.

A simple rule of thumb

If the answer usually lives in a small subset of a large corpus, start with RAG.

If the answer usually depends on relationships across one or a few long documents, long-context prompting may be simpler.

If you have both conditions, the practical pattern is:

  1. retrieve first
  2. narrow the material
  3. use long-context reasoning on the narrowed set
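
The retrieve-then-narrow pattern above can be sketched end to end. The term-overlap scoring here is a deliberately naive stand-in for real embedding retrieval; the packing step is the part that matters:

```python
def retrieve_then_pack(query_terms, chunks, top_k, count_tokens, budget):
    """Hybrid pattern: score chunks by naive term overlap (a stand-in
    for real embedding retrieval), take the top_k, then pack as many
    as fit into the long-context budget."""
    scored = sorted(
        chunks,
        key=lambda c: sum(term in c.lower() for term in query_terms),
        reverse=True,
    )
    packed, used = [], 0
    for chunk in scored[:top_k]:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        packed.append(chunk)
        used += cost
    return packed
```

Retrieval narrows the corpus to candidates; the budget check then decides how many of those candidates the long-context reasoning step actually sees.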

When long context is worth the cost

Long context is most worth paying for when the answer genuinely depends on relationships across distant parts of a document or conversation.

Examples include:

  1. comparing sections of a long contract
  2. tracking a multi-turn investigation
  3. reasoning across multiple retrieved passages that must be interpreted together

If the task does not require that kind of cross-span reasoning, selective retrieval and compression are usually the better default.

For the underlying model mechanics behind these tradeoffs, see How LLMs actually work. For retrieval-specific design, see How to build a RAG system from scratch. For the prompt-side design layer, see The complete prompt engineering guide.

Context window management strategies at a glance:

  1. Chunking: split into semantic units. Best for large document corpora, RAG pipelines, and partial processing.
  2. Sliding window: keep recent turns, summarize old ones. Best for long conversations, agent loops, and session continuity.
  3. Selective retrieval: fetch only relevant chunks. Best for large corpora with specific queries, cost- and latency-sensitive systems, and inspectable citations.
  4. Summarization: compress old state. Best for multi-turn workflows, agent memory, and preserving decisions.

Hybrid designs are common: retrieve first, then reason over the narrowed set with long context. The right choice among these four strategies depends on corpus size, query specificity, and cost tolerance.

What this means

Token limits and context windows are not just API trivia. They shape how reliable, fast, and affordable your system becomes.

The practical lesson is straightforward:

  1. count tokens instead of guessing
  2. do not treat the full context window as a target to fill
  3. preserve the most useful evidence
  4. summarize or retrieve instead of stuffing everything
  5. choose between long context and RAG based on workload, not hype

Large windows are useful because they give you more options. Good systems still win by selecting, compressing, and prioritizing context deliberately. That is the difference between a prompt that merely fits and a system that actually works well at scale.

Teams that manage context well usually get three benefits at once: better answer quality, lower spend, and fewer weird edge-case failures. That is why token and context management should be treated as a core engineering discipline, not just a prompt-writing detail.

In practice, that discipline compounds over time. Small prompt improvements made early usually prevent much larger reliability, latency, and cost problems from accumulating later.
