← All articles

What is an LLM context window?

May 22, 2026 · 14 min read

An LLM context window is the maximum amount of text a language model can process in a single inference pass, measured in tokens. It is the model's working memory for one request: system instructions, tool definitions, attached files, and every prior turn in a chat all compete for the same budget. When the window fills up, something has to go—older messages, attachments, or quality.

Understanding the window is practical engineering, not trivia. It drives cost, latency, what the model can remember, and when long chats start to fail.

Tokens, input, and output

Tokens are the smallest units models read and bill. Roughly one token equals three to four characters of English prose, but code, JSON, and non-Latin scripts often consume more tokens per visible character. Providers count tokens on both sides of the call.

Term Meaning
Input tokens Everything you send: system prompt, history, tool schemas, RAG chunks, images (converted to tokens)
Output tokens The model's generated reply; capped separately by max output on most APIs
Context window Upper bound on input the model accepts in one call (output may or may not count against the same pool, depending on provider)
Working memory Not durable storage—closing the chat or starting a new session clears it unless the product adds external memory

Advertised vs effective context. Vendors publish large limits—128K, 1M, even 10M tokens—but model quality often degrades well before the hard cutoff. Retrieval accuracy can drop for facts buried in the middle of very long prompts ("lost in the middle"). Treat the published number as a ceiling, not a guarantee that every token is equally useful.

Context windows among top 5 popular models

The table below covers five families teams reach for most often: OpenAI (ChatGPT/API), Anthropic (Claude), Google (Gemini), Meta (open-weight Llama), and DeepSeek (cost-efficient API). Figures reflect official documentation as of May 2026; limits, tiers, and pricing change—verify on each provider's models page before you design production workloads.

Model (representative) Provider Input context Max output Notes
GPT-5.5 OpenAI 1M tokens 128K tokens Flagship reasoning/coding model
GPT-5.4-mini / nano OpenAI 400K tokens 128K tokens Lower cost and latency variants
Claude Sonnet 4.6 Anthropic 1M tokens 64K tokens Balanced speed and intelligence
Gemini 2.5 Pro Google 1,048,576 tokens (~1M) 65,535 tokens Strong long-document and multimodal input
Llama 4 Scout Meta (open-weight) 10M tokens Model-dependent Self-hosted; practical length limited by GPU memory
DeepSeek V4 Flash / Pro DeepSeek 1M tokens 384K tokens Low API cost; deepseek-chat aliases V4 Flash

OpenAI. Frontier models in the GPT-5 family advertise up to 1M input tokens with 128K max output on flagship tiers; smaller variants trade context for price (for example 400K on mini/nano). ChatGPT UI limits may differ from the API. Legacy GPT-4o remains at 128K context for teams still pinned to that model ID.

Anthropic. Claude Sonnet 4.6 and Claude Opus 4.7 support 1M-token context with up to 128K output on Opus and 64K on Sonnet. Claude Haiku 4.5 stays at 200K input—enough for many agent loops, not full-book ingestion. Long-context tiers may carry pricing surcharges above 200K tokens on some platforms.

Google. Gemini 2.5 Pro ships with a ~1M-token input window and 65,535 max output tokens in the API. Gemini is a common choice when the task is "load a large corpus once" rather than many small calls. Check Google's models documentation for Gemini 3.x variants if you need even larger windows on newer releases.

Meta (Llama 4). Llama 4 Scout is the headline 10M-token open-weight model—orders of magnitude beyond Llama 3's 128K. That limit assumes you run the model yourself (or use a host that provisions enough KV-cache memory). On a single GPU, effective context is often far smaller than 10M; the number describes what the architecture targets, not what fits on a laptop.

DeepSeek. DeepSeek V4 models offer 1M input and up to 384K output at aggressive API pricing—popular for high-volume reasoning and coding workloads. Older DeepSeek R1 deployments at 128K may still appear in legacy configs; migrate to V4 IDs when upgrading.

How context grows with each chat message

Multi-turn chat feels incremental—you send one new message—but the client usually resends the full conversation on every turn unless it explicitly trims or summarizes. Context is cumulative.

Turn 1:  2,000 tokens in  →  500 tokens out
Turn 2:  2,000 + 500 + 1,500 (new user msg) = 4,000 in  →  600 out
Turn 3:  4,000 + 600 + 800 (new) = 5,400 in  →  ...

Each API call typically includes:

  1. System prompt — persona, rules, project instructions (resent every turn)
  2. Tool / function definitions — JSON schemas for OpenAI functions, MCP tools, etc. (fixed overhead on every call)
  3. Retrieved or attached context — RAG chunks, @-referenced files, MCP resources
  4. Full message history — prior user and assistant turns (unless the client drops old ones)
  5. Current user message
  6. Assistant replies — after generation, they become history for the next turn

So a 30-turn debugging session is not "30 small messages." It is one growing payload where turn 30 pays for turns 1–29 again, plus any files still attached.

What clients do when space runs out

  • Truncation — drop oldest messages when near the limit
  • Summarization — compress early turns into a short recap (Cursor, ChatGPT, and others do variants of this)
  • New session — fresh window; no memory unless the product stores facts externally

Separate sessions matter. A new chat resets the window. Product-level "memory" features store snippets outside the token budget—they are not the same as a larger context window.

For structured context that loads on demand instead of living permanently in the system prompt, see What is the MCP Protocol?—especially resources for read-only attachments and tools for actions without inlining entire APIs into every request.

Tips for using context economically

A larger window is not free: you pay in tokens, latency, and often quality as prompts grow. These practices keep assistants fast and reliable.

Before the conversation

  • Start a new chat when the topic shifts. Do not continue a 50-turn bug hunt when you switch to unrelated feature work—the old turns add cost and noise.
  • Put stable rules in the repo, not in repeated prompts. Cursor rules, AGENTS.md, and spec files survive across sessions.
  • Register only the tools you need. Each MCP or function definition adds fixed tokens to every call; a "god server" with forty tools taxes every prompt.

During the conversation

  • Scope attachments. Paste the function, not the repository. Reference file paths and line ranges when the IDE can index them.
  • Summarize and continue. Ask for a compact handoff block (decisions, open questions, next steps), then open a fresh chat with that summary as the seed.
  • One task per thread. Matches disciplined vibe coding loops: small prompts, immediate verification, less cruft in history.
  • Prefer concise output formats. Request JSON, bullet lists, or diffs when a long prose answer wastes tokens you will resend on the next turn.

Architecture-level

  • RAG over dump. Retrieve top-k relevant chunks instead of loading entire document sets into the prompt. Input then scales with relevance, not corpus size.
  • MCP resources on demand. Fetch OpenAPI specs, schemas, or runbooks when needed; do not inline 200-page PDFs in the system prompt.
  • External memory. Tickets, ADRs, and specs live in Jira, GitHub, or Notion—not in chat history. Link to them; let tools pull fresh state.
  • Right-size the model. Do not route a rename refactor through a 1M-token tier if a 200K model suffices. Match window and cost to task depth.
Technique Effect
New chat + summary handoff Drops stale turns; keeps decisions
Partial paste / file scope Less input per turn
Fewer registered tools Lower fixed overhead every call
RAG vs full document Input scales with relevance, not file size
Repo-local specs (SDD) Rules not duplicated in every message

Summary

The context window is the per-call token budget that bounds what a model can see at once. In chat products, that budget fills cumulatively—system prompt, tools, attachments, and full history are resent unless the client trims them. Top commercial models now span 128K to 1M+ tokens (and far more for self-hosted Llama 4 Scout), but effective usefulness often peaks well below the advertised maximum. Design workflows so durable knowledge lives outside the window—in repos, specs, RAG indexes, and MCP resources—and treat each chat as a scoped, disposable working session.

Next steps

Related articles