Context window
KON-tekst WIN-doh
The maximum amount of text an LLM can process in a single request. Measured in tokens. Bigger windows handle more information at once.
The context window is the total amount of text an LLM can hold in a single conversation. Everything goes in the window: your system prompt, the conversation history, any documents you attach, and the model's response. If the total exceeds the window, the model either refuses the request or drops earlier content.
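Because the prompt, history, documents, and response all share one budget, a common pre-flight check is to estimate the token count before sending a request. A minimal sketch, using the rough 4-characters-per-token heuristic (a real tokenizer such as tiktoken gives exact counts; the function names and the reserve figure here are illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_window(system_prompt: str, history: list[str], documents: list[str],
                   window: int = 200_000, reserve_for_response: int = 4_000) -> bool:
    """Everything shares one window: system prompt, conversation history,
    attached documents, plus room reserved for the model's response."""
    used = estimate_tokens(system_prompt)
    used += sum(estimate_tokens(m) for m in history)
    used += sum(estimate_tokens(d) for d in documents)
    return used + reserve_for_response <= window
```

If the check fails, the usual options are trimming older history, summarizing documents, or retrieving only the relevant passages instead of attaching everything.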
Context windows have grown fast. GPT-3 worked with 4,096 tokens (roughly 3,000 words). GPT-4 Turbo raised that to 128,000 tokens (roughly a 300-page book). Claude offers 200,000 tokens. Google's Gemini 1.5 Pro supports up to 2 million tokens.
But bigger is not always better. Larger context windows cost more per request, since providers bill by the token. Models can also lose accuracy on information buried in the middle of very long contexts, a phenomenon researchers call "lost in the middle." For most practical applications, retrieval-augmented generation (RAG) is more cost-effective than stuffing everything into the context window.
Examples
A developer analyzes an entire codebase.
Claude's 200,000-token context window can hold roughly 150,000 words of code. A developer pastes 40 files from a Next.js project and asks, "Find all the places where we handle authentication." The model reads the entire codebase in one pass.
A legal team reviews a contract.
A 50-page merger agreement is roughly 25,000 tokens and fits easily in modern context windows. The lawyer pastes the full contract and asks the model to identify non-standard indemnification clauses.
Context window versus RAG tradeoffs.
A support bot could stuff the entire 500-page product manual (250k tokens) into Gemini's context window. Or it could use RAG to retrieve only the 3 most relevant pages (1,500 tokens) and use a smaller, cheaper model. The RAG approach costs 99% less per query.
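The cost gap in the support-bot example is simple arithmetic on input tokens. A sketch with a placeholder price (the dollar figure is made up; substitute your provider's actual per-token rate):

```python
PRICE_PER_MILLION_INPUT_TOKENS = 1.0  # hypothetical rate, in dollars

def query_cost(input_tokens: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-token cost of one request at the given per-million-token price."""
    return input_tokens / 1_000_000 * price_per_million

full_manual = query_cost(250_000)  # stuff the whole 500-page manual every query
rag = query_cost(1_500)            # retrieve only the 3 relevant pages
savings = 1 - rag / full_manual    # fraction saved per query
```

Here `savings` works out to 0.994, the roughly 99% reduction cited above, and the ratio holds at any per-token price because it depends only on the token counts.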
Related terms
Token
The smallest unit of text an LLM processes. Roughly 4 characters or 3/4 of a word. Tokens determine cost and context limits.
Retrieval-augmented generation (RAG)
Fetching relevant data and feeding it to an LLM so the response is grounded in real, current information instead of training data alone.
Large language model (LLM)
A neural network trained on massive text data to generate and understand language. The technology behind ChatGPT, Claude, and Gemini.
Prompt engineering
Writing instructions that get the best output from an AI model. The difference between a useless response and a useful one.

Want the complete playbook?
Picks and Shovels is the definitive guide to developer marketing. Amazon #1 bestseller with practical strategies from 30 years of marketing to developers.