The total number of tokens (input + output) the model can process at once.
The context window is the model's total memory for a single request. It includes everything: your system prompt, conversation history, user message, and the generated response. Exceed it and the request fails or older context gets truncated.
Transformers process all tokens in the context simultaneously using self-attention. Every token can attend to every other token. This is powerful but computationally expensive, scaling quadratically with sequence length. Larger context windows need more memory and compute.
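The quadratic cost above can be sketched with simple arithmetic. This is an illustrative back-of-envelope calculation, not a measurement of any real model: it assumes fp16 (2 bytes per value) and ignores heads, layers, and optimizations like flash attention.

```python
# Why attention cost grows quadratically: one attention score matrix
# is seq_len x seq_len values. Assumes 2 bytes per value (fp16);
# illustrative only, ignoring heads/layers and memory-efficient kernels.

def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Memory for a single seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_value

for n in (8_192, 32_768, 131_072):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:8.3f} GiB per attention matrix")
```

Doubling the sequence length quadruples the matrix: 32K tokens already needs 2 GiB per score matrix under these assumptions, which is why long-context models rely on attention optimizations.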
For chat: 8-32K is usually sufficient. For document analysis and RAG: 32-128K. For codebase-level tasks: 128K-1M. Monitor your actual token usage because most conversations use far less than the maximum.
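A minimal sketch of monitoring usage, assuming the common ~4 characters per token heuristic for English text. Exact counts require the provider's tokenizer (e.g. tiktoken for OpenAI models); the conversation strings here are hypothetical.

```python
# Crude token-usage monitor. Assumes ~4 characters per token, a rough
# heuristic for English; use the provider's tokenizer for exact counts.

def estimate_tokens(text: str) -> int:
    """Heuristic estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

conversation = [
    "You are a helpful assistant.",
    "Summarize the attached report in three bullet points.",
]
used = sum(estimate_tokens(m) for m in conversation)
print(f"~{used} tokens used of a 32K window")
```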
Assuming a larger context window automatically means better results: models often perform worse on information buried in the middle of very long contexts (the "lost in the middle" problem). Another common mistake is not accounting for output tokens, which eat into the same context budget as the input.
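The output-budget pitfall comes down to simple arithmetic: whatever the input consumes is no longer available for the response. A sketch, with a hypothetical safety margin to absorb tokenizer estimation error:

```python
# Output tokens available after the input fills part of the window.
# The safety_margin is a hypothetical buffer for estimation error.

def max_output_tokens(window: int, input_tokens: int,
                      safety_margin: int = 256) -> int:
    """Tokens left in the window for the model's response."""
    return max(0, window - input_tokens - safety_margin)

# A 128K window with a 120K-token prompt leaves little room to respond:
print(max_output_tokens(window=128_000, input_tokens=120_000))
```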
GPT-4o: 128K, Claude Opus/Sonnet: 200K, Gemini 2.5 Pro: 1M, DeepSeek R1: 128K. As a rule of thumb, 1K tokens is roughly 750 English words.