The total number of tokens (input + output) the model can process at once.
The context window is the model's total memory for a single request. It includes everything: your system prompt, conversation history, user message, and the generated response. Exceed it and the request fails or older context gets truncated.
Transformers process all tokens in the context simultaneously using self-attention. Every token can attend to every other token. This is powerful but computationally expensive, scaling quadratically with sequence length. Larger context windows need more memory and compute.
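The quadratic cost above can be sketched with simple arithmetic. This is an illustrative back-of-envelope calculation, not a measurement of any real model: it assumes fp16 (2 bytes per value) and ignores heads, layers, and optimizations like flash attention.

```python
# Why attention cost grows quadratically: one attention score matrix
# is seq_len x seq_len values. Assumes 2 bytes per value (fp16);
# illustrative only, ignoring heads/layers and memory-efficient kernels.

def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Memory for a single seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_value

for n in (8_192, 32_768, 131_072):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:8.3f} GiB per attention matrix")
```

Doubling the sequence length quadruples the matrix: 32K tokens already needs 2 GiB per score matrix under these assumptions, which is why long-context models rely on attention optimizations.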
For chat: 8-32K is usually sufficient. For document analysis and RAG: 32-128K. For codebase-level tasks: 128K-1M. Monitor your actual token usage because most conversations use far less than the maximum.
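A minimal sketch of monitoring usage, assuming the common ~4 characters per token heuristic for English text. Exact counts require the provider's tokenizer (e.g. tiktoken for OpenAI models); the conversation strings here are hypothetical.

```python
# Crude token-usage monitor. Assumes ~4 characters per token, a rough
# heuristic for English; use the provider's tokenizer for exact counts.

def estimate_tokens(text: str) -> int:
    """Heuristic estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

conversation = [
    "You are a helpful assistant.",
    "Summarize the attached report in three bullet points.",
]
used = sum(estimate_tokens(m) for m in conversation)
print(f"~{used} tokens used of a 32K window")
```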
Assuming a larger context window automatically means better results: models often perform worse on information buried in the middle of very long contexts (the "lost in the middle" problem). Another common mistake is not accounting for output tokens, which eat into the same context budget as the input.
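The output-budget pitfall comes down to simple arithmetic: whatever the input consumes is no longer available for the response. A sketch, with a hypothetical safety margin to absorb tokenizer estimation error:

```python
# Output tokens available after the input fills part of the window.
# The safety_margin is a hypothetical buffer for estimation error.

def max_output_tokens(window: int, input_tokens: int,
                      safety_margin: int = 256) -> int:
    """Tokens left in the window for the model's response."""
    return max(0, window - input_tokens - safety_margin)

# A 128K window with a 120K-token prompt leaves little room to respond:
print(max_output_tokens(window=128_000, input_tokens=120_000))
```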
GPT-4o: 128K, Claude Opus/Sonnet: 200K, Gemini 2.5 Pro: 1M, DeepSeek R1: 128K. As a rule of thumb, 1K tokens is roughly 750 English words.