Sets the maximum number of tokens the model can generate in a single response.
max_tokens caps the length of the model's response. Once the model generates this many tokens, it stops, even mid-sentence. This parameter controls cost (you pay per output token) and prevents runaway responses.
The model generates tokens one at a time until it either produces a stop token (natural end of response) or hits the max_tokens limit. Setting max_tokens to 100 means you will never get more than 100 tokens back, but you might get fewer if the model finishes its thought earlier.
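The stop-token-or-cap behavior can be sketched with a toy decode loop (no real model involved; `generate` and the token list are illustrative). The two return reasons mirror the finish reasons most APIs report, e.g. "stop" vs "length".

```python
STOP = "<eos>"  # stand-in for the model's stop token

def generate(scripted_tokens, max_tokens):
    """Emit tokens one at a time until a stop token or the cap is hit."""
    out = []
    for tok in scripted_tokens:
        if tok == STOP:
            return out, "stop"      # natural end of response
        if len(out) >= max_tokens:
            return out, "length"    # cut off by max_tokens
        out.append(tok)
    return out, "stop"

reply = ["The", "answer", "is", "42", ".", STOP]
print(generate(reply, max_tokens=100))  # finishes naturally: 5 tokens, reason "stop"
print(generate(reply, max_tokens=3))    # truncated mid-sentence: reason "length"
```

A finish reason of "length" in a real API response is the signal that the cap, not the model, ended the reply.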
Set it just above your expected response length. For short answers: 50-200. For code generation: 2000-4000. For long-form content: 4000-8000. Always set this explicitly in production to prevent cost surprises.
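As a concrete sketch of setting it explicitly, here is a request payload in the OpenAI chat-completions shape (the model name and prompt are placeholders). Anthropic's Messages API also uses `max_tokens`; Gemini calls the equivalent field `maxOutputTokens`.

```python
# Illustrative request payload; field names follow the OpenAI
# chat-completions API, other providers differ slightly.
payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "Summarize this in two sentences: ..."},
    ],
    # Short answer expected, so cap just above the expected length.
    "max_tokens": 150,
}
print(payload["max_tokens"])
```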
Common mistakes: setting max_tokens too low and getting cut-off responses; not setting it at all and getting unexpectedly long (expensive) responses; confusing max_tokens (which limits output only) with the context window (which covers input and output combined).
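The last confusion comes down to simple budget arithmetic: the context window covers prompt plus completion, so the output you can actually request is whatever the prompt leaves over, further capped by the model's per-response output limit. The numbers below are illustrative, not any specific model's real limits.

```python
CONTEXT_WINDOW = 128_000   # total budget: prompt + completion tokens
MAX_OUTPUT = 16_000        # model's hard per-response output cap

def usable_max_tokens(prompt_tokens):
    """Largest max_tokens value that still fits in the context window."""
    room = CONTEXT_WINDOW - prompt_tokens
    return max(0, min(room, MAX_OUTPUT))

print(usable_max_tokens(10_000))   # short prompt: limited by MAX_OUTPUT -> 16000
print(usable_max_tokens(120_000))  # long prompt leaves only 8000 tokens of room
```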
Varies by model and changes frequently. As rough reference points: GPT-4o supports up to 16K output tokens, recent Claude models up to 128K, and Gemini up to 8K. Check the provider's current documentation before relying on a limit.