Sets the maximum number of tokens the model can generate in a single response.
max_tokens caps the length of the model's response. Once the model generates this many tokens, it stops, even mid-sentence. This parameter controls cost (you pay per output token) and prevents runaway responses.
The model generates tokens one at a time until it either produces a stop token (natural end of response) or hits the max_tokens limit. Setting max_tokens to 100 means you will never get more than 100 tokens back, but you might get fewer if the model finishes its thought earlier.
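The stop-token-or-cap behavior can be sketched with a toy decode loop (no real model involved; `generate` and the token list are illustrative). The two return reasons mirror the finish reasons most APIs report, e.g. "stop" vs "length".

```python
STOP = "<eos>"  # stand-in for the model's stop token

def generate(scripted_tokens, max_tokens):
    """Emit tokens one at a time until a stop token or the cap is hit."""
    out = []
    for tok in scripted_tokens:
        if tok == STOP:
            return out, "stop"      # natural end of response
        if len(out) >= max_tokens:
            return out, "length"    # cut off by max_tokens
        out.append(tok)
    return out, "stop"

reply = ["The", "answer", "is", "42", ".", STOP]
print(generate(reply, max_tokens=100))  # finishes naturally: 5 tokens, reason "stop"
print(generate(reply, max_tokens=3))    # truncated mid-sentence: reason "length"
```

A finish reason of "length" in a real API response is the signal that the cap, not the model, ended the reply.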
Set it just above your expected response length. For short answers: 50-200. For code generation: 2000-4000. For long-form content: 4000-8000. Always set this explicitly in production to prevent cost surprises.
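As a concrete sketch of setting it explicitly, here is a request payload in the OpenAI chat-completions shape (the model name and prompt are placeholders). Anthropic's Messages API also uses `max_tokens`; Gemini calls the equivalent field `maxOutputTokens`.

```python
# Illustrative request payload; field names follow the OpenAI
# chat-completions API, other providers differ slightly.
payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "Summarize this in two sentences: ..."},
    ],
    # Short answer expected, so cap just above the expected length.
    "max_tokens": 150,
}
print(payload["max_tokens"])
```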
Common mistakes: setting max_tokens too low and getting cut-off responses; not setting it at all and getting unexpectedly long (expensive) responses; confusing max_tokens (which limits output only) with the context window (which covers input and output combined).
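The last confusion comes down to simple budget arithmetic: the context window covers prompt plus completion, so the output you can actually request is whatever the prompt leaves over, further capped by the model's per-response output limit. The numbers below are illustrative, not any specific model's real limits.

```python
CONTEXT_WINDOW = 128_000   # total budget: prompt + completion tokens
MAX_OUTPUT = 16_000        # model's hard per-response output cap

def usable_max_tokens(prompt_tokens):
    """Largest max_tokens value that still fits in the context window."""
    room = CONTEXT_WINDOW - prompt_tokens
    return max(0, min(room, MAX_OUTPUT))

print(usable_max_tokens(10_000))   # short prompt: limited by MAX_OUTPUT -> 16000
print(usable_max_tokens(120_000))  # long prompt leaves only 8000 tokens of room
```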
Varies by model and changes frequently. As rough reference points: GPT-4o supports up to 16K output tokens, recent Claude models up to 128K, and Gemini up to 8K. Check the provider's current documentation before relying on a limit.