Every parameter you can tweak when calling an LLM API, explained with practical defaults and real trade-offs. Stop guessing, start configuring with confidence.
Control the randomness and diversity of model output
Set boundaries on input and output sizes
Enable specific model features and output formats
Fine-tune model behavior for specialized use cases
Temperature and max_tokens are the two parameters that matter most in typical applications. Temperature controls output randomness (lower for code, higher for creative text), and max_tokens caps response length, preventing runaway responses and controlling cost. Everything else is situational.
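A minimal sketch of how these two parameters slot into a request. The payload shape follows the OpenAI Chat Completions convention, and the model name is an example, not a recommendation; `build_request` is a hypothetical helper, not part of any SDK.

```python
def build_request(prompt, *, temperature=0.7, max_tokens=512):
    """Assemble a Chat Completions-style payload with the two
    highest-impact parameters: temperature (output randomness)
    and max_tokens (hard cap on generated tokens, which also
    caps cost)."""
    return {
        "model": "gpt-4o-mini",  # example model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # lower = more deterministic
        "max_tokens": max_tokens,    # prevents runaway responses
    }

# Code generation: near-deterministic, tightly bounded output.
code_req = build_request("Write a function to parse ISO dates.",
                         temperature=0.1, max_tokens=300)

# Creative writing: more randomness, a larger token budget.
story_req = build_request("Write a short story about a lighthouse.",
                          temperature=1.0, max_tokens=1200)
```

The same two knobs exist under slightly different names in other provider APIs (e.g. `max_output_tokens` in some), so the pattern transfers even if the field names do not.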
Pick one, not both. OpenAI explicitly recommends adjusting temperature OR top-p, not both simultaneously. Temperature is more intuitive for most developers. Use top-p when you want dynamic candidate pool sizing based on model confidence.
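One way to enforce "one, not both" is a small guard when assembling sampling parameters. This is an illustrative helper (the function name and defaults are assumptions, not a library API):

```python
def sampling_config(temperature=None, top_p=None):
    """Build sampling parameters, enforcing the guidance to
    adjust temperature OR top_p, never both at once."""
    if temperature is not None and top_p is not None:
        raise ValueError("Adjust temperature OR top_p, not both.")
    params = {}
    if temperature is not None:
        # Temperature rescales the entire token distribution.
        params["temperature"] = temperature
    if top_p is not None:
        # Nucleus sampling: candidate pool sized dynamically by
        # cumulative probability mass (model confidence).
        params["top_p"] = top_p
    return params

print(sampling_config(temperature=0.7))  # {'temperature': 0.7}
print(sampling_config(top_p=0.9))        # {'top_p': 0.9}
```

Failing fast here is cheaper than debugging the subtly odd outputs that combining both can produce.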
Use temperature 0-0.2 for code generation. You want deterministic, correct output, not creative variations. Some developers use temperature 0 (greedy decoding) for maximum consistency, though a small amount of randomness (0.1) can sometimes help the model escape repetitive or degenerate completions.
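A toy sampler makes the mechanics concrete: temperature divides the logits before softmax, so as it approaches 0 the distribution sharpens toward the single most likely token (greedy decoding). This is a self-contained sketch of the standard technique, not any provider's implementation.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=random.Random(0)):
    """Toy temperature sampling over raw logits.

    temperature == 0 is treated as greedy decoding (argmax);
    higher temperatures flatten the distribution, increasing
    the chance of lower-probability tokens.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the softmax distribution.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


logits = [2.0, 1.0, 0.5]
# At temperature 0, the argmax token (index 0) is always chosen.
assert all(sample_with_temperature(logits, 0) == 0 for _ in range(5))
```

At temperature 1.0 the same logits would occasionally yield index 1 or 2, which is exactly the variation you do not want in generated code.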
No. Core parameters like temperature and max_tokens are nearly universal, but advanced features vary. Function calling requires a model that supports it (recent GPT, Claude 3+, or Gemini models). Reasoning effort is only available on thinking-capable models (o-series, Claude with extended thinking, DeepSeek R1). Prompt caching availability and pricing differ by provider.