Reuses previously processed prompt prefixes to reduce latency and cost.
Prompt caching stores the computed key-value (KV) cache from previously processed prompt prefixes. When a subsequent request shares the same prefix, the provider skips re-processing those tokens, saving time and money.
When you send a request, the provider checks if the beginning of your prompt matches a recently cached prefix. If it does, it loads the cached computation and only processes new tokens. This can cut time-to-first-token by 50-80% and input cost by 50-90%.
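The prefix-matching behavior can be illustrated with a toy simulation. Everything here (the PrefixCache class, hashing prefixes as cache keys) is illustrative, not how any provider actually implements KV caching internally:

```python
# Toy model of prefix caching: cached state is keyed by a hash of the
# prompt prefix, and a request reuses the longest matching prefix.
import hashlib

class PrefixCache:
    def __init__(self):
        self._store = {}  # prefix hash -> simulated KV state

    def _key(self, tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def process(self, tokens):
        """Return (tokens served from cache, tokens newly computed)."""
        # Find the longest cached prefix, checking longest-first.
        for cut in range(len(tokens), 0, -1):
            if self._key(tokens[:cut]) in self._store:
                cached = cut
                break
        else:
            cached = 0
        # "Compute" the full prompt and cache every prefix of it.
        for cut in range(1, len(tokens) + 1):
            self._store[self._key(tokens[:cut])] = True
        return cached, len(tokens) - cached

cache = PrefixCache()
system = ["You", "are", "a", "helpful", "assistant."]
print(cache.process(system + ["Question", "one?"]))  # (0, 7): cold cache
print(cache.process(system + ["Question", "two?"]))  # (6, 1): prefix reused
```

The second request recomputes only the one token that differs; real providers additionally impose minimum cacheable lengths and expiry, which this sketch omits.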
Applications with consistent system prompts across requests. Multi-turn conversations where history grows but the system prompt stays the same. Batch processing with shared instructions. RAG applications with the same document context.
Expecting a cache hit when the prompt prefix changes even slightly: a single differing token invalidates everything after it. Placing volatile content before stable content (caching works on prefixes only, so the stable part must come first). Assuming cache entries persist indefinitely (they expire after minutes to hours).
Anthropic: explicit, opt-in via cache_control breakpoints; cache reads are discounted about 90% (cache writes carry a surcharge). OpenAI: automatic for prompts over 1,024 tokens on GPT-4o and newer, 50% discount on cached input tokens. Google: explicit context caching API for Gemini, with a configurable TTL.
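For the explicit-opt-in case, an Anthropic-style request marks the cacheable prefix with a cache_control block, following the Messages API request shape. This sketch only builds the request body (no network call); the model name and instruction text are placeholders:

```python
# Anthropic-style request body with an explicit cache breakpoint.
# Everything up to and including the marked system block is cached.
long_instructions = "You are a legal assistant. " + "Policy text... " * 100

request = {
    "model": "claude-sonnet-4-20250514",  # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_instructions,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize clause 4."}
    ],
}
```

Subsequent requests that reuse the same system block up to the breakpoint pay the discounted cache-read rate for those tokens; only the messages after the breakpoint are processed fresh.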