Reuses previously processed prompt prefixes to reduce latency and cost.
Prompt caching stores the computed key-value (KV) cache from previously processed prompt prefixes. When a subsequent request shares the same prefix, the provider skips re-processing those tokens, saving time and money.
When you send a request, the provider checks if the beginning of your prompt matches a recently cached prefix. If it does, it loads the cached computation and only processes new tokens. This can cut time-to-first-token by 50-80% and input cost by 50-90%.
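The prefix-matching behavior can be illustrated with a toy simulation. Everything here (the PrefixCache class, hashing prefixes as cache keys) is illustrative, not how any provider actually implements KV caching internally:

```python
# Toy model of prefix caching: cached state is keyed by a hash of the
# prompt prefix, and a request reuses the longest matching prefix.
import hashlib

class PrefixCache:
    def __init__(self):
        self._store = {}  # prefix hash -> simulated KV state

    def _key(self, tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def process(self, tokens):
        """Return (tokens served from cache, tokens newly computed)."""
        # Find the longest cached prefix, checking longest-first.
        for cut in range(len(tokens), 0, -1):
            if self._key(tokens[:cut]) in self._store:
                cached = cut
                break
        else:
            cached = 0
        # "Compute" the full prompt and cache every prefix of it.
        for cut in range(1, len(tokens) + 1):
            self._store[self._key(tokens[:cut])] = True
        return cached, len(tokens) - cached

cache = PrefixCache()
system = ["You", "are", "a", "helpful", "assistant."]
print(cache.process(system + ["Question", "one?"]))  # (0, 7): cold cache
print(cache.process(system + ["Question", "two?"]))  # (6, 1): prefix reused
```

The second request recomputes only the one token that differs; real providers additionally impose minimum cacheable lengths and expiry, which this sketch omits.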
Applications with consistent system prompts across requests. Multi-turn conversations where history grows but the system prompt stays the same. Batch processing with shared instructions. RAG applications with the same document context.
Expecting a cache hit when the prompt prefix changes even slightly: a single differing token invalidates everything after it. Placing volatile content before stable content (caching works on prefixes only, so the stable part must come first). Assuming cache entries persist indefinitely (they expire after minutes to hours).
Anthropic: explicit, opt-in via cache_control breakpoints; cache reads are discounted about 90% (cache writes carry a surcharge). OpenAI: automatic for prompts over 1,024 tokens on GPT-4o and newer, 50% discount on cached input tokens. Google: explicit context caching API for Gemini, with a configurable TTL.
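For the explicit-opt-in case, an Anthropic-style request marks the cacheable prefix with a cache_control block, following the Messages API request shape. This sketch only builds the request body (no network call); the model name and instruction text are placeholders:

```python
# Anthropic-style request body with an explicit cache breakpoint.
# Everything up to and including the marked system block is cached.
long_instructions = "You are a legal assistant. " + "Policy text... " * 100

request = {
    "model": "claude-sonnet-4-20250514",  # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_instructions,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize clause 4."}
    ],
}
```

Subsequent requests that reuse the same system block up to the breakpoint pay the discounted cache-read rate for those tokens; only the messages after the breakpoint are processed fresh.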