Receive model output token-by-token as it generates.
Streaming sends each generated token to your application as soon as it is produced, rather than buffering the entire response. This sharply reduces perceived latency: users see output begin to appear as soon as the first tokens are generated, instead of waiting for the whole completion to finish.
The API uses Server-Sent Events (SSE) to push incremental chunks, each typically containing one or a few tokens. Your client processes these chunks as they arrive. The stream ends with a terminator, such as OpenAI's `data: [DONE]` sentinel or a provider-specific stop event (Anthropic, for example, sends a `message_stop` event).
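A minimal sketch of the client side, assuming OpenAI-style chunk JSON (a `choices[0].delta.content` field) and the `[DONE]` sentinel; field names vary by provider, and `lines` stands in for whatever line iterator your HTTP client exposes:

```python
import json

def parse_sse_stream(lines):
    """Yield the text delta from each SSE data chunk, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and event-name lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta is not None:
            yield delta

# Example: two content chunks followed by the terminator
raw = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",  # SSE events are separated by blank lines
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_stream(raw)))  # prints "Hello"
```

Note that each `data:` line carries a complete JSON object on its own, which is what makes chunk-by-chunk parsing safe.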
Any user-facing chat interface where perceived speed matters. Long-form generation where waiting would frustrate users. Real-time applications where you want to start processing partial output immediately.
Not handling stream disconnections, which leave you with a silently truncated response. Trying to parse incomplete JSON from a streaming response before a chunk boundary. Assuming streaming changes pricing: it changes only delivery timing, and tokens are billed the same either way.
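One way to guard against the disconnection pitfall is to track whether the stream finished cleanly, so the caller can distinguish a complete answer from a truncated one. This is an illustrative sketch, not any SDK's API; `chunks` stands in for a provider's stream iterator, with `None` standing in for the end-of-stream signal:

```python
def consume_stream(chunks):
    """Accumulate streamed text deltas and report whether the stream
    completed. A disconnect mid-stream leaves complete=False, so the
    caller can retry or flag the response as partial."""
    buffer = []
    complete = False
    try:
        for chunk in chunks:
            if chunk is None:  # stand-in for the end-of-stream signal
                complete = True
                break
            buffer.append(chunk)
    except ConnectionError:
        pass  # network drop: fall through with complete=False
    return "".join(buffer), complete

def flaky():
    """Simulated stream that dies after one delta."""
    yield "partial "
    raise ConnectionError("connection reset")

text, ok = consume_stream(flaky())
print(ok, repr(text))  # False 'partial '
```

The same pattern covers the JSON pitfall: accumulate text deltas as plain strings during the stream, and only attempt structured parsing once `complete` is true.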
Supported by all major providers: OpenAI, Anthropic, and Google each enable it with a `stream=true` flag on the request. Non-streaming is the default everywhere.