With a standard API call, you wait until the entire response is generated before anything comes back. That’s fine for background jobs, but your users are staring at a blank screen the whole time. Streaming flips this around. You get tokens the moment they’re produced, so your users see the response being typed out in real time — just like ChatGPT. Under the hood, Mavera uses Server-Sent Events (SSE) to push each token as it’s ready.
## Quick Example
Use `client.responses.stream()` and iterate over the named events as they arrive, instead of reading a single response.
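Here’s a minimal sketch of that loop, assuming Mavera is accessed through the OpenAI Python SDK (as the checklist later on this page assumes); the base URL, API key, and model name are placeholders:

```python
from openai import OpenAI

# Placeholder base URL, key, and model name; substitute your own values.
client = OpenAI(base_url="https://api.mavera.io/v1", api_key="YOUR_API_KEY")

with client.responses.stream(
    model="mavera-large",  # placeholder model name
    input="Explain Server-Sent Events in one paragraph.",
) as stream:
    for event in stream:
        # Each text token arrives as an output_text.delta event.
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)

    # Once the loop ends, the assembled Response object is available.
    final_response = stream.get_final_response()
```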
## How Streaming Works
When you stream, the API doesn’t wait to finish generating. Instead, it opens a long-lived HTTP connection and pushes Server-Sent Events — one per token (or small group of tokens). Each event has a `type` that tells you what happened. The final `response.completed` event signals the stream is done and includes usage data.
The connection stays open until the model finishes or an error occurs. Your client reads events as they arrive, so there’s no polling.
Streaming doesn’t change what the model generates — you get the exact same output. It only changes when you receive it.
## Event Structure
Each SSE event is a named event with a `type` field. Here are the key events you’ll encounter:
| Event Type | Description |
|---|---|
| `response.created` | Response object created — streaming has started |
| `response.output_item.added` | A new output item (text, function call) has been added |
| `response.output_text.delta` | A text token — read it from `event.delta` |
| `response.output_text.done` | Text generation for the current item is complete |
| `response.output_item.done` | The current output item is fully complete |
| `response.completed` | Response is finished — includes full usage data |
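As a sketch, a single loop can dispatch on those types (reusing the `client` from the Quick Example; the exact attribute path for usage data is an assumption):

```python
with client.responses.stream(
    model="mavera-large",  # placeholder model name
    input="Write a haiku about latency.",
) as stream:
    for event in stream:
        if event.type == "response.created":
            print("stream started")
        elif event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)  # a single text token
        elif event.type == "response.output_text.done":
            print()                                 # this item's text is complete
        elif event.type == "response.completed":
            # The final event carries the full response, including usage data.
            print("total tokens:", event.response.usage.total_tokens)
```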
## Building a Chat UI
In a real application you need the full response text after streaming finishes — for storing in a database, passing to the next API call, or displaying in a conversation thread. Accumulate deltas as they arrive: append each `event.delta` to your UI state and let your framework re-render. In React, that looks like appending to a `useState` string inside the loop.
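The same accumulation works outside a UI, for example when you need the full text to persist after streaming. A minimal sketch, reusing the assumed `client` and placeholder model from above:

```python
def stream_and_collect(prompt: str) -> str:
    """Stream a response and return the accumulated text."""
    full_text = ""
    with client.responses.stream(
        model="mavera-large",  # placeholder model name
        input=prompt,
    ) as stream:
        for event in stream:
            if event.type == "response.output_text.delta":
                full_text += event.delta  # append each token as it arrives
    return full_text  # ready to store or pass to the next call
```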
## Streaming with Structured Outputs
Structured outputs work with streaming. The JSON arrives token by token just like plain text. You won’t have valid JSON until the stream finishes, so accumulate everything, then parse once at the end.

You can show a live JSON preview while streaming by attempting `JSON.parse()` on each accumulated chunk. Libraries like partial-json can parse incomplete JSON for real-time UI updates.
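A sketch of accumulate-then-parse, reusing the assumed `client`; the structured-output parameters themselves are omitted here:

```python
import json

json_text = ""
with client.responses.stream(
    model="mavera-large",  # placeholder model name
    input="List three HTTP status codes as JSON.",
    # ... your structured-output / schema parameters go here ...
) as stream:
    for event in stream:
        if event.type == "response.output_text.delta":
            json_text += event.delta  # not valid JSON until the stream ends

data = json.loads(json_text)  # parse once, after the final token
```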
## Streaming with Function Calling

When the model decides to call a tool, the function name and arguments stream in as events. You’ll receive `response.function_call_arguments.delta` events with argument fragments. Accumulate them the same way you accumulate text content.
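A sketch of collecting argument fragments, assuming those delta events expose the fragment on `event.delta` the same way text deltas do (tool definitions omitted):

```python
import json

args_text = ""
with client.responses.stream(
    model="mavera-large",  # placeholder model name
    input="What's the weather in Oslo?",
    # tools=[...],  # your tool definitions go here
) as stream:
    for event in stream:
        if event.type == "response.function_call_arguments.delta":
            args_text += event.delta  # argument fragments, in order

arguments = json.loads(args_text)  # e.g. {"city": "Oslo"}
```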
## Error Handling
Streams can fail mid-way. A network hiccup, a server timeout, or a client disconnect can leave you with a partial response. Here’s how to handle the common cases.

### Connection Drops and Timeouts
Wrap your stream in a try/catch to handle broken connections gracefully. Decide whether to retry (if idempotent) or surface the partial response to the user.
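A sketch of that wrapper in Python, reusing the assumed `client`; `APIConnectionError` and `APITimeoutError` come from the OpenAI SDK, and `handle_partial_response` is a hypothetical helper:

```python
from openai import APIConnectionError, APITimeoutError

partial_text = ""
try:
    with client.responses.stream(
        model="mavera-large",  # placeholder model name
        input="Summarize this conversation.",
    ) as stream:
        for event in stream:
            if event.type == "response.output_text.delta":
                partial_text += event.delta
except (APIConnectionError, APITimeoutError):
    # The connection dropped mid-stream. Retry if the request is idempotent,
    # or surface `partial_text` to the user with a notice.
    handle_partial_response(partial_text)  # hypothetical helper
```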
### Checklist

#### Set a timeout on the client
The OpenAI SDK lets you pass `timeout` (in seconds for Python, milliseconds for JS). Without a timeout, a stalled connection can hang forever. 60 seconds is a reasonable default.
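For example, in Python (placeholder base URL and key):

```python
from openai import OpenAI

# 60-second client-wide timeout; it applies to streaming requests as well.
client = OpenAI(base_url="https://api.mavera.io/v1", api_key="YOUR_API_KEY", timeout=60.0)
```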
#### Watch for empty streams
If the very first event errors out, you’ll get an exception before any content arrives. Handle this the same as a non-streaming API error — retry or surface the error to the user.
#### Handle partial JSON in structured outputs
If the stream drops while returning JSON, you’ll have invalid JSON. Don’t try to parse it — surface a user-friendly error and retry the request.
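One way to structure that, as a sketch: only parse if you saw `response.completed`, otherwise surface an error and retry (`show_retry_error` is a hypothetical helper):

```python
import json
from openai import APIConnectionError

json_text = ""
completed = False
try:
    with client.responses.stream(
        model="mavera-large",  # placeholder model name
        input="Return the result as JSON.",
    ) as stream:
        for event in stream:
            if event.type == "response.output_text.delta":
                json_text += event.delta
            elif event.type == "response.completed":
                completed = True
except APIConnectionError:
    pass  # dropped mid-stream; `completed` stays False

if completed:
    data = json.loads(json_text)
else:
    show_retry_error()  # hypothetical helper: don't parse partial JSON
```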
#### Rate limit errors still return 429
If you’re rate-limited, the streaming request fails before any events are sent. You’ll get a `RateLimitError` (Python) or a response with `status: 429` (JS). Handle this with exponential backoff, same as non-streaming.
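A basic exponential-backoff sketch in Python (retry count and delays are illustrative):

```python
import time
from openai import RateLimitError

def stream_with_backoff(prompt: str, max_retries: int = 5) -> str:
    delay = 1.0
    for _ in range(max_retries):
        try:
            text = ""
            with client.responses.stream(
                model="mavera-large",  # placeholder model name
                input=prompt,
            ) as stream:
                for event in stream:
                    if event.type == "response.output_text.delta":
                        text += event.delta
            return text
        except RateLimitError:
            time.sleep(delay)  # wait, then retry with a doubled delay
            delay *= 2
    raise RuntimeError("still rate-limited after retries")
```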
## When to Use Streaming

Streaming isn’t always the right choice. Here’s a quick decision guide:

| Use Case | Streaming | Standard |
|---|---|---|
| Chat UIs and conversational apps | Yes — users see responses instantly | No — awkward delay |
| Long-form content (articles, reports) | Yes — show progress on long generations | Depends on context |
| Batch processing and pipelines | No — overhead of event handling isn’t worth it | Yes — simpler code |
| Structured outputs (JSON) | Either — stream for UX, standard for simplicity | Either |
| Function calling | Either — stream to show “thinking” state | Either |
| Webhooks and async workflows | No — you need the full response in one payload | Yes |
## See Also
- **Responses API**: Full API reference for responses, including all parameters
- **Structured Outputs**: JSON mode and JSON Schema for typed responses
- **Error Handling**: Complete error codes and retry strategies
- **Rate Limits**: Request limits, headers, and backoff patterns