### Please read this first
- Have you read the docs? (Agents SDK docs) Yes.
- Have you searched for related issues? (Others may have had similar requests) Yes.
### Describe the feature
What is the feature you're requesting? How would it work? Please provide examples and details if possible.
### Problem Description

The current implementation of `OpenAITTSModel` only supports handling a single, complete text per call, as shown below:

```python
async def run(self, text: str, settings: TTSModelSettings) -> AsyncIterator[bytes]:
    ...
```
In the upper-layer business logic, the LLM usually outputs text incrementally. As soon as a piece of text is generated, the TTS layer immediately calls `run` to synthesize and play it, for example:

```python
for text in stream_output:
    # Each segment opens a brand-new TTS request.
    async for chunk in tts.run(text, settings):
        yield chunk
```
This approach leads to:

- Each `text` triggers a new, independent OpenAI TTS request.
- The resulting playback has noticeable pauses between segments of text.
### User Experience Problem

This issue is particularly noticeable in scenarios such as streaming LLM conversations or long-form content reading, where the audio should sound continuous but instead feels unnaturally fragmented by frequent pauses.
### Optimization Goal

Maintain a persistent TTS WebSocket connection. Whenever new text is generated by the LLM:

- Send the incremental text directly to the TTS WebSocket stream.
- The backend continuously pushes audio chunks back without restarting or reconnecting.
- The audio player plays these chunks seamlessly in real time, without waiting for new connections or full-text input.

A rough sketch of such an interface is shown below.
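As a rough illustration (not a concrete API proposal), the TTS model could accept an async stream of text fragments instead of a single string. Everything below is a hypothetical sketch: the `StreamingTTSModel` class, the `run_stream` method, and the assumed import path are not part of the current SDK.

```python
from collections.abc import AsyncIterator

# Assumption: TTSModelSettings is importable from the voice module.
from agents.voice import TTSModelSettings


class StreamingTTSModel:
    """Hypothetical interface: one persistent connection, many text fragments."""

    async def run_stream(
        self,
        text_stream: AsyncIterator[str],
        settings: TTSModelSettings,
    ) -> AsyncIterator[bytes]:
        # Intended behavior:
        #   1. Open a single TTS WebSocket connection.
        #   2. Forward each fragment from text_stream as soon as it arrives.
        #   3. Yield audio chunks continuously as the backend pushes them back.
        raise NotImplementedError
        yield  # unreachable; marks this method as an async generator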