
Obvious Pauses Between Text Segments in Current OpenAITTSModel Implementation Affect Speech Fluency #493

Open
@mikuh

Description


Please read this first

  • Have you read the docs? Agents SDK docs: yes
  • Have you searched for related issues? Others may have had similar requests: yes

Describe the feature

What is the feature you're requesting? How would it work? Please provide examples and details if possible.

Problem Description

The current implementation of OpenAITTSModel only supports synthesizing a single, complete piece of text per call, as shown below:

async def run(self, text: str, settings: TTSModelSettings) -> AsyncIterator[bytes]:
    ...

In the upper-layer business logic, the LLM usually produces text incrementally. As soon as a segment is generated, the TTS layer immediately calls run to synthesize and play it, for example:

for text in stream_output:
    async for chunk in tts.run(text, settings):
        yield chunk

This approach leads to two problems:

  1. Each text segment triggers a new, independent OpenAI TTS request.
  2. The resulting playback has noticeable pauses between segments.

User Experience Problem

This issue is particularly noticeable in scenarios such as streaming LLM conversations or long-form content reading, where the content is supposed to sound continuous but instead feels unnaturally fragmented because of the frequent pauses.


Optimization Goal

Maintain a persistent TTS WebSocket connection.

Whenever new text is generated by the LLM:

  • Directly send the incremental text to the TTS WebSocket stream.
  • The backend continuously pushes audio chunks without restarting or reconnecting.
  • The audio player can play these chunks seamlessly in real time, without waiting for new connections or full-text input (see the sketch after this list).
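
Below is a rough sketch, in Python, of the kind of session-style interface this request is describing. Everything in it is hypothetical: StreamingTTSSession, send_text, end_input, stream_audio, and speak are placeholder names for illustration and are not part of the current Agents SDK or any OpenAI API.

import asyncio
from collections.abc import AsyncIterator


class StreamingTTSSession:
    """Hypothetical session bound to one persistent TTS WebSocket connection."""

    async def send_text(self, text: str) -> None:
        """Push an incremental text segment over the already-open connection."""
        raise NotImplementedError

    async def end_input(self) -> None:
        """Signal that no more text will be sent for this session."""
        raise NotImplementedError

    def stream_audio(self) -> AsyncIterator[bytes]:
        """Yield audio chunks continuously, independent of text segment boundaries."""
        raise NotImplementedError


async def speak(stream_output: AsyncIterator[str],
                session: StreamingTTSSession) -> AsyncIterator[bytes]:
    """Feed LLM output into the open session while audio plays back as one stream."""

    async def feed() -> None:
        async for text in stream_output:
            await session.send_text(text)  # no new TTS request per segment
        await session.end_input()

    feeder = asyncio.create_task(feed())
    async for chunk in session.stream_audio():
        yield chunk  # playback stays continuous across segments
    await feeder

With an interface shaped like this, the earlier per-segment loop no longer opens a new request for every piece of text, and the player receives one uninterrupted audio stream.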
