For the lowest latency and most interactive applications, use the WebSocket API. It lets you stream text input and receive audio chunks in real time over a single persistent connection, which is ideal for chatbots, live assistants, and voice agents. For one-shot requests where you only need a stream of audio bytes back, see Streaming instead.
## Establish a connection

Connect to: `wss://api.together.ai/v1/audio/speech/websocket`
### Authentication

- Include your API key as a query parameter: `?api_key=<your_api_key>`.
- Or use the `Authorization` header when establishing the WebSocket connection.
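As a minimal sketch, here is how you might open the connection with Python's `websockets` package. The URL and both authentication options come from this page; the library choice and the assumption of a Bearer scheme for the header are ours:

```python
import asyncio

import websockets

API_KEY = "your_api_key"
URL = "wss://api.together.ai/v1/audio/speech/websocket"

async def main():
    # Option 1: pass the API key as a query parameter.
    async with websockets.connect(f"{URL}?api_key={API_KEY}") as ws:
        print("connected via query parameter")

    # Option 2: send it in the Authorization header instead. The keyword is
    # additional_headers in websockets >= 14 (extra_headers in older versions);
    # the Bearer scheme is an assumption here.
    async with websockets.connect(
        URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}
    ) as ws:
        print("connected via Authorization header")

asyncio.run(main())
```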
## Client-to-server messages
### Append text to buffer

### Commit buffer

### Clear buffer

### Update session parameters

If no `context_id` is specified, all contexts are updated.
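The payload schemas are not reproduced on this page. As a rough sketch, client messages are JSON objects with a `type` field; `tts_session.updated` is the type named elsewhere on this page for session updates, while the append/commit/clear type strings below are placeholders, not the real names:

```python
import json

# Update session parameters. "tts_session.updated" is named on this page;
# omitting context_id applies the update to every context.
update_msg = json.dumps({"type": "tts_session.updated", "voice": "af_alloy"})

# Placeholder type strings; look up the real append/commit/clear message
# types in the API reference.
append_msg = json.dumps({"type": "<append-type>", "text": "Hello, world."})
commit_msg = json.dumps({"type": "<commit-type>"})
clear_msg = json.dumps({"type": "<clear-type>"})
```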
## Server-to-client messages
### Session created

### Text received acknowledgment

### Audio delta (streaming chunks)

### Audio complete

### Word timestamps

Sent when `alignment=word` is set. Contains word-level timing information for the generated audio.

### TTS error
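For instance, a client that enables `alignment=word` can watch for the `conversation.item.word_timestamps` event, the one server event type named on this page. The field layout in this sketch is an assumption:

```python
import json

def handle_event(raw: str) -> None:
    event = json.loads(raw)
    if event.get("type") == "conversation.item.word_timestamps":
        # The "words" field name is illustrative; consult the API reference
        # for the actual payload shape.
        for word in event.get("words", []):
            print(word)
```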
## WebSocket example
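Below is a minimal end-to-end sketch in Python with the `websockets` package. The URL, query parameters, model, and voice come from this page; the send/receive message schemas (the placeholder `<...>` type strings and the base64 `audio` field) are assumptions to adapt against the API reference:

```python
import asyncio
import base64
import json

import websockets

API_KEY = "your_api_key"
URL = (
    "wss://api.together.ai/v1/audio/speech/websocket"
    "?model=hexgrad/Kokoro-82M&voice=af_alloy"
    "&response_format=pcm&sample_rate=24000"
    f"&api_key={API_KEY}"
)

async def main():
    async with websockets.connect(URL) as ws:
        # Append text, then commit the buffer to trigger synthesis.
        # "<append-type>" and "<commit-type>" stand in for the real
        # client message types.
        await ws.send(json.dumps({"type": "<append-type>",
                                  "text": "Hello from the WebSocket API!"}))
        await ws.send(json.dumps({"type": "<commit-type>"}))

        # Write raw PCM chunks to disk as they stream in.
        with open("out.pcm", "wb") as f:
            async for raw in ws:
                event = json.loads(raw)
                if event.get("type") == "<audio-delta-type>":       # placeholder
                    f.write(base64.b64decode(event["audio"]))       # assumed field
                elif event.get("type") == "<audio-complete-type>":  # placeholder
                    break

asyncio.run(main())
```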
## WebSocket parameters

When establishing a WebSocket connection, you can configure:

| Parameter | Type | Description |
|---|---|---|
| model | string | The TTS model to use |
| voice | string | The voice for generation |
| response_format | string | Audio format: mp3, opus, aac, flac, wav, or pcm |
| speed | float | Playback speed (default: 1.0) |
| max_partial_length | integer | Character buffer length before triggering TTS generation |
| sample_rate | integer | The sample rate of the output audio in Hz (e.g., 24000, 44100) |
| language | string | The language or locale code for speech synthesis (e.g., en, fr, es). Locales are supported and must be lowercase (e.g., zh-hk for Cantonese) |
| alignment | string | Controls word-level timestamp generation. Set to word to receive conversation.item.word_timestamps events, or none to disable (default: none) |
| segment | string | Controls how text is segmented before synthesis. Options: sentence (default) splits on sentence boundaries; immediate processes text as soon as it arrives; never waits until the buffer is committed |
| extra_params | object | Additional model-specific parameters. Supported fields: pronunciation_dict (array), a list of pronunciation rules for specific characters or symbols. Each entry uses the format "<source>/<replacement>" (e.g., ["omg/oh my god"]) to override how the model pronounces matching tokens. |
You can pass these query parameters either in the WebSocket URL (e.g., `wss://api.together.ai/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy&sample_rate=24000&alignment=word`) or dynamically via the `tts_session.updated` event after the connection is established.
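For example, a dynamic update carrying a pronunciation rule might look like this. The `tts_session.updated` type and the `"<source>/<replacement>"` rule format come from this page; the exact nesting under `extra_params` is an assumption:

```python
import json

# Apply a pronunciation rule mid-session. The nesting of pronunciation_dict
# under extra_params is assumed from the parameter table above.
update = json.dumps({
    "type": "tts_session.updated",
    "extra_params": {"pronunciation_dict": ["omg/oh my god"]},
})
```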
## Multi-context support

You can manage multiple independent TTS streams over a single WebSocket connection using `context_id`. This is useful for applications handling multiple simultaneous conversations or characters.
- Add `context_id` to any client message to route it to a specific context.
- Messages without `context_id` use the `"default"` context.
- Each context maintains its own text buffer and voice settings.
- Cancel a specific context with the `context.cancel` message type (see the sketch after this list).
- Send `tts_session.updated` without a `context_id` to update all contexts at once.
- Maximum 100 contexts per connection.
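As a sketch of routing two contexts over one connection (`context_id`, `context.cancel`, and `tts_session.updated` are named above; the append message type is a placeholder):

```python
import json

# Route text to two independent contexts over one connection.
# "<append-type>" is a placeholder for the real append message type.
msg_a = json.dumps({"type": "<append-type>", "context_id": "caller", "text": "Hi!"})
msg_b = json.dumps({"type": "<append-type>", "context_id": "agent", "text": "Hello!"})

# Cancel one context without touching the other.
cancel_a = json.dumps({"type": "context.cancel", "context_id": "caller"})

# Update every context at once by omitting context_id.
update_all = json.dumps({"type": "tts_session.updated", "speed": 1.1})
```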
When a context is cancelled, the server confirms with a `context.cancelled` message.
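Its payload is not reproduced on this page; a minimal sketch, assuming it echoes the cancelled context's ID:

```python
# Only the "context.cancelled" type name appears on this page;
# the context_id echo is assumed.
{"type": "context.cancelled", "context_id": "caller"}
```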
## See also
- Text-to-speech overview for parameters, response formats, voices, and pricing.
- Streaming for HTTP-based streaming and raw byte output.