For the lowest latency and most interactive applications, use the WebSocket API. It lets you stream text input and receive audio chunks in real time over a single persistent connection, which is ideal for chatbots, live assistants, and voice agents. For one-shot requests where you only need a stream of audio bytes back, see Streaming instead.
## Establish a connection

Connect to: `wss://api.together.ai/v1/audio/speech/websocket`
### Authentication

- Include your API key as a query parameter: `?api_key=<your_api_key>`.
- Or use the `Authorization` header when establishing the WebSocket connection.
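As a minimal sketch, here is how you might open the connection with Python's `websockets` package. The URL and both authentication options come from this page; the library choice and the assumption of a Bearer scheme for the header are ours:

```python
import asyncio

import websockets

API_KEY = "your_api_key"
URL = "wss://api.together.ai/v1/audio/speech/websocket"

async def main():
    # Option 1: pass the API key as a query parameter.
    async with websockets.connect(f"{URL}?api_key={API_KEY}") as ws:
        print("connected via query parameter")

    # Option 2: send it in the Authorization header instead. The keyword is
    # additional_headers in websockets >= 14 (extra_headers in older versions);
    # the Bearer scheme is an assumption here.
    async with websockets.connect(
        URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}
    ) as ws:
        print("connected via Authorization header")

asyncio.run(main())
```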
## Client-to-server messages
### Append text to buffer

### Commit buffer

### Clear buffer

### Update session parameters

If no `context_id` is specified, all contexts are updated.
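The payload schemas are not reproduced on this page. As a rough sketch, client messages are JSON objects with a `type` field; `tts_session.updated` is the type named elsewhere on this page for session updates, while the append/commit/clear type strings below are placeholders, not the real names:

```python
import json

# Update session parameters. "tts_session.updated" is named on this page;
# omitting context_id applies the update to every context.
update_msg = json.dumps({"type": "tts_session.updated", "voice": "af_alloy"})

# Placeholder type strings; look up the real append/commit/clear message
# types in the API reference.
append_msg = json.dumps({"type": "<append-type>", "text": "Hello, world."})
commit_msg = json.dumps({"type": "<commit-type>"})
clear_msg = json.dumps({"type": "<clear-type>"})
```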
## Server-to-client messages
### Session created

### Text received acknowledgment

### Audio delta (streaming chunks)

### Audio complete

### Word timestamps

Sent when `alignment=word` is set. Contains word-level timing information for the generated audio.

### TTS error
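For instance, a client that enables `alignment=word` can watch for the `conversation.item.word_timestamps` event, the one server event type named on this page. The field layout in this sketch is an assumption:

```python
import json

def handle_event(raw: str) -> None:
    event = json.loads(raw)
    if event.get("type") == "conversation.item.word_timestamps":
        # The "words" field name is illustrative; consult the API reference
        # for the actual payload shape.
        for word in event.get("words", []):
            print(word)
```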
## WebSocket example
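Below is a minimal end-to-end sketch in Python with the `websockets` package. The URL, query parameters, model, and voice come from this page; the send/receive message schemas (the placeholder `<...>` type strings and the base64 `audio` field) are assumptions to adapt against the API reference:

```python
import asyncio
import base64
import json

import websockets

API_KEY = "your_api_key"
URL = (
    "wss://api.together.ai/v1/audio/speech/websocket"
    "?model=hexgrad/Kokoro-82M&voice=af_alloy"
    "&response_format=pcm&sample_rate=24000"
    f"&api_key={API_KEY}"
)

async def main():
    async with websockets.connect(URL) as ws:
        # Append text, then commit the buffer to trigger synthesis.
        # "<append-type>" and "<commit-type>" stand in for the real
        # client message types.
        await ws.send(json.dumps({"type": "<append-type>",
                                  "text": "Hello from the WebSocket API!"}))
        await ws.send(json.dumps({"type": "<commit-type>"}))

        # Write raw PCM chunks to disk as they stream in.
        with open("out.pcm", "wb") as f:
            async for raw in ws:
                event = json.loads(raw)
                if event.get("type") == "<audio-delta-type>":       # placeholder
                    f.write(base64.b64decode(event["audio"]))       # assumed field
                elif event.get("type") == "<audio-complete-type>":  # placeholder
                    break

asyncio.run(main())
```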
## WebSocket parameters

When establishing a WebSocket connection, you can configure:

| Parameter | Type | Description |
|---|---|---|
| model | string | The TTS model to use |
| voice | string | The voice for generation |
| response_format | string | Audio format: mp3, opus, aac, flac, wav, or pcm |
| speed | float | Playback speed (default: 1.0) |
| max_partial_length | integer | Character buffer length before triggering TTS generation |
| sample_rate | integer | The sample rate of the output audio in Hz (e.g., 24000, 44100) |
| language | string | The language or locale code for speech synthesis (e.g., en, fr, es). Locales are supported and must be lowercase (e.g., zh-hk for Cantonese) |
| alignment | string | Controls word-level timestamp generation. Set to word to receive conversation.item.word_timestamps events, or none to disable (default: none) |
| segment | string | Controls how text is segmented before synthesis. Options: sentence (default) splits on sentence boundaries; immediate processes text as soon as it arrives; never waits until the buffer is committed |
| extra_params | object | Additional model-specific parameters. Supported fields: pronunciation_dict (array), a list of pronunciation rules for specific characters or symbols. Each entry uses the format "<source>/<replacement>" (e.g., ["omg/oh my god"]) to override how the model pronounces matching tokens. |
You can pass these query parameters either in the WebSocket URL (e.g., `wss://api.together.ai/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy&sample_rate=24000&alignment=word`) or dynamically via the `tts_session.updated` event after the connection is established.
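For example, a dynamic update carrying a pronunciation rule might look like this. The `tts_session.updated` type and the `"<source>/<replacement>"` rule format come from this page; the exact nesting under `extra_params` is an assumption:

```python
import json

# Apply a pronunciation rule mid-session. The nesting of pronunciation_dict
# under extra_params is assumed from the parameter table above.
update = json.dumps({
    "type": "tts_session.updated",
    "extra_params": {"pronunciation_dict": ["omg/oh my god"]},
})
```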
## Multi-context support

You can manage multiple independent TTS streams over a single WebSocket connection using `context_id`. This is useful for applications handling multiple simultaneous conversations or characters.
- Add `context_id` to any client message to route it to a specific context.
- Messages without `context_id` use the `"default"` context.
- Each context maintains its own text buffer and voice settings.
- Cancel a specific context with the `context.cancel` message type (see the sketch after this list).
- Send `tts_session.updated` without a `context_id` to update all contexts at once.
- Maximum 100 contexts per connection.
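As a sketch of routing two contexts over one connection (`context_id`, `context.cancel`, and `tts_session.updated` are named above; the append message type is a placeholder):

```python
import json

# Route text to two independent contexts over one connection.
# "<append-type>" is a placeholder for the real append message type.
msg_a = json.dumps({"type": "<append-type>", "context_id": "caller", "text": "Hi!"})
msg_b = json.dumps({"type": "<append-type>", "context_id": "agent", "text": "Hello!"})

# Cancel one context without touching the other.
cancel_a = json.dumps({"type": "context.cancel", "context_id": "caller"})

# Update every context at once by omitting context_id.
update_all = json.dumps({"type": "tts_session.updated", "speed": 1.1})
```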
When a context is cancelled, the server confirms with a `context.cancelled` message.
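Its payload is not reproduced on this page; a minimal sketch, assuming it echoes the cancelled context's ID:

```python
# Only the "context.cancelled" type name appears on this page;
# the context_id echo is assumed.
{"type": "context.cancelled", "context_id": "caller"}
```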
## See also
- Text-to-speech overview for parameters, response formats, voices, and pricing.
- Streaming for HTTP-based streaming and raw byte output.