
For the lowest latency and most interactive applications, use the WebSocket API. It lets you stream text input and receive audio chunks in real time over a single persistent connection, which is ideal for chatbots, live assistants, and voice agents. For one-shot requests where you only need a stream of audio bytes back, see Streaming instead.
The WebSocket API is currently available only via raw WebSocket connections; SDK support is coming soon.

Establish a connection

Connect to: wss://api.together.ai/v1/audio/speech/websocket

Authentication

  • Include your API key as a query parameter: ?api_key=<your_api_key>.
  • Or pass an Authorization header when establishing the WebSocket connection, as in the sketch below.
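
For example, a minimal connection sketch with aiohttp (the same client used in the full example further down); the two authentication options are interchangeable:

import asyncio
import os

import aiohttp


async def connect():
    api_key = os.environ["TOGETHER_API_KEY"]
    # Option 1: header-based auth (used here).
    headers = {"Authorization": f"Bearer {api_key}"}
    url = "wss://api.together.ai/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy"
    # Option 2: append &api_key=<your_api_key> to the URL and drop the header.
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.ws_connect(url) as ws:
            msg = await ws.receive()  # first server message is session.created
            print(msg.data)


asyncio.run(connect())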

Client-to-server messages

Append text to buffer

{
  "type": "input_text_buffer.append",
  "text": "Hello, this is a test sentence."
}
Appends text to the input buffer. Text is buffered until a complete sentence is formed or the maximum buffer length is reached.
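
Because of this buffering, you can stream partial fragments and let the server decide when to synthesize. A minimal sketch, assuming an open ws connection inside an async function as in the full example below:

# The server concatenates fragments and starts synthesis once the
# buffered text forms a complete sentence (or hits the maximum length).
await ws.send_json({"type": "input_text_buffer.append", "text": "Hello, "})
await ws.send_json({"type": "input_text_buffer.append", "text": "world."})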

Commit buffer

{
  "type": "input_text_buffer.commit"
}
Forces processing of all buffered text. Use this at the end of your input stream.

Clear buffer

{
  "type": "input_text_buffer.clear"
}
Clears all buffered text without processing it; text the model is already processing is unaffected.

Update session parameters

{
  "type": "tts_session.updated",
  "session": {
    "voice": "new_voice_id"
  }
}
Updates TTS session settings like voice in real time. If no context_id is specified, all contexts are updated.
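
For instance, to switch voices mid-stream (assuming an open ws connection as above; af_alloy is the voice used elsewhere in this page):

# Subsequent synthesis uses the new voice.
await ws.send_json({
    "type": "tts_session.updated",
    "session": {"voice": "af_alloy"},
})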

Server-to-client messages

Session created

{
  "event_id": "uuid-string",
  "type": "session.created",
  "session": {
    "id": "session-uuid",
    "object": "realtime.tts.session",
    "modalities": ["text", "audio"],
    "model": "canopylabs/orpheus-3b-0.1-ft",
    "voice": "tara"
  }
}

Text received acknowledgment

{
  "type": "conversation.item.input_text.received",
  "text": "Hello, this is a test sentence."
}

Audio delta (streaming chunks)

{
  "type": "conversation.item.audio_output.delta",
  "item_id": "tts_1",
  "delta": "base64-encoded-audio-chunk"
}

Audio complete

{
  "type": "conversation.item.audio_output.done",
  "item_id": "tts_1"
}

Word timestamps

Sent when alignment=word is set. Contains word-level timing information for the generated audio.
{
  "type": "conversation.item.word_timestamps",
  "item_id": "tts_1",
  "words": ["Hello", "world"],
  "start_seconds": [0.0, 0.4],
  "end_seconds": [0.4, 0.8]
}
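
The three arrays are parallel, so each word can be paired with its timing window. A small sketch, where data is a parsed word_timestamps event:

# Pair each word with its start/end times in the generated audio.
for word, start, end in zip(
    data["words"], data["start_seconds"], data["end_seconds"]
):
    print(f"{word}: {start:.2f}s to {end:.2f}s")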

TTS error

{
  "type": "conversation.item.tts.failed",
  "error": {
    "message": "Error description",
    "type": "error_type",
    "code": "error_code"
  }
}

WebSocket example

import asyncio
import aiohttp
import json
import base64
import os


async def generate_speech():
    api_key = os.environ.get("TOGETHER_API_KEY")
    url = (
        "wss://api.together.ai/v1/audio/speech"
        "/websocket?model=hexgrad/Kokoro-82M"
        "&voice=af_alloy"
        "&response_format=pcm"
        "&sample_rate=24000"
        "&alignment=word"  # enables the word_timestamps events handled below
    )

    headers = {"Authorization": f"Bearer {api_key}"}

    text_chunks = [
        "Hello, this is a test.",
        "This is the second sentence.",
        "And this is the final one.",
    ]

    audio_chunks = []

    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.ws_connect(url) as ws:
            # Wait for session.created
            msg = await ws.receive()
            session_data = json.loads(msg.data)
            print(f"Session created: {session_data['session']['id']}")

            async def send_text():
                for chunk in text_chunks:
                    await ws.send_json(
                        {
                            "type": "input_text_buffer.append",
                            "text": chunk,
                        }
                    )
                    print(f"Sent: {chunk}")
                    await asyncio.sleep(0.5)
                await ws.send_json({"type": "input_text_buffer.commit"})
                print("Committed")

            async def receive_audio():
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        data = json.loads(msg.data)
                        mtype = data.get("type", "")

                        if mtype == "conversation.item.audio_output.delta":
                            chunk = base64.b64decode(data.get("delta", ""))
                            audio_chunks.append(chunk)

                        elif mtype == "conversation.item.word_timestamps":
                            words = data.get("words", [])
                            starts = data.get("start_seconds", [])
                            stamps = list(
                                zip(words, [f"{s:.2f}s" for s in starts])
                            )
                            print(f"  timestamps: {stamps}")

                        elif mtype in (
                            "error",
                            "conversation.item.tts.failed",
                        ):
                            err = data.get(
                                "error",
                                data.get("message"),
                            )
                            print(f"Error: {err}")
                            return

                    elif msg.type in (
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break

            send_task = asyncio.create_task(send_text())
            recv_task = asyncio.create_task(receive_audio())

            await send_task

            # Heuristic stop: wait up to 10s, ending early once no new
            # audio chunks have arrived for ~0.3s.
            deadline = asyncio.get_event_loop().time() + 10
            while asyncio.get_event_loop().time() < deadline:
                await asyncio.sleep(0.1)
                n = len(audio_chunks)
                await asyncio.sleep(0.3)
                if len(audio_chunks) == n:
                    break

            recv_task.cancel()
            try:
                await recv_task
            except asyncio.CancelledError:
                pass

    if audio_chunks:
        pcm = b"".join(audio_chunks)
        with open("output.pcm", "wb") as f:
            f.write(pcm)
        print(
            f"\nAudio saved to output.pcm ({len(pcm):,} bytes, "
            f"{len(pcm)/48000:.1f}s at 24kHz)"
        )
        print("Play with: ffplay -f s16le -ar 24000 output.pcm")
    else:
        print("No audio received")


asyncio.run(generate_speech())
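
If you'd rather save a playable file directly, you can wrap the raw PCM in a WAV header with the standard-library wave module. A sketch reusing the pcm bytes from the example; the ffplay command above implies 16-bit samples (s16le), while mono is an assumption here:

import wave

with wave.open("output.wav", "wb") as wav:
    wav.setnchannels(1)      # assumed mono output
    wav.setsampwidth(2)      # 16-bit samples (s16le)
    wav.setframerate(24000)  # matches the sample_rate query parameter
    wav.writeframes(pcm)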

WebSocket parameters

When establishing a WebSocket connection, you can configure:
  • model (string): The TTS model to use.
  • voice (string): The voice for generation.
  • response_format (string): Audio format: mp3, opus, aac, flac, wav, or pcm.
  • speed (float): Playback speed (default: 1.0).
  • max_partial_length (integer): Character buffer length before triggering TTS generation.
  • sample_rate (integer): The sample rate of the output audio in Hz (e.g., 24000, 44100).
  • language (string): The language or locale code for speech synthesis (e.g., en, fr, es). Locales are supported and must be lowercase (e.g., zh-hk for Cantonese).
  • alignment (string): Controls word-level timestamp generation. Set to word to receive conversation.item.word_timestamps events, or none to disable (default: none).
  • segment (string): Controls how text is segmented before synthesis. Options: sentence (default) splits on sentence boundaries, immediate processes text as soon as it arrives, and never waits until the buffer is committed.
  • extra_params (object): Additional model-specific parameters. Supported fields:
      • pronunciation_dict (array): A list of pronunciation rules for specific characters or symbols. Each entry uses the format "<source>/<replacement>" (e.g., ["omg/oh my god"]) to override how the model pronounces matching tokens.
You can pass these query parameters either in the WebSocket URL (e.g., wss://api.together.ai/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy&sample_rate=24000&alignment=word) or dynamically via the tts_session.updated event after the connection is established.
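
In Python, urllib.parse.urlencode keeps the query string readable as the parameter list grows, e.g.:

from urllib.parse import urlencode

params = {
    "model": "hexgrad/Kokoro-82M",
    "voice": "af_alloy",
    "response_format": "pcm",
    "sample_rate": 24000,
    "alignment": "word",
}
url = "wss://api.together.ai/v1/audio/speech/websocket?" + urlencode(params)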

Multi-context support

You can manage multiple independent TTS streams over a single WebSocket connection using context_id. This is useful for applications handling multiple simultaneous conversations or characters.
  • Add context_id to any client message to route it to a specific context.
  • Messages without context_id use the "default" context.
  • Each context maintains its own text buffer and voice settings.
  • Cancel a specific context with the context.cancel message type.
  • Send tts_session.updated without a context_id to update all contexts at once.
  • Maximum 100 contexts per connection.
Sending text to a specific context:
{
  "type": "input_text_buffer.append",
  "text": "Hello from context one.",
  "context_id": "conversation-1"
}
Cancelling a context:
{
  "type": "context.cancel",
  "context_id": "conversation-1"
}
The server confirms cancellation with a context.cancelled message:
{
  "type": "context.cancelled",
  "context_id": "conversation-1"
}
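
Putting it together, a sketch that drives two contexts over one connection and then cancels one of them (assuming an open ws connection as in the full example above):

# Each context keeps its own text buffer and voice settings.
for ctx, text in [
    ("conversation-1", "Hello from context one."),
    ("conversation-2", "Hello from context two."),
]:
    await ws.send_json({
        "type": "input_text_buffer.append",
        "text": text,
        "context_id": ctx,
    })
    await ws.send_json({"type": "input_text_buffer.commit", "context_id": ctx})

# Stop one stream without affecting the other.
await ws.send_json({"type": "context.cancel", "context_id": "conversation-1"})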

See also