Together AI’s real-time transcription API uses Voice Activity Detection (VAD) to automatically identify speech segments in an audio stream. While speech is ongoing, the server streams partial transcriptions as delta events. When VAD detects enough silence, the segment ends and the server emits a final completed event with the full transcript.

VAD runs a dedicated model on the server to compute a speech probability for each audio frame. Frames above a configurable threshold are classified as speech, and the resulting speech regions are grouped into segments based on silence gaps, minimum durations, and padding.
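The grouping step can be sketched roughly as follows. This is an illustrative model only, not the server's actual implementation (which is not public): the frame size, function name, and exact end-of-segment rules here are assumptions.

```python
# Illustrative sketch: group per-frame speech probabilities into segments.
# Assumes one probability per 32 ms frame; all names are hypothetical.

FRAME_MS = 32

def group_segments(probs, threshold=0.3, min_silence_ms=500, min_speech_ms=250):
    """Return (start_ms, end_ms) speech segments from frame probabilities."""
    segments = []
    start, silence = None, 0
    for i, p in enumerate(probs):
        if p > threshold:
            if start is None:
                start = i * FRAME_MS   # speech run begins
            silence = 0
        elif start is not None:
            silence += FRAME_MS
            if silence >= min_silence_ms:
                # Segment ended where the silence run began.
                end = (i + 1) * FRAME_MS - silence
                if end - start >= min_speech_ms:
                    segments.append((start, end))
                start, silence = None, 0
    if start is not None:  # close a segment still open at end of stream
        end = len(probs) * FRAME_MS
        if end - start >= min_speech_ms:
            segments.append((start, end))
    return segments
```

With defaults, a 640 ms burst of speech followed by enough silence yields one segment, while a burst shorter than min_speech_duration_ms is discarded.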

Parameters

All VAD parameters are optional. If omitted, the server uses sensible defaults tuned for conversational audio.
Parameter | Type | Default | Description
threshold | float | 0.3 | Speech probability threshold (0.0–1.0). Frames above this value are classified as speech. Lower values detect more speech but may increase false positives.
min_silence_duration_ms | int | 500 | How long silence must last (in ms) before a speech segment ends. Higher values prevent splitting on brief pauses.
min_speech_duration_ms | int | 250 | Minimum segment length in ms. Segments shorter than this are discarded, which is useful for filtering noise bursts.
max_speech_duration_s | float | 5.0 | Maximum segment length in seconds. Longer segments are split at the best internal silence point.
speech_pad_ms | int | 250 | Padding (in ms) added to the start and end of each segment to prevent clipping speech edges. Adjacent segments never overlap; if padding would cause overlap, the gap is split at the midpoint.
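The midpoint rule for speech_pad_ms can be illustrated with a hypothetical helper (the function name and the (start_ms, end_ms) segment representation are assumptions for illustration, not the server's code):

```python
# Sketch of the documented overlap rule: padding is trimmed at the midpoint
# of the original gap so padded segments never overlap.

def apply_padding(segments, pad_ms=250):
    """Pad sorted, non-overlapping (start_ms, end_ms) segments."""
    padded = []
    for i, (start, end) in enumerate(segments):
        new_start = max(0, start - pad_ms)
        new_end = end + pad_ms
        if i > 0 and new_start < padded[-1][1]:
            # Padding collided: split the original gap at its midpoint.
            mid = (segments[i - 1][1] + start) // 2
            padded[-1] = (padded[-1][0], mid)
            new_start = mid
        padded.append((new_start, new_end))
    return padded
```

For example, segments ending at 1000 ms and starting at 1200 ms leave only a 200 ms gap, so with 250 ms of padding both are trimmed to meet at 1100 ms.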

Common configurations

Conversational audio (default)

The defaults work well for typical voice assistant and conversational use cases — clean microphone audio at 16kHz with turn-taking between speakers.
{
  "type": "server_vad",
  "threshold": 0.3,
  "min_silence_duration_ms": 500,
  "min_speech_duration_ms": 250,
  "max_speech_duration_s": 5.0,
  "speech_pad_ms": 250
}

Phone calls and low-quality audio

Phone audio (8kHz, low SNR) produces lower speech probabilities, so a much lower threshold is needed. A higher min_silence_duration_ms prevents splitting on the mid-sentence pauses common in call center recordings, and a higher max_speech_duration_s allows longer uninterrupted turns.
{
  "type": "server_vad",
  "threshold": 0.01,
  "min_silence_duration_ms": 1000,
  "min_speech_duration_ms": 500,
  "max_speech_duration_s": 60,
  "speech_pad_ms": 10
}

Configuring VAD

VAD can be configured in two ways:

Query parameters at connection time

Pass VAD parameters directly in the WebSocket URL:
wss://api.together.ai/v1/realtime?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000&threshold=0.01&min_silence_duration_ms=1000
To disable VAD entirely, use turn_detection=none:
wss://api.together.ai/v1/realtime?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000&turn_detection=none
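If you build the URL programmatically, the standard library's urlencode keeps the query string well-formed. The base URL and parameter names below are taken from the examples above; the helper function name is ours.

```python
from urllib.parse import urlencode

def realtime_url(model, audio_format="pcm_s16le_16000", **vad_params):
    """Build a realtime WebSocket URL with optional VAD query parameters."""
    params = {"model": model, "input_audio_format": audio_format, **vad_params}
    return "wss://api.together.ai/v1/realtime?" + urlencode(params)

url = realtime_url(
    "openai/whisper-large-v3",
    threshold=0.01,
    min_silence_duration_ms=1000,
)
```

Note that urlencode percent-encodes the slash in the model name; servers treat this the same as the literal form shown above.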

Session message after connection

Send a transcription_session.updated message after receiving session.created:
{
  "type": "transcription_session.updated",
  "session": {
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.01,
      "min_silence_duration_ms": 1000,
      "min_speech_duration_ms": 500,
      "max_speech_duration_s": 60,
      "speech_pad_ms": 10
    }
  }
}
To disable VAD via session message, set turn_detection to null:
{
  "type": "transcription_session.updated",
  "session": {
    "turn_detection": null
  }
}

Disabling VAD

When VAD is disabled, the server does not automatically segment audio. No completed events are emitted until you explicitly send an input_audio_buffer.commit message, at which point the entire buffered audio is transcribed. This is useful when your application controls segmentation externally.
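With VAD disabled, a minimal exchange streams audio with append messages (payload truncated here to a placeholder):

```
{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM chunk>"
}
```

then triggers transcription explicitly with a single commit:

```
{
  "type": "input_audio_buffer.commit"
}
```

The server then transcribes the entire buffered audio and emits a completed event with the full transcript.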

Example: real-time transcription with custom VAD

import asyncio
import base64
import json
import os

import websockets

API_KEY = os.environ["TOGETHER_API_KEY"]
MODEL = "openai/whisper-large-v3"

VAD_CONFIG = {
    "type": "server_vad",
    "threshold": 0.3,
    "min_silence_duration_ms": 500,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": 5.0,
    "speech_pad_ms": 250,
}


async def transcribe():
    url = f"wss://api.together.ai/v1/realtime?model={MODEL}&input_audio_format=pcm_s16le_16000"
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Wait for session.created, then send VAD config
        msg = json.loads(await ws.recv())
        if msg["type"] == "session.created":
            await ws.send(
                json.dumps(
                    {
                        "type": "transcription_session.updated",
                        "session": {"turn_detection": VAD_CONFIG},
                    }
                )
            )

        # Send audio in 100ms chunks at real-time pace. This assumes
        # audio.wav contains raw 16kHz 16-bit mono PCM; for a standard
        # WAV file, skip the header so only raw samples are streamed.
        with open("audio.wav", "rb") as f:
            audio = f.read()

        CHUNK = 3200  # 100ms at 16kHz 16-bit
        for i in range(0, len(audio), CHUNK):
            await ws.send(
                json.dumps(
                    {
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(
                            audio[i : i + CHUNK]
                        ).decode(),
                    }
                )
            )
            await asyncio.sleep(0.1)

        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Receive transcription results
        async for message in ws:
            data = json.loads(message)
            if (
                data["type"]
                == "conversation.item.input_audio_transcription.completed"
            ):
                print(data["transcript"])
            elif (
                data["type"]
                == "conversation.item.input_audio_transcription.failed"
            ):
                print(f"Error: {data['error']['message']}")
                break


asyncio.run(transcribe())

Next steps