Together AI’s real-time transcription API uses Voice Activity Detection (VAD) to automatically identify speech segments in an audio stream. While speech is ongoing, the server streams partial transcriptions as delta events. When VAD detects enough silence, the segment ends and the server emits a final completed event with the full transcript.

VAD runs a dedicated model on the server to compute a speech probability for each audio frame. Frames above a configurable threshold are classified as speech, and the resulting speech regions are grouped into segments based on silence gaps, minimum durations, and padding.
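The grouping step can be sketched roughly as follows. This is an illustrative model only, not the server's actual implementation (which is not public): the frame size, function name, and exact end-of-segment rules here are assumptions.

```python
# Illustrative sketch: group per-frame speech probabilities into segments.
# Assumes one probability per 32 ms frame; all names are hypothetical.

FRAME_MS = 32

def group_segments(probs, threshold=0.3, min_silence_ms=500, min_speech_ms=250):
    """Return (start_ms, end_ms) speech segments from frame probabilities."""
    segments = []
    start, silence = None, 0
    for i, p in enumerate(probs):
        if p > threshold:
            if start is None:
                start = i * FRAME_MS   # speech run begins
            silence = 0
        elif start is not None:
            silence += FRAME_MS
            if silence >= min_silence_ms:
                # Segment ended where the silence run began.
                end = (i + 1) * FRAME_MS - silence
                if end - start >= min_speech_ms:
                    segments.append((start, end))
                start, silence = None, 0
    if start is not None:  # close a segment still open at end of stream
        end = len(probs) * FRAME_MS
        if end - start >= min_speech_ms:
            segments.append((start, end))
    return segments
```

With defaults, a 640 ms burst of speech followed by enough silence yields one segment, while a burst shorter than min_speech_duration_ms is discarded.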

Parameters

All VAD parameters are optional. If omitted, the server uses sensible defaults tuned for conversational audio.
Parameter | Type | Default | Description
threshold | float | 0.3 | Speech probability threshold (0.0–1.0). Frames above this value are classified as speech. Lower values detect more speech but may increase false positives.
min_silence_duration_ms | int | 500 | How long silence must last (in ms) before a speech segment ends. Higher values prevent splitting on brief pauses.
min_speech_duration_ms | int | 250 | Minimum segment length in ms. Segments shorter than this are discarded, which is useful for filtering noise bursts.
max_speech_duration_s | float | 5.0 | Maximum segment length in seconds. Longer segments are split at the best internal silence point.
speech_pad_ms | int | 250 | Padding (in ms) added to the start and end of each segment to prevent clipping speech edges. Adjacent segments never overlap; if padding would cause overlap, the gap is split at the midpoint.
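The midpoint rule for speech_pad_ms can be illustrated with a hypothetical helper (the function name and the (start_ms, end_ms) segment representation are assumptions for illustration, not the server's code):

```python
# Sketch of the documented overlap rule: padding is trimmed at the midpoint
# of the original gap so padded segments never overlap.

def apply_padding(segments, pad_ms=250):
    """Pad sorted, non-overlapping (start_ms, end_ms) segments."""
    padded = []
    for i, (start, end) in enumerate(segments):
        new_start = max(0, start - pad_ms)
        new_end = end + pad_ms
        if i > 0 and new_start < padded[-1][1]:
            # Padding collided: split the original gap at its midpoint.
            mid = (segments[i - 1][1] + start) // 2
            padded[-1] = (padded[-1][0], mid)
            new_start = mid
        padded.append((new_start, new_end))
    return padded
```

For example, segments ending at 1000 ms and starting at 1200 ms leave only a 200 ms gap, so with 250 ms of padding both are trimmed to meet at 1100 ms.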

Common configurations

Conversational audio (default)

The defaults work well for typical voice assistant and conversational use cases — clean microphone audio at 16kHz with turn-taking between speakers.
{
  "type": "server_vad",
  "threshold": 0.3,
  "min_silence_duration_ms": 500,
  "min_speech_duration_ms": 250,
  "max_speech_duration_s": 5.0,
  "speech_pad_ms": 250
}

Phone calls and low-quality audio

Phone audio (8kHz, low SNR) produces lower speech probabilities, so a much lower threshold is needed. A higher min_silence_duration_ms prevents splitting on the mid-sentence pauses common in call center recordings, and a higher max_speech_duration_s allows longer uninterrupted turns.
{
  "type": "server_vad",
  "threshold": 0.01,
  "min_silence_duration_ms": 1000,
  "min_speech_duration_ms": 500,
  "max_speech_duration_s": 60,
  "speech_pad_ms": 10
}

Configuring VAD

VAD can be configured in two ways:

Query parameters at connection time

Pass VAD parameters directly in the WebSocket URL:
wss://api.together.ai/v1/realtime?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000&threshold=0.01&min_silence_duration_ms=1000
To disable VAD entirely, use turn_detection=none:
wss://api.together.ai/v1/realtime?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000&turn_detection=none
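If you build the URL programmatically, the standard library's urlencode keeps the query string well-formed. The base URL and parameter names below are taken from the examples above; the helper function name is ours.

```python
from urllib.parse import urlencode

def realtime_url(model, audio_format="pcm_s16le_16000", **vad_params):
    """Build a realtime WebSocket URL with optional VAD query parameters."""
    params = {"model": model, "input_audio_format": audio_format, **vad_params}
    return "wss://api.together.ai/v1/realtime?" + urlencode(params)

url = realtime_url(
    "openai/whisper-large-v3",
    threshold=0.01,
    min_silence_duration_ms=1000,
)
```

Note that urlencode percent-encodes the slash in the model name; servers treat this the same as the literal form shown above.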

Session message after connection

Send a transcription_session.updated message after receiving session.created:
{
  "type": "transcription_session.updated",
  "session": {
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.01,
      "min_silence_duration_ms": 1000,
      "min_speech_duration_ms": 500,
      "max_speech_duration_s": 60,
      "speech_pad_ms": 10
    }
  }
}
To disable VAD via session message, set turn_detection to null:
{
  "type": "transcription_session.updated",
  "session": {
    "turn_detection": null
  }
}

Disabling VAD

When VAD is disabled, the server does not automatically segment audio. No completed events are emitted until you explicitly send an input_audio_buffer.commit message, at which point the entire buffered audio is transcribed. This is useful when your application controls segmentation externally.
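With VAD disabled, a minimal exchange streams audio with append messages (payload truncated here to a placeholder):

```
{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM chunk>"
}
```

then triggers transcription explicitly with a single commit:

```
{
  "type": "input_audio_buffer.commit"
}
```

The server then transcribes the entire buffered audio and emits a completed event with the full transcript.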

Example: real-time transcription with custom VAD

import asyncio
import base64
import json
import os

import websockets

API_KEY = os.environ["TOGETHER_API_KEY"]
MODEL = "openai/whisper-large-v3"

VAD_CONFIG = {
    "type": "server_vad",
    "threshold": 0.3,
    "min_silence_duration_ms": 500,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": 5.0,
    "speech_pad_ms": 250,
}


async def transcribe():
    url = f"wss://api.together.ai/v1/realtime?model={MODEL}&input_audio_format=pcm_s16le_16000"
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Wait for session.created, then send VAD config
        msg = json.loads(await ws.recv())
        if msg["type"] == "session.created":
            await ws.send(
                json.dumps(
                    {
                        "type": "transcription_session.updated",
                        "session": {"turn_detection": VAD_CONFIG},
                    }
                )
            )

        # Send audio in 100ms chunks at real-time pace. This assumes
        # audio.wav contains raw 16kHz 16-bit mono PCM; for a standard
        # WAV file, skip the header so only raw samples are streamed.
        with open("audio.wav", "rb") as f:
            audio = f.read()

        CHUNK = 3200  # 100ms at 16kHz 16-bit
        for i in range(0, len(audio), CHUNK):
            await ws.send(
                json.dumps(
                    {
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(
                            audio[i : i + CHUNK]
                        ).decode(),
                    }
                )
            )
            await asyncio.sleep(0.1)

        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Receive transcription results
        async for message in ws:
            data = json.loads(message)
            if (
                data["type"]
                == "conversation.item.input_audio_transcription.completed"
            ):
                print(data["transcript"])
            elif (
                data["type"]
                == "conversation.item.input_audio_transcription.failed"
            ):
                print(f"Error: {data['error']['message']}")
                break


asyncio.run(transcribe())

Next steps