Together AI’s real-time transcription API uses Voice Activity Detection (VAD) to automatically identify speech segments in an audio stream. While speech is ongoing, the server streams partial transcriptions as delta events. When VAD detects enough silence, the segment ends and the server emits a final completed event with the full transcript.
VAD runs a dedicated model on the server to compute a speech probability for each audio frame. Frames above a configurable threshold are classified as speech, and the resulting speech regions are grouped into segments based on silence gaps, minimum durations, and padding.
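As a rough illustration of that grouping step, the sketch below turns per-frame speech probabilities into segments using a threshold and a minimum-silence gap. It is a simplified model, not the server's implementation; it omits the max_speech_duration_s split and speech_pad_ms padding, and the frame duration is an assumption.

```python
def segment_speech(
    probs: list[float],         # per-frame speech probabilities from a VAD model
    frame_ms: int = 32,         # assumed duration of one frame, in milliseconds
    threshold: float = 0.3,     # frames above this are classified as speech
    min_silence_ms: int = 500,  # silence gap that closes a segment
    min_speech_ms: int = 250,   # segments shorter than this are discarded
) -> list[tuple[int, int]]:
    """Group speech frames into (start_ms, end_ms) segments."""
    segments, start, silence = [], None, 0
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p > threshold:
            if start is None:
                start = t  # a new speech region begins
            silence = 0
        elif start is not None:
            silence += frame_ms
            if silence >= min_silence_ms:  # enough silence: close the segment
                end = t + frame_ms - silence
                if end - start >= min_speech_ms:
                    segments.append((start, end))
                start, silence = None, 0
    if start is not None:  # audio ended while a segment was still open
        end = len(probs) * frame_ms - silence
        if end - start >= min_speech_ms:
            segments.append((start, end))
    return segments
```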
Parameters
All VAD parameters are optional. If omitted, the server uses sensible defaults tuned for conversational audio.

| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.3 | Speech probability threshold (0.0–1.0). Frames above this value are classified as speech. Lower values detect more speech but may increase false positives. |
| min_silence_duration_ms | int | 500 | How long silence must last (in ms) before a speech segment ends. Higher values prevent splitting on brief pauses. |
| min_speech_duration_ms | int | 250 | Minimum segment length in ms. Segments shorter than this are discarded, which is useful for filtering noise bursts. |
| max_speech_duration_s | float | 5.0 | Maximum segment length in seconds. Longer segments are split at the best internal silence point. |
| speech_pad_ms | int | 250 | Padding (in ms) added to the start and end of each segment to prevent clipping speech edges. Adjacent segments never overlap; if padding would cause overlap, the gap is split at the midpoint. |
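Written out as a configuration object, the defaults in the table are equivalent to:

```python
# The server-side defaults, written out explicitly (values from the table above).
default_vad = {
    "threshold": 0.3,
    "min_silence_duration_ms": 500,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": 5.0,
    "speech_pad_ms": 250,
}
```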
Common configurations
Conversational audio (default)
The defaults work well for typical voice assistant and conversational use cases: clean microphone audio at 16kHz with turn-taking between speakers.

Phone calls and low-quality audio
Phone audio (8kHz, low SNR) produces lower speech probabilities, so a much lower threshold is needed. A higher min_silence_duration_ms prevents splitting on the mid-sentence pauses common in call center recordings, and a higher max_speech_duration_s allows longer uninterrupted turns.
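For example, a phone-call configuration might look like this. The specific values are illustrative starting points, not tuned recommendations; validate them against your own audio.

```python
# Illustrative VAD settings for 8kHz phone audio (values are assumptions,
# not official recommendations; tune against your own recordings).
phone_vad = {
    "threshold": 0.15,               # lower: phone audio yields lower speech probabilities
    "min_silence_duration_ms": 800,  # higher: don't split on mid-sentence pauses
    "max_speech_duration_s": 15.0,   # higher: allow longer uninterrupted turns
}
```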
Configure VAD
You can configure VAD in two ways:

Query parameters at connection time

Pass VAD parameters directly in the WebSocket URL:
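For example (the endpoint URL below is a placeholder; see the API reference for the actual path):

```python
from urllib.parse import urlencode

# Placeholder base URL; consult the API reference for the real endpoint.
BASE_URL = "wss://api.together.ai/v1/realtime"

vad_params = {
    "threshold": 0.2,
    "min_silence_duration_ms": 700,
    "speech_pad_ms": 300,
}
url = f"{BASE_URL}?{urlencode(vad_params)}"
# wss://api.together.ai/v1/realtime?threshold=0.2&min_silence_duration_ms=700&speech_pad_ms=300
```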
Session message after connection
Send a transcription_session.update message after receiving session.created:
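A sketch of such a message (the nesting of the turn_detection object here is an assumption; check the API reference for the exact schema):

```python
import json

# Assumed message shape; verify the field nesting against the API reference.
update = {
    "type": "transcription_session.update",
    "session": {
        "turn_detection": {
            "threshold": 0.2,
            "min_silence_duration_ms": 700,
        }
    },
}
# await ws.send(json.dumps(update))  # `ws` is your open WebSocket connection
```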
Disable VAD

To disable VAD, set turn_detection=none in the WebSocket URL at connection time, or send a session update that sets turn_detection to null.

With VAD disabled, the server does not automatically segment audio. No completed events are emitted until you explicitly send an input_audio_buffer.commit message, at which point the entire buffered audio is transcribed. This is useful when your application controls segmentation externally.
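For instance, with VAD disabled you might buffer a chunk of audio and commit it yourself. In this sketch the input_audio_buffer.append message type and the base64 audio encoding are assumptions modeled on the commit message named above; confirm both against the API reference.

```python
import base64
import json

async def transcribe_chunk(ws, pcm_chunk: bytes):
    """Manually commit buffered audio when VAD is disabled.

    `ws` is an open WebSocket connection and `pcm_chunk` is raw audio bytes
    in whatever format the session expects (both assumed here).
    """
    # Assumed message type for appending audio; verify in the API reference.
    await ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    }))
    # Explicit commit: with VAD off, this triggers transcription of the buffer.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
```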
Example: real-time transcription with custom VAD
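The original sample for this section is not reproduced here; below is a minimal sketch of what such a client could look like in Python, assuming the placeholder endpoint and the message and event names used above. Authentication is omitted; supply your API key per the API reference.

```python
import asyncio
import base64
import json
from urllib.parse import urlencode

import websockets  # pip install websockets

# Placeholder endpoint with custom VAD settings; see the API reference
# for the real URL and the required authentication headers.
URL = "wss://api.together.ai/v1/realtime?" + urlencode({
    "threshold": 0.2,
    "min_silence_duration_ms": 700,
})

async def send_audio(ws, chunks):
    # `chunks` is any iterable of raw PCM byte strings (e.g., 16kHz mono);
    # the base64 encoding and append message type are assumptions.
    for chunk in chunks:
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
        await asyncio.sleep(0.05)  # pace the stream roughly in real time

async def receive_events(ws):
    # Exact event type strings are assumptions; the page above only names
    # "delta" (partial) and "completed" (final) events.
    async for raw in ws:
        event = json.loads(raw)
        etype = event.get("type", "")
        if etype.endswith(".delta"):
            print("partial:", event.get("delta"))
        elif etype.endswith(".completed"):
            print("final:", event.get("transcript"))

async def main(chunks):
    async with websockets.connect(URL) as ws:
        await asyncio.gather(send_audio(ws, chunks), receive_events(ws))
```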
Next steps
- See Streaming transcription for the full real-time streaming guide.
- See the API reference for the complete WebSocket endpoint specification.