Together AI provides comprehensive audio transcription and translation capabilities powered by state-of-the-art speech recognition models including OpenAI’s Whisper and Voxtral. This guide covers everything from batch transcription to real-time streaming for low-latency applications.

Quick Start

Here’s how to get started with basic transcription and translation:
from together import Together

client = Together()

# Basic transcription

response = client.audio.transcriptions.create(
    file="path/to/audio.mp3",
    model="openai/whisper-large-v3",
    language="en",
)
print(response.text)

# Basic translation

response = client.audio.translations.create(
    file="path/to/foreign_audio.mp3",
    model="openai/whisper-large-v3",
)
print(response.text)

Available Models

Together AI supports multiple speech-to-text models:
| Organization | Model Name | Model String for API | Capabilities |
| --- | --- | --- | --- |
| OpenAI | Whisper Large v3 | openai/whisper-large-v3 | Real-time, Translation, Diarization |
| Mistral AI | Voxtral Mini 3B | mistralai/Voxtral-Mini-3B-2507 | |

Audio Transcription

Audio transcription converts speech to text in the same language as the source audio.
from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file="meeting_recording.mp3",
    model="openai/whisper-large-v3",
    language="en",
    response_format="json",
)

print(f"Transcription: {response.text}")
The API supports the following audio formats:
  • .wav (audio/wav)
  • .mp3 (audio/mpeg)
  • .m4a (audio/mp4)
  • .webm (audio/webm)
  • .flac (audio/flac)
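If you accept user uploads, a quick client-side extension check against the formats listed above lets unsupported files fail fast before any request is sent. This is a minimal sketch; the helper name and suffix set are illustrative, not part of the SDK.
Python
from pathlib import Path

# Suffixes matching the supported formats listed above (illustrative helper).
SUPPORTED_SUFFIXES = {".wav", ".mp3", ".m4a", ".webm", ".flac"}


def is_supported_audio(path: str) -> bool:
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


print(is_supported_audio("meeting_recording.mp3"))  # True
print(is_supported_audio("notes.ogg"))  # False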
Input Methods
Local File Path
Python
response = client.audio.transcriptions.create(
    file="/path/to/audio.mp3",
    model="openai/whisper-large-v3",
)
Path Object
Python
from pathlib import Path

audio_file = Path("recordings/interview.wav")
response = client.audio.transcriptions.create(
    file=audio_file,
    model="openai/whisper-large-v3",
)
URL
Python
response = client.audio.transcriptions.create(
    file="https://example.com/audio.mp3", model="openai/whisper-large-v3"
)
File-like Object
Python
with open("audio.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        file=audio_file,
        model="openai/whisper-large-v3",
    )
Language Support
Specify the audio language using ISO 639-1 language codes:
Python
response = client.audio.transcriptions.create(
    file="spanish_audio.mp3",
    model="openai/whisper-large-v3",
    language="es",  # Spanish
)
Commonly used language codes:
  • "en" - English
  • "es" - Spanish
  • "fr" - French
  • "de" - German
  • "ja" - Japanese
  • "zh" - Chinese
  • "auto" - Auto-detect (default)
Custom Prompts
Use prompts to improve transcription accuracy for specific contexts:
response = client.audio.transcriptions.create(
    file="medical_consultation.mp3",
    model="openai/whisper-large-v3",
    language="en",
    prompt="This is a medical consultation discussing patient symptoms, diagnosis, and treatment options.",
)

Real-time Streaming Transcription

For applications requiring the lowest latency, use the real-time WebSocket API. This provides streaming transcription with incremental results.
The WebSocket API is currently only available via raw WebSocket connections; SDK support is coming soon.
Establishing a Connection
Connect to: wss://api.together.ai/v1/realtime?model={model}&input_audio_format=pcm_s16le_16000
Headers:
{
  'Authorization': 'Bearer YOUR_API_KEY',
  'OpenAI-Beta': 'realtime=v1'
}
Query Parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model to use (e.g., openai/whisper-large-v3) |
| input_audio_format | string | Yes | Audio format: pcm_s16le_16000 |
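Below is a minimal connection sketch using the third-party websockets library with the URL and headers shown above; the helper name is illustrative, and a complete end-to-end example follows in the Real-time Example section.
Python
import os
import websockets


async def open_realtime_connection():
    url = (
        "wss://api.together.ai/v1/realtime"
        "?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000"
    )
    headers = {
        "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Returns an open WebSocket connection; close it (or use `async with`) when done.
    return await websockets.connect(url, additional_headers=headers)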
Client → Server Messages
1. Append Audio to Buffer
{
  "type": "input_audio_buffer.append",
  "audio": "base64-encoded-audio-chunk"
}
Send audio data in base64-encoded PCM format.
2. Commit Audio Buffer
{
  "type": "input_audio_buffer.commit"
}
Forces transcription of any remaining audio in the server-side buffer.
Server → Client Messages
Delta Events (Intermediate Results)
{
  "type": "conversation.item.input_audio_transcription.delta",
  "delta": "The quick brown fox jumps"
}
Delta events are intermediate transcriptions. The model is still processing and may revise the output. Each delta message overrides the previous delta.
Completed Events (Final Results)
{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "The quick brown fox jumps over the lazy dog"
}
Completed events are final transcriptions: the model is confident about this text, and the next delta event will continue from where the completed transcript left off.
Real-time Example
import asyncio
import websockets
import json
import base64
import os


async def transcribe_audio():
    api_key = os.environ.get("TOGETHER_API_KEY")
    url = "wss://api.together.ai/v1/realtime?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000"

    headers = {"Authorization": f"Bearer {api_key}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Read audio file
        with open("audio.wav", "rb") as f:
            audio_data = f.read()

        # Send audio in chunks with delay to simulate real-time
        chunk_size = 8192
        bytes_per_second = 16000 * 2  # 16kHz * 2 bytes (16-bit)
        delay_per_chunk = chunk_size / bytes_per_second

        for i in range(0, len(audio_data), chunk_size):
            chunk = audio_data[i : i + chunk_size]
            base64_chunk = base64.b64encode(chunk).decode("utf-8")
            await ws.send(
                json.dumps(
                    {
                        "type": "input_audio_buffer.append",
                        "audio": base64_chunk,
                    }
                )
            )
            # Simulate real-time streaming
            if i + chunk_size < len(audio_data):
                await asyncio.sleep(delay_per_chunk)

        # Commit the audio buffer
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Receive transcription results
        async for message in ws:
            data = json.loads(message)
            if (
                data["type"]
                == "conversation.item.input_audio_transcription.delta"
            ):
                print(f"Partial: {data['delta']}")
            elif (
                data["type"]
                == "conversation.item.input_audio_transcription.completed"
            ):
                print(f"Final: {data['transcript']}")
                break
            elif (
                data["type"]
                == "conversation.item.input_audio_transcription.failed"
            ):
                error = data.get("error", {})
                print(f"Error: {error.get('message')}")
                break


asyncio.run(transcribe_audio())
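The example above stops after the first completed event. For longer sessions with several commits, you may want a receive loop that keeps a running transcript. Here is an illustrative sketch that follows the delta/completed semantics described above; the helper name is not part of any SDK.
Python
import json


async def collect_transcript(ws):
    final_parts = []  # text from completed events
    pending = ""  # latest delta; each new delta overrides the previous one

    async for message in ws:
        data = json.loads(message)
        msg_type = data.get("type")
        if msg_type == "conversation.item.input_audio_transcription.delta":
            pending = data["delta"]
        elif msg_type == "conversation.item.input_audio_transcription.completed":
            final_parts.append(data["transcript"])
            pending = ""
        elif msg_type == "conversation.item.input_audio_transcription.failed":
            error = data.get("error", {})
            raise RuntimeError(error.get("message", "transcription failed"))

    # The loop ends when the connection closes; fall back to the last delta if needed.
    return " ".join(final_parts) or pending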

Audio Translation

Audio translation converts speech from any language to English text.
response = client.audio.translations.create(
    file="french_audio.mp3",
    model="openai/whisper-large-v3",
)
print(f"English translation: {response.text}")
Translation with Context
response = client.audio.translations.create(
    file="business_meeting_spanish.mp3",
    model="openai/whisper-large-v3",
    prompt="This is a business meeting discussing quarterly sales results.",
)

Speaker Diarization

Enable diarization to identify who is speaking when:
from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file="meeting.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    diarize="true",  # Enable speaker diarization
)

# Access speaker segments
print(response.speaker_segments)
Example Response with Diarization:
AudioSpeakerSegment(
    id=1,
    speaker_id='SPEAKER_01',
    start=6.268,
    end=30.776,
    text=(
        "Hello. Oh, hey, Justin. How are you doing? ..."
    ),
    words=[
        AudioTranscriptionWord(
            word='Hello.',
            start=6.268,
            end=11.314,
            id=0,
            speaker_id='SPEAKER_01'
        ),
        AudioTranscriptionWord(
            word='Oh,',
            start=11.834,
            end=11.894,
            id=1,
            speaker_id='SPEAKER_01'
        ),
        AudioTranscriptionWord(
            word='hey,',
            start=11.914,
            end=11.995,
            id=2,
            speaker_id='SPEAKER_01'
        ),
        ...
    ]
)
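To turn these segments into a readable, speaker-labelled transcript, you can iterate over them directly; the field names below follow the example response above.
Python
for segment in response.speaker_segments:
    print(
        f"[{segment.start:7.2f}s - {segment.end:7.2f}s] "
        f"{segment.speaker_id}: {segment.text}"
    )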

Word-level Timestamps

Get word-level timing information:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="word",
)

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")

# Access individual words with timestamps
if response.words:
    for word in response.words:
        print(f"'{word.word}' [{word.start:.2f}s - {word.end:.2f}s]")
Example Output:
Text
Text: It is certain that Jack Pumpkinhead might have had a much finer house to live in.
Language: en
Duration: 7.2562358276643995s
Task: None

'It' [0.00s - 0.36s]
'is' [0.42s - 0.47s]
'certain' [0.51s - 0.74s]
'that' [0.79s - 0.86s]
'Jack' [0.90s - 1.11s]
'Pumpkinhead' [1.15s - 1.66s]
'might' [1.81s - 2.00s]
'have' [2.04s - 2.13s]
'had' [2.16s - 2.26s]
'a' [2.30s - 2.32s]
'much' [2.36s - 2.48s]
'finer' [2.54s - 2.74s]
'house' [2.78s - 2.93s]
'to' [2.96s - 3.03s]
'live' [3.07s - 3.21s]
'in.' [3.26s - 7.27s]

Response Formats

JSON Format (Default)
Returns only the transcribed/translated text:
Python
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="json",
)

print(response.text)  # "Hello, this is a test recording."
Verbose JSON Format
Returns detailed information including timestamps:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="segment",
)

# Access segments with timestamps
for segment in response.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")
Example Output:
Text
[0.11s - 10.85s]: Call is now being recorded. Parker Scarves, how may I help you? Online for my wife, and it turns out they shipped the wrong... Oh, I am so sorry, sir. I got it for her birthday, which is tonight, and now I'm not 100% sure what I need to do. Okay, let me see if I can help. Do you have the item number of the Parker Scarves? I don't think so. Call the New Yorker, I... Excellent. What color do...

[10.88s - 21.73s]: Blue. The one they shipped was light blue. I wanted the darker one. What's the difference? The royal blue is a bit brighter. What zip code are you located in? One nine.

[22.04s - 32.62s]: Karen's Boutique, Termall. Is that close? I'm in my office. Okay, um, what is your name, sir? Charlie. Charlie Johnson. Is that J-O-H-N-S-O-N? And Mr. Johnson, do you have the Parker scarf in light blue with you now? I do. They shipped it to my office. It came in not that long ago. What I will do is make arrangements with Karen's Boutique for...

[32.62s - 41.03s]: you to Parker Scarf at no additional cost. And in addition, I was able to look up your order in our system, and I'm going to send out a special gift to you to make up for the inconvenience. Thank you. You're welcome. And thank you for calling Parker Scarf, and I hope your wife enjoys her birthday gift. Thank you. You're very welcome. Goodbye.

[43.50s - 44.20s]: you
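One common use of segment-level timestamps is generating subtitles. The sketch below writes the segments from a verbose_json response as a SubRip (.srt) file; the timestamp helper is illustrative.
Python
def srt_timestamp(seconds: float) -> str:
    # Format seconds as HH:MM:SS,mmm for SubRip.
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


with open("captions.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(response.segments, start=1):
        srt.write(f"{i}\n")
        srt.write(f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n")
        srt.write(f"{segment.text.strip()}\n\n")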

Advanced Features

Temperature Control
Adjust randomness in the output (0.0 = most deterministic, 1.0 = most random):
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    temperature=0.0,  # Most deterministic
)

print(f"Text: {response.text}")

Async Support

All transcription and translation operations support async/await.
Async Transcription
Python
import asyncio
from together import AsyncTogether


async def transcribe_audio():
    client = AsyncTogether()

    response = await client.audio.transcriptions.create(
        file="audio.mp3",
        model="openai/whisper-large-v3",
        language="en",
    )

    return response.text


# Run async function
result = asyncio.run(transcribe_audio())
print(result)
Async Translation
Python
async def translate_audio():
    client = AsyncTogether()

    response = await client.audio.translations.create(
        file="foreign_audio.mp3",
        model="openai/whisper-large-v3",
    )

    return response.text


result = asyncio.run(translate_audio())
print(result)
Concurrent Processing
Process multiple audio files concurrently:
Python
import asyncio
from together import AsyncTogether


async def process_multiple_files():
    client = AsyncTogether()

    files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

    tasks = [
        client.audio.transcriptions.create(
            file=file,
            model="openai/whisper-large-v3",
        )
        for file in files
    ]

    responses = await asyncio.gather(*tasks)

    for i, response in enumerate(responses):
        print(f"File {files[i]}: {response.text}")


asyncio.run(process_multiple_files())
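asyncio.gather launches every request at once. For large batches you may want to cap the number of in-flight requests; here is a sketch using asyncio.Semaphore (the limit of 4 and the helper names are arbitrary examples).
Python
import asyncio
from together import AsyncTogether


async def transcribe_all(files, max_concurrency=4):
    client = AsyncTogether()
    semaphore = asyncio.Semaphore(max_concurrency)  # cap concurrent requests

    async def transcribe(path):
        async with semaphore:
            response = await client.audio.transcriptions.create(
                file=path,
                model="openai/whisper-large-v3",
            )
            return path, response.text

    return await asyncio.gather(*(transcribe(f) for f in files))


results = asyncio.run(transcribe_all(["audio1.mp3", "audio2.mp3", "audio3.mp3"]))
for path, text in results:
    print(f"{path}: {text}")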

Best Practices

Choosing the Right Method
  • Batch Transcription: Best for pre-recorded audio files, podcasts, or any non-real-time use case
  • Real-time Streaming: Best for live conversations, voice assistants, or applications requiring immediate feedback
Audio Quality Tips
  • Use high-quality audio files for better transcription accuracy
  • Minimize background noise
  • Ensure clear speech with good volume levels
  • Use appropriate sample rates (16kHz or higher recommended)
  • For WebSocket streaming, use PCM format: pcm_s16le_16000
  • Consider file size limits for uploads
  • For long audio files, consider splitting into smaller chunks (see the sketch after this list)
  • Use streaming for real-time applications when available
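For resampling to the streaming PCM layout and for splitting long recordings into chunks, the third-party pydub library (which requires ffmpeg) is one option; the sketch below is illustrative and the 10-minute chunk length is an arbitrary example.
Python
from pydub import AudioSegment

audio = AudioSegment.from_file("long_recording.mp3")

# Convert to 16 kHz, 16-bit mono PCM (pcm_s16le_16000) for the WebSocket API.
pcm = audio.set_frame_rate(16000).set_sample_width(2).set_channels(1)
raw_bytes = pcm.raw_data  # base64-encode and send in input_audio_buffer.append chunks

# Split into 10-minute chunks for batch transcription of long files.
chunk_ms = 10 * 60 * 1000
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    audio[start:start + chunk_ms].export(f"chunk_{i:03d}.mp3", format="mp3")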
Diarization Best Practices
  • Works best with clear audio and distinct speakers
  • Speakers are labeled as SPEAKER_00, SPEAKER_01, etc.
  • Use with verbose_json format to get segment-level speaker information
Next Steps