Together AI provides comprehensive audio transcription and translation capabilities powered by state-of-the-art speech recognition models including OpenAI’s Whisper and Voxtral. This guide covers everything from batch transcription to real-time streaming for low-latency applications.

Quick Start

Here’s how to get started with basic transcription and translation:
from together import Together

client = Together()

# Basic transcription

response = client.audio.transcriptions.create(
    file="path/to/audio.mp3",
    model="openai/whisper-large-v3",
    language="en",
)
print(response.text)

# Basic translation

response = client.audio.translations.create(
    file="path/to/foreign_audio.mp3",
    model="openai/whisper-large-v3",
)
print(response.text)

Available Models

Together AI supports multiple speech-to-text models:
| Organization | Model Name | Model String for API | Capabilities |
| --- | --- | --- | --- |
| OpenAI | Whisper Large v3 | openai/whisper-large-v3 | Real-time, Translation, Diarization |
| Mistral AI | Voxtral Mini 3B | mistralai/Voxtral-Mini-3B-2507 | |

Audio Transcription

Audio transcription converts speech to text in the same language as the source audio.
from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file="meeting_recording.mp3",
    model="openai/whisper-large-v3",
    language="en",
    response_format="json",
)

print(f"Transcription: {response.text}")
The API supports the following audio formats:
  • .wav (audio/wav)
  • .mp3 (audio/mpeg)
  • .m4a (audio/mp4)
  • .webm (audio/webm)
  • .flac (audio/flac)
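If you accept user uploads, a quick client-side extension check against the formats listed above lets unsupported files fail fast before any request is sent. This is a minimal sketch; the helper name and suffix set are illustrative, not part of the SDK.
Python
from pathlib import Path

# Suffixes matching the supported formats listed above (illustrative helper).
SUPPORTED_SUFFIXES = {".wav", ".mp3", ".m4a", ".webm", ".flac"}


def is_supported_audio(path: str) -> bool:
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


print(is_supported_audio("meeting_recording.mp3"))  # True
print(is_supported_audio("notes.ogg"))  # False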
Input Methods
Local File Path
Python
response = client.audio.transcriptions.create(
    file="/path/to/audio.mp3",
    model="openai/whisper-large-v3",
)
Path Object
Python
from pathlib import Path

audio_file = Path("recordings/interview.wav")
response = client.audio.transcriptions.create(
    file=audio_file,
    model="openai/whisper-large-v3",
)
URL
Python
response = client.audio.transcriptions.create(
    file="https://example.com/audio.mp3", model="openai/whisper-large-v3"
)
File-like Object
Python
with open("audio.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        file=audio_file,
        model="openai/whisper-large-v3",
    )
Language Support
Specify the audio language using ISO 639-1 language codes:
Python
response = client.audio.transcriptions.create(
    file="spanish_audio.mp3",
    model="openai/whisper-large-v3",
    language="es",  # Spanish
)
Commonly used language codes:
  • "en" - English
  • "es" - Spanish
  • "fr" - French
  • "de" - German
  • "ja" - Japanese
  • "zh" - Chinese
  • "auto" - Auto-detect (default)
Custom Prompts
Use prompts to improve transcription accuracy for specific contexts:
response = client.audio.transcriptions.create(
    file="medical_consultation.mp3",
    model="openai/whisper-large-v3",
    language="en",
    prompt="This is a medical consultation discussing patient symptoms, diagnosis, and treatment options.",
)

Real-time Streaming Transcription

For applications requiring the lowest latency, use the real-time WebSocket API. This provides streaming transcription with incremental results.
The WebSocket API is currently only available via raw WebSocket connections; SDK support is coming soon.
Establishing a Connection
Connect to: wss://api.together.ai/v1/realtime?model={model}&input_audio_format=pcm_s16le_16000
Headers:
{
  'Authorization': 'Bearer YOUR_API_KEY',
  'OpenAI-Beta': 'realtime=v1'
}
Query Parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model to use (e.g., openai/whisper-large-v3) |
| input_audio_format | string | Yes | Audio format: pcm_s16le_16000 |
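Below is a minimal connection sketch using the third-party websockets library with the URL and headers shown above; the helper name is illustrative, and a complete end-to-end example follows in the Real-time Example section.
Python
import os
import websockets


async def open_realtime_connection():
    url = (
        "wss://api.together.ai/v1/realtime"
        "?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000"
    )
    headers = {
        "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Returns an open WebSocket connection; close it (or use `async with`) when done.
    return await websockets.connect(url, additional_headers=headers)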
Client → Server Messages
1. Append Audio to Buffer
{
  "type": "input_audio_buffer.append",
  "audio": "base64-encoded-audio-chunk"
}
Send audio data in base64-encoded PCM format.
2. Commit Audio Buffer
{
  "type": "input_audio_buffer.commit"
}
Forces transcription of any remaining audio in the server-side buffer.
Server → Client Messages
Delta Events (Intermediate Results)
{
  "type": "conversation.item.input_audio_transcription.delta",
  "delta": "The quick brown fox jumps"
}
Delta events are intermediate transcriptions. The model is still processing and may revise the output. Each delta message overrides the previous delta.
Completed Events (Final Results)
{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "The quick brown fox jumps over the lazy dog"
}
Completed events are final transcriptions: the model is confident about this text, and the next delta event will continue from where the completed transcript left off.
Real-time Example
import asyncio
import websockets
import json
import base64
import os


async def transcribe_audio():
    api_key = os.environ.get("TOGETHER_API_KEY")
    url = "wss://api.together.ai/v1/realtime?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000"

    headers = {"Authorization": f"Bearer {api_key}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Read audio file
        with open("audio.wav", "rb") as f:
            audio_data = f.read()

        # Send audio in chunks with delay to simulate real-time
        chunk_size = 8192
        bytes_per_second = 16000 * 2  # 16kHz * 2 bytes (16-bit)
        delay_per_chunk = chunk_size / bytes_per_second

        for i in range(0, len(audio_data), chunk_size):
            chunk = audio_data[i : i + chunk_size]
            base64_chunk = base64.b64encode(chunk).decode("utf-8")
            await ws.send(
                json.dumps(
                    {
                        "type": "input_audio_buffer.append",
                        "audio": base64_chunk,
                    }
                )
            )
            # Simulate real-time streaming
            if i + chunk_size < len(audio_data):
                await asyncio.sleep(delay_per_chunk)

        # Commit the audio buffer
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Receive transcription results
        async for message in ws:
            data = json.loads(message)
            if (
                data["type"]
                == "conversation.item.input_audio_transcription.delta"
            ):
                print(f"Partial: {data['delta']}")
            elif (
                data["type"]
                == "conversation.item.input_audio_transcription.completed"
            ):
                print(f"Final: {data['transcript']}")
                break
            elif (
                data["type"]
                == "conversation.item.input_audio_transcription.failed"
            ):
                error = data.get("error", {})
                print(f"Error: {error.get('message')}")
                break


asyncio.run(transcribe_audio())
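The example above stops after the first completed event. For longer sessions with several commits, you may want a receive loop that keeps a running transcript. Here is an illustrative sketch that follows the delta/completed semantics described above; the helper name is not part of any SDK.
Python
import json


async def collect_transcript(ws):
    final_parts = []  # text from completed events
    pending = ""  # latest delta; each new delta overrides the previous one

    async for message in ws:
        data = json.loads(message)
        msg_type = data.get("type")
        if msg_type == "conversation.item.input_audio_transcription.delta":
            pending = data["delta"]
        elif msg_type == "conversation.item.input_audio_transcription.completed":
            final_parts.append(data["transcript"])
            pending = ""
        elif msg_type == "conversation.item.input_audio_transcription.failed":
            error = data.get("error", {})
            raise RuntimeError(error.get("message", "transcription failed"))

    # The loop ends when the connection closes; fall back to the last delta if needed.
    return " ".join(final_parts) or pending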

Audio Translation

Audio translation converts speech from any language to English text.
response = client.audio.translations.create(
    file="french_audio.mp3",
    model="openai/whisper-large-v3",
)
print(f"English translation: {response.text}")
Translation with Context
response = client.audio.translations.create(
    file="business_meeting_spanish.mp3",
    model="openai/whisper-large-v3",
    prompt="This is a business meeting discussing quarterly sales results.",
)

Speaker Diarization

Enable diarization to identify who is speaking when:
from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file="meeting.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    diarize="true",  # Enable speaker diarization
)

# Access speaker segments
print(response.speaker_segments)
Example Response with Diarization:
AudioSpeakerSegment(
    id=1,
    speaker_id='SPEAKER_01',
    start=6.268,
    end=30.776,
    text=(
        "Hello. Oh, hey, Justin. How are you doing? ..."
    ),
    words=[
        AudioTranscriptionWord(
            word='Hello.',
            start=6.268,
            end=11.314,
            id=0,
            speaker_id='SPEAKER_01'
        ),
        AudioTranscriptionWord(
            word='Oh,',
            start=11.834,
            end=11.894,
            id=1,
            speaker_id='SPEAKER_01'
        ),
        AudioTranscriptionWord(
            word='hey,',
            start=11.914,
            end=11.995,
            id=2,
            speaker_id='SPEAKER_01'
        ),
        ...
    ]
)
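To turn these segments into a readable, speaker-labelled transcript, you can iterate over them directly; the field names below follow the example response above.
Python
for segment in response.speaker_segments:
    print(
        f"[{segment.start:7.2f}s - {segment.end:7.2f}s] "
        f"{segment.speaker_id}: {segment.text}"
    )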

Word-level Timestamps

Get word-level timing information:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="word",
)

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")

# Access individual words with timestamps
if response.words:
    for word in response.words:
        print(f"'{word.word}' [{word.start:.2f}s - {word.end:.2f}s]")
Example Output:
Text
Text: It is certain that Jack Pumpkinhead might have had a much finer house to live in.
Language: en
Duration: 7.2562358276643995s
Task: None

'It' [0.00s - 0.36s]
'is' [0.42s - 0.47s]
'certain' [0.51s - 0.74s]
'that' [0.79s - 0.86s]
'Jack' [0.90s - 1.11s]
'Pumpkinhead' [1.15s - 1.66s]
'might' [1.81s - 2.00s]
'have' [2.04s - 2.13s]
'had' [2.16s - 2.26s]
'a' [2.30s - 2.32s]
'much' [2.36s - 2.48s]
'finer' [2.54s - 2.74s]
'house' [2.78s - 2.93s]
'to' [2.96s - 3.03s]
'live' [3.07s - 3.21s]
'in.' [3.26s - 7.27s]

Response Formats

JSON Format (Default)
Returns only the transcribed/translated text:
Python
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="json",
)

print(response.text)  # "Hello, this is a test recording."
Verbose JSON Format
Returns detailed information including timestamps:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="segment",
)

# Access segments with timestamps
for segment in response.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")
Example Output:
Text
[0.11s - 10.85s]: Call is now being recorded. Parker Scarves, how may I help you? Online for my wife, and it turns out they shipped the wrong... Oh, I am so sorry, sir. I got it for her birthday, which is tonight, and now I'm not 100% sure what I need to do. Okay, let me see if I can help. Do you have the item number of the Parker Scarves? I don't think so. Call the New Yorker, I... Excellent. What color do...

[10.88s - 21.73s]: Blue. The one they shipped was light blue. I wanted the darker one. What's the difference? The royal blue is a bit brighter. What zip code are you located in? One nine.

[22.04s - 32.62s]: Karen's Boutique, Termall. Is that close? I'm in my office. Okay, um, what is your name, sir? Charlie. Charlie Johnson. Is that J-O-H-N-S-O-N? And Mr. Johnson, do you have the Parker scarf in light blue with you now? I do. They shipped it to my office. It came in not that long ago. What I will do is make arrangements with Karen's Boutique for...

[32.62s - 41.03s]: you to Parker Scarf at no additional cost. And in addition, I was able to look up your order in our system, and I'm going to send out a special gift to you to make up for the inconvenience. Thank you. You're welcome. And thank you for calling Parker Scarf, and I hope your wife enjoys her birthday gift. Thank you. You're very welcome. Goodbye.

[43.50s - 44.20s]: you
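One common use of segment-level timestamps is generating subtitles. The sketch below writes the segments from a verbose_json response as a SubRip (.srt) file; the timestamp helper is illustrative.
Python
def srt_timestamp(seconds: float) -> str:
    # Format seconds as HH:MM:SS,mmm for SubRip.
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


with open("captions.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(response.segments, start=1):
        srt.write(f"{i}\n")
        srt.write(f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n")
        srt.write(f"{segment.text.strip()}\n\n")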

Advanced Features

Temperature Control
Adjust randomness in the output (0.0 = most deterministic, 1.0 = most random):
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    temperature=0.0,  # Most deterministic
)

print(f"Text: {response.text}")

Async Support

All transcription and translation operations support async/await.
Async Transcription
Python
import asyncio
from together import AsyncTogether


async def transcribe_audio():
    client = AsyncTogether()

    response = await client.audio.transcriptions.create(
        file="audio.mp3",
        model="openai/whisper-large-v3",
        language="en",
    )

    return response.text


# Run async function
result = asyncio.run(transcribe_audio())
print(result)
Async Translation
Python
async def translate_audio():
    client = AsyncTogether()

    response = await client.audio.translations.create(
        file="foreign_audio.mp3",
        model="openai/whisper-large-v3",
    )

    return response.text


result = asyncio.run(translate_audio())
print(result)
Concurrent Processing
Process multiple audio files concurrently:
Python
import asyncio
from together import AsyncTogether


async def process_multiple_files():
    client = AsyncTogether()

    files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

    tasks = [
        client.audio.transcriptions.create(
            file=file,
            model="openai/whisper-large-v3",
        )
        for file in files
    ]

    responses = await asyncio.gather(*tasks)

    for i, response in enumerate(responses):
        print(f"File {files[i]}: {response.text}")


asyncio.run(process_multiple_files())
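asyncio.gather launches every request at once. For large batches you may want to cap the number of in-flight requests; here is a sketch using asyncio.Semaphore (the limit of 4 and the helper names are arbitrary examples).
Python
import asyncio
from together import AsyncTogether


async def transcribe_all(files, max_concurrency=4):
    client = AsyncTogether()
    semaphore = asyncio.Semaphore(max_concurrency)  # cap concurrent requests

    async def transcribe(path):
        async with semaphore:
            response = await client.audio.transcriptions.create(
                file=path,
                model="openai/whisper-large-v3",
            )
            return path, response.text

    return await asyncio.gather(*(transcribe(f) for f in files))


results = asyncio.run(transcribe_all(["audio1.mp3", "audio2.mp3", "audio3.mp3"]))
for path, text in results:
    print(f"{path}: {text}")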

Best Practices

Choosing the Right Method
  • Batch Transcription: Best for pre-recorded audio files, podcasts, or any non-real-time use case
  • Real-time Streaming: Best for live conversations, voice assistants, or applications requiring immediate feedback
Audio Quality Tips
  • Use high-quality audio files for better transcription accuracy
  • Minimize background noise
  • Ensure clear speech with good volume levels
  • Use appropriate sample rates (16kHz or higher recommended)
  • For WebSocket streaming, use PCM format: pcm_s16le_16000
  • Consider file size limits for uploads
  • For long audio files, consider splitting into smaller chunks (see the sketch after this list)
  • Use streaming for real-time applications when available
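For resampling to the streaming PCM layout and for splitting long recordings into chunks, the third-party pydub library (which requires ffmpeg) is one option; the sketch below is illustrative and the 10-minute chunk length is an arbitrary example.
Python
from pydub import AudioSegment

audio = AudioSegment.from_file("long_recording.mp3")

# Convert to 16 kHz, 16-bit mono PCM (pcm_s16le_16000) for the WebSocket API.
pcm = audio.set_frame_rate(16000).set_sample_width(2).set_channels(1)
raw_bytes = pcm.raw_data  # base64-encode and send in input_audio_buffer.append chunks

# Split into 10-minute chunks for batch transcription of long files.
chunk_ms = 10 * 60 * 1000
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    audio[start:start + chunk_ms].export(f"chunk_{i:03d}.mp3", format="mp3")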
Diarization Best Practices
  • Works best with clear audio and distinct speakers
  • Speakers are labeled as SPEAKER_00, SPEAKER_01, etc.
  • Use with verbose_json format to get segment-level speaker information
Next Steps