Together AI provides comprehensive audio transcription and translation capabilities powered by state-of-the-art speech recognition models, including OpenAI’s Whisper and Mistral AI’s Voxtral. This guide covers everything from batch transcription to real-time streaming for low-latency applications.

Quick Start

Here’s how to get started with basic transcription and translation:
from together import Together

client = Together()

# Basic transcription

response = client.audio.transcriptions.create(
    file="path/to/audio.mp3",
    model="openai/whisper-large-v3",
    language="en",
)
print(response.text)

# Basic translation

response = client.audio.translations.create(
    file="path/to/foreign_audio.mp3",
    model="openai/whisper-large-v3",
)
print(response.text)

Available Models

Together AI supports multiple speech-to-text models:
Organization | Model Name | Model String for API | Capabilities
OpenAI | Whisper Large v3 | openai/whisper-large-v3 | Real-time, Translation, Diarization
Mistral AI | Voxtral Mini 3B | mistralai/Voxtral-Mini-3B-2507 |

Audio Transcription

Audio transcription converts speech to text in the same language as the source audio.
from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file="meeting_recording.mp3",
    model="openai/whisper-large-v3",
    language="en",
    response_format="json",
)

print(f"Transcription: {response.text}")
The API supports the following audio formats:
  • .wav (audio/wav)
  • .mp3 (audio/mpeg)
  • .m4a (audio/mp4)
  • .webm (audio/webm)
  • .flac (audio/flac)
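If you accept arbitrary uploads, a quick extension check against this list lets you fail fast before calling the API. A minimal sketch; the helper and constant names are hypothetical and not part of the SDK:
from pathlib import Path

# Formats listed above (hypothetical helper, not part of the SDK)
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".webm", ".flac"}


def is_supported_audio(path: str) -> bool:
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("recordings/interview.wav"))  # True
print(is_supported_audio("notes.txt"))                 # False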
Input Methods
Local File Path
response = client.audio.transcriptions.create(
    file="/path/to/audio.mp3",
    model="openai/whisper-large-v3",
)
Path Object
from pathlib import Path

audio_file = Path("recordings/interview.wav")
response = client.audio.transcriptions.create(
    file=audio_file,
    model="openai/whisper-large-v3",
)
URL
response = client.audio.transcriptions.create(
    file="https://example.com/audio.mp3", model="openai/whisper-large-v3"
)
File-like Object
with open("audio.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        file=audio_file,
        model="openai/whisper-large-v3",
    )
Language Support
Specify the audio language using ISO 639-1 language codes:
response = client.audio.transcriptions.create(
    file="spanish_audio.mp3",
    model="openai/whisper-large-v3",
    language="es",  # Spanish
)
Commonly used language codes:
  • “en” - English
  • “es” - Spanish
  • “fr” - French
  • “de” - German
  • “ja” - Japanese
  • “zh” - Chinese
  • “auto” - Auto-detect (default)
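When the source language is unknown, you can rely on auto-detection (the default) and, with the verbose_json format, read the detected language back from response.language (shown later in this guide). A small sketch:
response = client.audio.transcriptions.create(
    file="unknown_language.mp3",
    model="openai/whisper-large-v3",
    language="auto",  # or omit the parameter to use the default auto-detection
    response_format="verbose_json",
)

print(response.language)  # detected language code, e.g. "es"
print(response.text)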
Custom Prompts
Use prompts to improve transcription accuracy for specific contexts:
response = client.audio.transcriptions.create(
    file="medical_consultation.mp3",
    model="openai/whisper-large-v3",
    language="en",
    prompt="This is a medical consultation discussing patient symptoms, diagnosis, and treatment options.",
)

Real-time Streaming Transcription

For applications requiring the lowest latency, use the real-time WebSocket API. This provides streaming transcription with incremental results.
The WebSocket API is currently only available via raw WebSocket connections. SDK support coming soon.
Establishing a Connection
Connect to: wss://api.together.ai/v1/realtime?model={model}&input_audio_format=pcm_s16le_16000
Headers:
{
  'Authorization': 'Bearer YOUR_API_KEY',
  'OpenAI-Beta': 'realtime=v1'
}
Query Parameters
Parameter | Type | Required | Description
model | string | Yes | Model to use (e.g., openai/whisper-large-v3)
input_audio_format | string | Yes | Audio format: pcm_s16le_16000
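For reference, here is a minimal connection sketch using the third-party websockets library (an assumption; any client that supports custom handshake headers works). Note that the keyword argument for handshake headers is extra_headers in older websockets releases and additional_headers in newer ones. The complete example later in this section authenticates via query parameter and subprotocols instead.
import os

import websockets

API_KEY = os.environ["TOGETHER_API_KEY"]
URL = (
    "wss://api.together.ai/v1/realtime"
    "?model=openai/whisper-large-v3"
    "&input_audio_format=pcm_s16le_16000"
)


async def open_connection():
    # Send the documented headers with the WebSocket handshake.
    return await websockets.connect(
        URL,
        additional_headers={  # use extra_headers= on older websockets versions
            "Authorization": f"Bearer {API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
    )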
Client → Server Messages
1. Append Audio to Buffer
{
  "type": "input_audio_buffer.append",
  "audio": "base64-encoded-audio-chunk"
}
Send audio data in base64-encoded PCM format.
2. Commit Audio Buffer
{
  "type": "input_audio_buffer.commit"
}
Forces transcription of any remaining audio in the server-side buffer.
Server → Client Messages
Delta Events (Intermediate Results)
{
  "type": "conversation.item.input_audio_transcription.delta",
  "delta": "The quick brown fox jumps"
}
Delta events are intermediate transcriptions. The model is still processing and may revise the output. Each delta message overrides the previous delta.
Completed Events (Final Results)
{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "The quick brown fox jumps over the lazy dog"
}
Completed events are final transcriptions; the model is confident about this text. The next delta event will continue from where this completed transcript ended.
Real-time Example
import asyncio
import base64
import json
import os
import sys

import numpy as np
import sounddevice as sd
import websockets

# Configuration
API_KEY = os.getenv("TOGETHER_API_KEY")
MODEL = "openai/whisper-large-v3"
SAMPLE_RATE = 16000
BATCH_SIZE = 4096  # 256ms batches for optimal performance

if not API_KEY:
    print("Error: Set TOGETHER_API_KEY environment variable")
    sys.exit(1)


class RealtimeTranscriber:
    """Realtime transcription client for Together AI."""

    def __init__(self):
        self.ws = None
        self.stream = None
        self.is_ready = False
        self.audio_buffer = np.array([], dtype=np.float32)
        self.audio_queue = asyncio.Queue()

    async def connect(self):
        """Connect to Together AI API."""
        url = (
            f"wss://api.together.xyz/v1/realtime"
            f"?intent=transcription"
            f"&model={MODEL}"
            f"&input_audio_format=pcm_s16le_16000"
            f"&authorization=Bearer {API_KEY}"
        )

        self.ws = await websockets.connect(
            url,
            subprotocols=[
                "realtime",
                f"openai-insecure-api-key.{API_KEY}",
                "openai-beta.realtime-v1",
            ],
        )

    async def send_audio(self):
        """Capture and send audio to API."""

        def audio_callback(indata, frames, time, status):
            self.audio_queue.put_nowait(indata.copy().flatten())

        # Start microphone stream
        self.stream = sd.InputStream(
            samplerate=SAMPLE_RATE,
            channels=1,
            dtype="float32",
            blocksize=1024,
            callback=audio_callback,
        )
        self.stream.start()

        # Process and send audio
        while True:
            try:
                audio = await asyncio.wait_for(
                    self.audio_queue.get(), timeout=0.1
                )

                if self.ws and self.is_ready:
                    # Add to buffer
                    self.audio_buffer = np.concatenate(
                        [self.audio_buffer, audio]
                    )

                    # Send when buffer is full
                    while len(self.audio_buffer) >= BATCH_SIZE:
                        batch = self.audio_buffer[:BATCH_SIZE]
                        self.audio_buffer = self.audio_buffer[BATCH_SIZE:]

                        # Convert float32 to int16 PCM
                        audio_int16 = (
                            np.clip(batch, -1.0, 1.0) * 32767
                        ).astype(np.int16)
                        audio_base64 = base64.b64encode(
                            audio_int16.tobytes()
                        ).decode()

                        # Send to API
                        await self.ws.send(
                            json.dumps(
                                {
                                    "type": "input_audio_buffer.append",
                                    "audio": audio_base64,
                                }
                            )
                        )

            except asyncio.TimeoutError:
                continue
            except Exception as e:
                print(f"Error: {e}", file=sys.stderr)
                break

    async def receive_transcriptions(self):
        """Receive and display transcription results."""
        current_interim = ""

        try:
            async for message in self.ws:
                data = json.loads(message)

                if data["type"] == "session.created":
                    self.is_ready = True

                elif (
                    data["type"]
                    == "conversation.item.input_audio_transcription.delta"
                ):
                    # Interim result
                    print(
                        f"\r\033[90m{data['delta']}\033[0m", end="", flush=True
                    )
                    current_interim = data["delta"]

                elif (
                    data["type"]
                    == "conversation.item.input_audio_transcription.completed"
                ):
                    # Final result
                    if current_interim:
                        print("\r\033[K", end="")
                    print(f"\033[92m{data['transcript']}\033[0m")
                    current_interim = ""

                elif data["type"] == "error":
                    print(f"\nError: {data.get('message', 'Unknown error')}")

        except websockets.exceptions.ConnectionClosed:
            pass

    async def close(self):
        """Close connections and cleanup."""
        if self.stream:
            self.stream.stop()
            self.stream.close()

        # Flush remaining audio
        if len(self.audio_buffer) > 0 and self.ws and self.is_ready:
            try:
                audio_int16 = (
                    np.clip(self.audio_buffer, -1.0, 1.0) * 32767
                ).astype(np.int16)
                audio_base64 = base64.b64encode(audio_int16.tobytes()).decode()
                await self.ws.send(
                    json.dumps(
                        {
                            "type": "input_audio_buffer.append",
                            "audio": audio_base64,
                        }
                    )
                )
            except Exception:
                pass

        if self.ws:
            await self.ws.close()

    async def run(self):
        """Main execution loop."""
        try:
            print("🎤 Together AI Realtime Transcription")
            print("=" * 40)
            print("Connecting...")

            await self.connect()

            print("✓ Connected")
            print("✓ Recording started - speak now\n")

            # Run audio capture and transcription concurrently
            await asyncio.gather(
                self.send_audio(), self.receive_transcriptions()
            )

        except KeyboardInterrupt:
            print("\n\nStopped")
        except Exception as e:
            print(f"Error: {e}", file=sys.stderr)
        finally:
            await self.close()


async def main():
    transcriber = RealtimeTranscriber()
    await transcriber.run()


if __name__ == "__main__":
    asyncio.run(main())

Audio Translation

Audio translation converts speech from any language to English text.
response = client.audio.translations.create(
    file="french_audio.mp3",
    model="openai/whisper-large-v3",
)
print(f"English translation: {response.text}")
Translation with Context
response = client.audio.translations.create(
    file="business_meeting_spanish.mp3",
    model="openai/whisper-large-v3",
    prompt="This is a business meeting discussing quarterly sales results.",
)

Speaker Diarization

Enable diarization to identify who is speaking and when. If the number of speakers is known, you can also set min_speakers and max_speakers to the expected range to improve diarization accuracy.
from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file="meeting.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    diarize="true",  # Enable speaker diarization
    min_speakers=1,
    max_speakers=5,
)

# Access speaker segments
print(response.speaker_segments)
Example Response with Diarization:
AudioSpeakerSegment(
    id=1,
    speaker_id='SPEAKER_01',
    start=6.268,
    end=30.776,
    text=(
        "Hello. Oh, hey, Justin. How are you doing? ..."
    ),
    words=[
        AudioTranscriptionWord(
            word='Hello.',
            start=6.268,
            end=11.314,
            id=0,
            speaker_id='SPEAKER_01'
        ),
        AudioTranscriptionWord(
            word='Oh,',
            start=11.834,
            end=11.894,
            id=1,
            speaker_id='SPEAKER_01'
        ),
        AudioTranscriptionWord(
            word='hey,',
            start=11.914,
            end=11.995,
            id=2,
            speaker_id='SPEAKER_01'
        ),
        ...
    ]
)
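To render these segments as a speaker-labeled transcript, iterate over response.speaker_segments; a short sketch based on the fields shown above (speaker_id, start, end, text):
for segment in response.speaker_segments:
    # One line per diarized segment, e.g. "[6.27s - 30.78s] SPEAKER_01: Hello. ..."
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.speaker_id}: {segment.text}")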

Word-level Timestamps

Get word-level timing information:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="word",
)

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")

# Access individual words with timestamps
if response.words:
    for word in response.words:
        print(f"'{word.word}' [{word.start:.2f}s - {word.end:.2f}s]")
Example Output:
Text: It is certain that Jack Pumpkinhead might have had a much finer house to live in.
Language: en
Duration: 7.2562358276643995s
Task: None

'It' [0.00s - 0.36s]
'is' [0.42s - 0.47s]
'certain' [0.51s - 0.74s]
'that' [0.79s - 0.86s]
'Jack' [0.90s - 1.11s]
'Pumpkinhead' [1.15s - 1.66s]
'might' [1.81s - 2.00s]
'have' [2.04s - 2.13s]
'had' [2.16s - 2.26s]
'a' [2.30s - 2.32s]
'much' [2.36s - 2.48s]
'finer' [2.54s - 2.74s]
'house' [2.78s - 2.93s]
'to' [2.96s - 3.03s]
'live' [3.07s - 3.21s]
'in.' [3.26s - 7.27s]

Response Formats

JSON Format (Default)
Returns only the transcribed/translated text:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="json",
)

print(response.text)  # "Hello, this is a test recording."
Verbose JSON Format
Returns detailed information including timestamps:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="segment",
)

# Access segments with timestamps
for segment in response.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")
Example Output:
[0.11s - 10.85s]: Call is now being recorded. Parker Scarves, how may I help you? Online for my wife, and it turns out they shipped the wrong... Oh, I am so sorry, sir. I got it for her birthday, which is tonight, and now I'm not 100% sure what I need to do. Okay, let me see if I can help. Do you have the item number of the Parker Scarves? I don't think so. Call the New Yorker, I... Excellent. What color do...

[10.88s - 21.73s]: Blue. The one they shipped was light blue. I wanted the darker one. What's the difference? The royal blue is a bit brighter. What zip code are you located in? One nine.

[22.04s - 32.62s]: Karen's Boutique, Termall. Is that close? I'm in my office. Okay, um, what is your name, sir? Charlie. Charlie Johnson. Is that J-O-H-N-S-O-N? And Mr. Johnson, do you have the Parker scarf in light blue with you now? I do. They shipped it to my office. It came in not that long ago. What I will do is make arrangements with Karen's Boutique for...

[32.62s - 41.03s]: you to Parker Scarf at no additional cost. And in addition, I was able to look up your order in our system, and I'm going to send out a special gift to you to make up for the inconvenience. Thank you. You're welcome. And thank you for calling Parker Scarf, and I hope your wife enjoys her birthday gift. Thank you. You're very welcome. Goodbye.

[43.50s - 44.20s]: you

Advanced Features

Temperature Control
Adjust randomness in the output (0.0 = deterministic, 1.0 = creative):
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    temperature=0.0,  # Most deterministic
)

print(f"Text: {response.text}")

Async Support

All transcription and translation operations support async/await.
Async Transcription
import asyncio
from together import AsyncTogether


async def transcribe_audio():
    client = AsyncTogether()

    response = await client.audio.transcriptions.create(
        file="audio.mp3",
        model="openai/whisper-large-v3",
        language="en",
    )

    return response.text


# Run async function
result = asyncio.run(transcribe_audio())
print(result)
Async Translation
async def translate_audio():
    client = AsyncTogether()

    response = await client.audio.translations.create(
        file="foreign_audio.mp3",
        model="openai/whisper-large-v3",
    )

    return response.text


result = asyncio.run(translate_audio())
print(result)
Concurrent Processing
Process multiple audio files concurrently:
import asyncio
from together import AsyncTogether


async def process_multiple_files():
    client = AsyncTogether()

    files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]

    tasks = [
        client.audio.transcriptions.create(
            file=file,
            model="openai/whisper-large-v3",
        )
        for file in files
    ]

    responses = await asyncio.gather(*tasks)

    for i, response in enumerate(responses):
        print(f"File {files[i]}: {response.text}")


asyncio.run(process_multiple_files())

Best Practices

Choosing the Right Method
  • Batch Transcription: Best for pre-recorded audio files, podcasts, or any non-real-time use case
  • Real-time Streaming: Best for live conversations, voice assistants, or applications requiring immediate feedback
Audio Quality Tips
  • Use high-quality audio files for better transcription accuracy
  • Minimize background noise
  • Ensure clear speech with good volume levels
  • Use appropriate sample rates (16kHz or higher recommended)
  • For WebSocket streaming, use PCM format: pcm_s16le_16000
  • Consider file size limits for uploads
  • For long audio files, consider splitting into smaller chunks (see the sketch after this list)
  • Use streaming for real-time applications when available
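One way to split a long recording is with an audio library such as pydub (an assumption; any tool that can slice and export audio works). A rough sketch that cuts a file into 10-minute chunks and transcribes each chunk in turn:
from pydub import AudioSegment

from together import Together

client = Together()

CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks; adjust to your content

audio = AudioSegment.from_file("long_recording.mp3")
transcripts = []

for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
    chunk_path = f"chunk_{i}.mp3"
    # pydub slices are in milliseconds; write each chunk to disk before upload
    audio[start:start + CHUNK_MS].export(chunk_path, format="mp3")

    response = client.audio.transcriptions.create(
        file=chunk_path,
        model="openai/whisper-large-v3",
    )
    transcripts.append(response.text)

full_transcript = " ".join(transcripts)
print(full_transcript)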
Diarization Best Practices
  • Works best with clear audio and distinct speakers
  • Speakers are labeled as SPEAKER_00, SPEAKER_01, etc.
  • Use with verbose_json format to get segment-level speaker information
Next Steps