Together AI provides comprehensive audio transcription and translation capabilities powered by OpenAI’s Whisper models. This guide covers everything you need to know to integrate speech-to-text functionality into your applications.

Quick Start

  1. Register for an account
First, register for an account to get an API key. New accounts come with $1 to get started. Once you’ve registered, export your account’s API key as an environment variable named TOGETHER_API_KEY:
Shell
export TOGETHER_API_KEY=xxxxx
  2. Install your preferred library
Together provides official client libraries for Python and TypeScript:
Shell
pip install together
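If you’re working in TypeScript, the client is published on npm as together-ai:
Shell
npm install together-ai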
  3. Run your first transcription
Here’s how to get started with basic transcription and translation:
Python
from together import Together

# Initialize the client
client = Together()

# Basic transcription
response = client.audio.transcriptions.create(
    file="path/to/audio.mp3",
    model="openai/whisper-large-v3",
    language="en"
)
print(response.text)

# Basic translation
response = client.audio.translations.create(
    file="path/to/foreign_audio.mp3",
    model="openai/whisper-large-v3"
)
print(response.text)

Audio Transcription

Audio transcription converts speech to text in the same language as the source audio.
Python
from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file="meeting_recording.mp3",
    model="openai/whisper-large-v3",
    language="en",
    response_format="json"
)

print(f"Transcription: {response.text}")
The API supports the following audio formats:
  • .wav (audio/wav)
  • .mp3 (audio/mpeg)
  • .m4a (audio/mp4)
  • .webm (audio/webm)
  • .flac (audio/flac)
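If you accept uploads from users, it can help to validate the file extension against this list before calling the API. Here is a minimal sketch (the SUPPORTED_EXTENSIONS set and validate_audio_file helper are illustrative, not part of the SDK):
Python
from pathlib import Path

# Extensions accepted by the endpoint, per the list above
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".webm", ".flac"}

def validate_audio_file(path: str) -> Path:
    """Fail fast if the file is missing or in an unsupported format."""
    audio_path = Path(path)
    if not audio_path.is_file():
        raise FileNotFoundError(f"No such file: {audio_path}")
    if audio_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {audio_path.suffix}")
    return audio_path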

Input Methods

Local File Path
Python
response = client.audio.transcriptions.create(
    file="/path/to/audio.mp3",
    model="openai/whisper-large-v3"
)
Path Object
Python
from pathlib import Path

audio_file = Path("recordings/interview.wav")
response = client.audio.transcriptions.create(
    file=audio_file,
    model="openai/whisper-large-v3"
)
URL
Python
response = client.audio.transcriptions.create(
    file="https://example.com/audio.mp3",
    model="openai/whisper-large-v3"
)
File-like Object
Python
with open("audio.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        file=audio_file,
        model="openai/whisper-large-v3"
    )

Language Support

Specify the audio language using ISO 639-1 language codes:
Python
response = client.audio.transcriptions.create(
    file="spanish_audio.mp3",
    model="openai/whisper-large-v3",
    language="es"  # Spanish
)
Commonly used language codes:
  • "en" - English
  • "es" - Spanish
  • "fr" - French
  • "de" - German
  • "ja" - Japanese
  • "zh" - Chinese
  • "auto" - Auto-detect (the default when no language is given; see the sketch below)
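When the source language is unknown, you can rely on auto-detection and read the detected language back from a verbose_json response. A minimal sketch (verbose_json is covered under Response Formats below):
Python
# language defaults to auto-detect when omitted; "auto" makes it explicit
response = client.audio.transcriptions.create(
    file="mystery_language.mp3",
    model="openai/whisper-large-v3",
    language="auto",
    response_format="verbose_json"
)
print(f"Detected language: {response.language}")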

Custom Prompts

Use prompts to improve transcription accuracy for specific contexts:
Python
response = client.audio.transcriptions.create(
    file="medical_consultation.mp3",
    model="openai/whisper-large-v3",
    language="en",
    prompt="This is a medical consultation discussing patient symptoms, diagnosis, and treatment options."
)

Audio Translation

Audio translation converts speech from any language to English text.
Python
response = client.audio.translations.create(
    file="french_audio.mp3",
    model="openai/whisper-large-v3"
)
print(f"English translation: {response.text}")
Translation with Context
Python
response = client.audio.translations.create(
    file="business_meeting_spanish.mp3",
    model="openai/whisper-large-v3",
    prompt="This is a business meeting discussing quarterly sales results."
)

Response Formats

JSON Format (Default)
Returns only the transcribed or translated text:
Python
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="json"
)

print(response.text)  # "Hello, this is a test recording."
Verbose JSON Format
Returns detailed information, including timestamps:
Python
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="segment"
)

# Access segments with timestamps
for segment in response.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")
Example Output:
Text
[0.11s - 10.85s]: Call is now being recorded. Parker Scarves, how may I help you? Online for my wife, and it turns out they shipped the wrong... Oh, I am so sorry, sir. I got it for her birthday, which is tonight, and now I'm not 100% sure what I need to do. Okay, let me see if I can help. Do you have the item number of the Parker Scarves? I don't think so. Call the New Yorker, I... Excellent. What color do...

[10.88s - 21.73s]: Blue. The one they shipped was light blue. I wanted the darker one. What's the difference? The royal blue is a bit brighter. What zip code are you located in? One nine.

[22.04s - 32.62s]: Karen's Boutique, Termall. Is that close? I'm in my office. Okay, um, what is your name, sir? Charlie. Charlie Johnson. Is that J-O-H-N-S-O-N? And Mr. Johnson, do you have the Parker scarf in light blue with you now? I do. They shipped it to my office. It came in not that long ago. What I will do is make arrangements with Karen's Boutique for...

[32.62s - 41.03s]: you to Parker Scarf at no additional cost. And in addition, I was able to look up your order in our system, and I'm going to send out a special gift to you to make up for the inconvenience. Thank you. You're welcome. And thank you for calling Parker Scarf, and I hope your wife enjoys her birthday gift. Thank you. You're very welcome. Goodbye.

[43.50s - 44.20s]: you
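Because each segment carries start and end times, verbose_json output maps naturally onto subtitle formats. Here is a minimal sketch that writes the segments to an SRT file (the two helper functions are illustrative, not part of the SDK):
Python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    millis = int((seconds - int(seconds)) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments) -> str:
    """Build SRT subtitle text from verbose_json segments."""
    blocks = []
    for i, segment in enumerate(segments, start=1):
        start = format_srt_time(segment.start)
        end = format_srt_time(segment.end)
        blocks.append(f"{i}\n{start} --> {end}\n{segment.text.strip()}\n")
    return "\n".join(blocks)

with open("transcript.srt", "w") as f:
    f.write(segments_to_srt(response.segments))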
Word-level Timestamps
Get word-level timing information:
Python
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="word"
)

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")
print(f"Task: {response.task}")

# Access individual words with timestamps
if response.words:
    for word in response.words:
        print(f"'{word.word}' [{word.start:.2f}s - {word.end:.2f}s]")
Example Output:
Text
Text: It is certain that Jack Pumpkinhead might have had a much finer house to live in.
Language: en
Duration: 7.2562358276643995s
Task: None

'It' [0.00s - 0.36s]
'is' [0.42s - 0.47s]
'certain' [0.51s - 0.74s]
'that' [0.79s - 0.86s]
'Jack' [0.90s - 1.11s]
'Pumpkinhead' [1.15s - 1.66s]
'might' [1.81s - 2.00s]
'have' [2.04s - 2.13s]
'had' [2.16s - 2.26s]
'a' [2.30s - 2.32s]
'much' [2.36s - 2.48s]
'finer' [2.54s - 2.74s]
'house' [2.78s - 2.93s]
'to' [2.96s - 3.03s]
'live' [3.07s - 3.21s]
'in.' [3.26s - 7.27s]

Advanced Features

Temperature Control
Adjust randomness in the output (0.0 = deterministic, 1.0 = most random):
Python
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    temperature=0.0  # Most deterministic
)

Async Support

All transcription and translation operations support async/await:

Async Transcription

Python
import asyncio
from together import AsyncTogether

async def transcribe_audio():
    client = AsyncTogether()
    
    response = await client.audio.transcriptions.create(
        file="audio.mp3",
        model="openai/whisper-large-v3",
        language="en"
    )
    
    return response.text

# Run the async function
result = asyncio.run(transcribe_audio())
print(result)

Async Translation

Python
async def translate_audio():
    client = AsyncTogether()
    
    response = await client.audio.translations.create(
        file="foreign_audio.mp3",
        model="openai/whisper-large-v3"
    )
    
    return response.text

result = asyncio.run(translate_audio())
print(result)

Concurrent Processing

Process multiple audio files concurrently:
Python
import asyncio
from together import AsyncTogether

async def process_multiple_files():
    client = AsyncTogether()
    
    files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
    
    tasks = [
        client.audio.transcriptions.create(
            file=file,
            model="openai/whisper-large-v3"
        )
        for file in files
    ]
    
    responses = await asyncio.gather(*tasks)
    
    for i, response in enumerate(responses):
        print(f"File {files[i]}: {response.text}")

asyncio.run(process_multiple_files())
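With many files, an unbounded gather can exceed rate limits. Here is a minimal sketch that caps in-flight requests with asyncio.Semaphore (the limit of 5 is an arbitrary illustration):
Python
import asyncio
from together import AsyncTogether

async def transcribe_with_limit(files, max_concurrent=5):
    client = AsyncTogether()
    semaphore = asyncio.Semaphore(max_concurrent)

    async def transcribe_one(file):
        # At most max_concurrent requests are in flight at once
        async with semaphore:
            return await client.audio.transcriptions.create(
                file=file,
                model="openai/whisper-large-v3"
            )

    return await asyncio.gather(*(transcribe_one(f) for f in files))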

Best Practices

  • Use high-quality audio files for better transcription accuracy
  • Minimize background noise
  • Ensure clear speech with good volume levels
  • Use appropriate sample rates (16kHz or higher recommended)
  • Consider file size limits for uploads
  • For long audio files, consider splitting them into smaller chunks (see the sketch after this list)
  • Use streaming for real-time applications when available
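For the chunking suggestion above, here is a minimal sketch using the third-party pydub library (pip install pydub; it also requires ffmpeg). The 10-minute chunk length and file naming are illustrative:
Python
from pydub import AudioSegment
from together import Together

client = Together()

# Split a long recording into 10-minute chunks and transcribe each
audio = AudioSegment.from_file("long_recording.mp3")
chunk_ms = 10 * 60 * 1000  # pydub slices by milliseconds

transcripts = []
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk_path = f"chunk_{i}.mp3"
    audio[start:start + chunk_ms].export(chunk_path, format="mp3")
    response = client.audio.transcriptions.create(
        file=chunk_path,
        model="openai/whisper-large-v3"
    )
    transcripts.append(response.text)

full_transcript = " ".join(transcripts)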

Next Steps