Speech-to-Text

Learn how to transcribe and translate audio into text!

Together AI provides comprehensive audio transcription and translation capabilities powered by OpenAI's Whisper models. This guide covers everything you need to know to integrate speech-to-text functionality into your applications.

Quick Start

  1. Register for an account

First, register for an account to get an API key. New accounts come with $1 to get started.

Once you've registered, set your account's API key to an environment variable named TOGETHER_API_KEY:

export TOGETHER_API_KEY=xxxxx

  2. Install your preferred library

Together provides official libraries for Python and TypeScript:

# Python
pip install together

# TypeScript
npm install together-ai

  3. Run your first transcription

Here's how to get started with basic transcription and translation:

Python:

from together import Together

# Initialize the client
client = Together()

# Basic transcription
response = client.audio.transcriptions.create(
    file="path/to/audio.mp3",
    model="openai/whisper-large-v3",
    language="en"
)
print(response.text)

# Basic translation
response = client.audio.translations.create(
    file="path/to/foreign_audio.mp3",
    model="openai/whisper-large-v3"
)
print(response.text)

TypeScript:

import Together from 'together-ai';

const together = new Together();

// Basic transcription
const transcription = await together.audio.transcriptions.create({
  file: 'path/to/audio.mp3',
  model: 'openai/whisper-large-v3',
  language: 'en',
});
console.log(transcription.text);

// Basic translation
const translation = await together.audio.translations.create({
  file: 'path/to/foreign_audio.mp3',
  model: 'openai/whisper-large-v3',
});
console.log(translation.text);

cURL:

# Transcription
curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -F "[email protected]" \
     -F "model=openai/whisper-large-v3" \
     -F "language=en"

# Translation
curl -X POST "https://api.together.xyz/v1/audio/translations" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -F "file=@foreign_audio.mp3" \
     -F "model=openai/whisper-large-v3"

CLI:

# Transcription
together audio transcribe audio.mp3 \
  --model openai/whisper-large-v3 \
  --language en

# Translation
together audio translate foreign_audio.mp3 \
  --model openai/whisper-large-v3

Audio Transcription

Audio transcription converts speech to text in the same language as the source audio.

Python:

from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file="meeting_recording.mp3",
    model="openai/whisper-large-v3",
    language="en",
    response_format="json"
)

print(f"Transcription: {response.text}")

TypeScript:

import Together from 'together-ai';

const together = new Together();

const response = await together.audio.transcriptions.create({
  file: 'meeting_recording.mp3',
  model: 'openai/whisper-large-v3',
  language: 'en',
  response_format: 'json',
});

console.log(`Transcription: ${response.text}`);

CLI:

together audio transcribe meeting_recording.mp3 \
  --model openai/whisper-large-v3 \
  --language en \
  --response-format json

The API supports the following audio formats:

  • .wav (audio/wav)
  • .mp3 (audio/mpeg)
  • .m4a (audio/mp4)
  • .webm (audio/webm)
  • .flac (audio/flac)
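
If your application accepts uploads from users, a quick extension check against this list can reject unsupported files before they ever reach the API. A minimal sketch; the SUPPORTED_FORMATS mapping below simply restates the list above and is not an official SDK constant:

from pathlib import Path

# Supported extensions mapped to their MIME types (mirrors the list above)
SUPPORTED_FORMATS = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".m4a": "audio/mp4",
    ".webm": "audio/webm",
    ".flac": "audio/flac",
}

def check_audio_file(path: str) -> str:
    """Return the MIME type if the file extension is supported, else raise."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {suffix}")
    return SUPPORTED_FORMATS[suffix]

print(check_audio_file("recordings/interview.wav"))  # audio/wav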

Input Methods

Local File Path

response = client.audio.transcriptions.create(
    file="/path/to/audio.mp3",
    model="openai/whisper-large-v3"
)

Path Object

from pathlib import Path

audio_file = Path("recordings/interview.wav")
response = client.audio.transcriptions.create(
    file=audio_file,
    model="openai/whisper-large-v3"
)

URL

response = client.audio.transcriptions.create(
    file="https://example.com/audio.mp3",
    model="openai/whisper-large-v3"
)

File-like Object

with open("audio.mp3", "rb") as audio_file:
    response = client.audio.transcriptions.create(
        file=audio_file,
        model="openai/whisper-large-v3"
    )

Language Support

Specify the audio language using ISO 639-1 language codes:

response = client.audio.transcriptions.create(
    file="spanish_audio.mp3",
    model="openai/whisper-large-v3",
    language="es"  # Spanish
)

Common language codes:

  • "en" - English
  • "es" - Spanish
  • "fr" - French
  • "de" - German
  • "ja" - Japanese
  • "zh" - Chinese
  • "auto" - Auto-detect (default)

Custom Prompts

Use prompts to improve transcription accuracy for specific contexts:

Python:

response = client.audio.transcriptions.create(
    file="medical_consultation.mp3",
    model="openai/whisper-large-v3",
    language="en",
    prompt="This is a medical consultation discussing patient symptoms, diagnosis, and treatment options."
)

CLI:

together audio transcribe medical_consultation.mp3 \
  --model openai/whisper-large-v3 \
  --language en \
  --prompt "This is a medical consultation discussing patient symptoms, diagnosis, and treatment options."

Audio Translation

Audio translation converts speech from any language to English text.

Python:

response = client.audio.translations.create(
    file="french_audio.mp3",
    model="openai/whisper-large-v3"
)
print(f"English translation: {response.text}")

TypeScript:

const response = await together.audio.translations.create({
  file: 'french_audio.mp3',
  model: 'openai/whisper-large-v3',
});
console.log(`English translation: ${response.text}`);

CLI:

together audio translate french_audio.mp3 \
  --model openai/whisper-large-v3

Translation with Context

Python:

response = client.audio.translations.create(
    file="business_meeting_spanish.mp3",
    model="openai/whisper-large-v3",
    prompt="This is a business meeting discussing quarterly sales results."
)

CLI:

together audio translate business_meeting_spanish.mp3 \
  --model openai/whisper-large-v3 \
  --prompt "This is a business meeting discussing quarterly sales results."

Response Formats

JSON Format (Default)

Returns only the transcribed/translated text:

response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="json"
)

print(response.text)  # "Hello, this is a test recording."

Verbose JSON Format

Returns detailed information including timestamps:

Python:

response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="segment"
)

# Access segments with timestamps
for segment in response.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")

CLI:

together audio transcribe audio.mp3 \
  --model openai/whisper-large-v3 \
  --response-format verbose_json \
  --timestamp-granularities segment \
  --pretty

Example Output:

[0.11s - 10.85s]: Call is now being recorded. Parker Scarves, how may I help you? Online for my wife, and it turns out they shipped the wrong... Oh, I am so sorry, sir. I got it for her birthday, which is tonight, and now I'm not 100% sure what I need to do. Okay, let me see if I can help. Do you have the item number of the Parker Scarves? I don't think so. Call the New Yorker, I... Excellent. What color do...

[10.88s - 21.73s]: Blue. The one they shipped was light blue. I wanted the darker one. What's the difference? The royal blue is a bit brighter. What zip code are you located in? One nine.

[22.04s - 32.62s]: Karen's Boutique, Termall. Is that close? I'm in my office. Okay, um, what is your name, sir? Charlie. Charlie Johnson. Is that J-O-H-N-S-O-N? And Mr. Johnson, do you have the Parker scarf in light blue with you now? I do. They shipped it to my office. It came in not that long ago. What I will do is make arrangements with Karen's Boutique for...

[32.62s - 41.03s]: you to Parker Scarf at no additional cost. And in addition, I was able to look up your order in our system, and I'm going to send out a special gift to you to make up for the inconvenience. Thank you. You're welcome. And thank you for calling Parker Scarf, and I hope your wife enjoys her birthday gift. Thank you. You're very welcome. Goodbye.

[43.50s - 44.20s]: you
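
Segment timestamps map naturally onto subtitle formats. The sketch below writes the segments from a verbose_json response (continuing from the code above) into an SRT file; the srt_timestamp helper is our own, not part of the SDK:

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

with open("captions.srt", "w") as f:
    for i, segment in enumerate(response.segments, start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n")
        f.write(f"{segment.text.strip()}\n\n")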

Word-level Timestamps

Get word-level timing information:

Python:

response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="word"
)

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")
print(f"Task: {response.task}")

# Access individual words with timestamps
if response.words:
    for word in response.words:
        print(f"'{word.word}' [{word.start:.2f}s - {word.end:.2f}s]")

CLI:

together audio transcribe audio.mp3 \
  --model openai/whisper-large-v3 \
  --response-format verbose_json \
  --timestamp-granularities word \
  --pretty

Example Output:

Text: It is certain that Jack Pumpkinhead might have had a much finer house to live in.
Language: en
Duration: 7.2562358276643995s
Task: None

'It' [0.00s - 0.36s]
'is' [0.42s - 0.47s]
'certain' [0.51s - 0.74s]
'that' [0.79s - 0.86s]
'Jack' [0.90s - 1.11s]
'Pumpkinhead' [1.15s - 1.66s]
'might' [1.81s - 2.00s]
'have' [2.04s - 2.13s]
'had' [2.16s - 2.26s]
'a' [2.30s - 2.32s]
'much' [2.36s - 2.48s]
'finer' [2.54s - 2.74s]
'house' [2.78s - 2.93s]
'to' [2.96s - 3.03s]
'live' [3.07s - 3.21s]
'in.' [3.26s - 7.27s]
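
Word timings also make it easy to find pauses in the speech (note how the final 'in.' above absorbs several seconds of trailing silence). A minimal sketch that scans the response above for gaps between consecutive words; the 0.5-second threshold is an arbitrary choice:

PAUSE_THRESHOLD = 0.5  # seconds; tune to your audio

words = response.words or []
for prev, curr in zip(words, words[1:]):
    gap = curr.start - prev.end
    if gap > PAUSE_THRESHOLD:
        print(f"{gap:.2f}s pause between '{prev.word}' and '{curr.word}'")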

Advanced Features

Temperature Control

Adjust the sampling temperature to control randomness in the output (0.0 is the most deterministic; higher values up to 1.0 produce more varied output):

Python:

response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    temperature=0.0  # Most deterministic
)

CLI:

together audio transcribe audio.mp3 \
  --model openai/whisper-large-v3 \
  --temperature 0.0

Async Support

All transcription and translation operations support async/await:

Async Transcription

import asyncio
from together import AsyncTogether

async def transcribe_audio():
    client = AsyncTogether()
    
    response = await client.audio.transcriptions.create(
        file="audio.mp3",
        model="openai/whisper-large-v3",
        language="en"
    )
    
    return response.text

# Run async function
result = asyncio.run(transcribe_audio())
print(result)

Async Translation

async def translate_audio():
    client = AsyncTogether()
    
    response = await client.audio.translations.create(
        file="foreign_audio.mp3",
        model="openai/whisper-large-v3"
    )
    
    return response.text

result = asyncio.run(translate_audio())
print(result)

Concurrent Processing

Process multiple audio files concurrently:

import asyncio
from together import AsyncTogether

async def process_multiple_files():
    client = AsyncTogether()
    
    files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
    
    tasks = [
        client.audio.transcriptions.create(
            file=file,
            model="openai/whisper-large-v3"
        )
        for file in files
    ]
    
    responses = await asyncio.gather(*tasks)
    
    for i, response in enumerate(responses):
        print(f"File {files[i]}: {response.text}")

asyncio.run(process_multiple_files())

Best Practices

  • Use high-quality audio files for better transcription accuracy
  • Minimize background noise
  • Ensure clear speech with good volume levels
  • Use appropriate sample rates (16kHz or higher recommended)
  • Consider file size limits for uploads
  • For long audio files, consider splitting into smaller chunks (see the sketch after this list)
  • Use streaming for real-time applications when available
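
For the chunking suggestion above, here is a minimal sketch using the third-party pydub library (pip install pydub; it also requires ffmpeg) to split a long recording into five-minute pieces and transcribe each one. The filenames and chunk length are arbitrary:

from pydub import AudioSegment
from together import Together

client = Together()

CHUNK_MS = 5 * 60 * 1000  # five-minute chunks, in milliseconds

audio = AudioSegment.from_file("long_recording.mp3")

transcripts = []
for start in range(0, len(audio), CHUNK_MS):
    # Export the slice to a temporary file the API can consume
    chunk_path = f"chunk_{start // CHUNK_MS}.mp3"
    audio[start:start + CHUNK_MS].export(chunk_path, format="mp3")

    response = client.audio.transcriptions.create(
        file=chunk_path,
        model="openai/whisper-large-v3"
    )
    transcripts.append(response.text)

full_transcript = " ".join(transcripts)

Note that naive fixed-length splits can cut a word in half at a chunk boundary; splitting at pauses like those detected in the word-level timestamp example gives cleaner results.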

Next Steps