Learn how to transcribe and translate audio into text!
Together AI provides comprehensive audio transcription and translation capabilities powered by state-of-the-art speech recognition models including OpenAI’s Whisper and Voxtral. This guide covers everything from batch transcription to real-time streaming for low-latency applications.
Custom Prompts
Use prompts to improve transcription accuracy for specific contexts:
response = client.audio.transcriptions.create(
    file="medical_consultation.mp3",
    model="openai/whisper-large-v3",
    language="en",
    prompt="This is a medical consultation discussing patient symptoms, diagnosis, and treatment options.",
)
Send audio data in base64-encoded PCM format.

2. Commit Audio Buffer
{
  "type": "input_audio_buffer.commit"
}
Forces transcription of any remaining audio in the server-side buffer.

Server → Client Messages

Delta Events (Intermediate Results)
{
  "type": "conversation.item.input_audio_transcription.delta",
  "delta": "The quick brown fox jumps"
}
Delta events are intermediate transcriptions. The model is still processing and may revise the output. Each delta message overrides the previous delta.

Completed Events (Final Results)
{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "The quick brown fox jumps over the lazy dog"
}
Completed events are final transcriptions. The model is confident about this text. The next delta event continues from where this completed transcript ended.

Real-time Example
import asyncio
import websockets
import json
import base64
import os


async def transcribe_audio():
    api_key = os.environ.get("TOGETHER_API_KEY")
    url = "wss://api.together.ai/v1/realtime?model=openai/whisper-large-v3&input_audio_format=pcm_s16le_16000"
    headers = {"Authorization": f"Bearer {api_key}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Read audio file
        with open("audio.wav", "rb") as f:
            audio_data = f.read()

        # Send audio in chunks with delay to simulate real-time
        chunk_size = 8192
        bytes_per_second = 16000 * 2  # 16kHz * 2 bytes (16-bit)
        delay_per_chunk = chunk_size / bytes_per_second

        for i in range(0, len(audio_data), chunk_size):
            chunk = audio_data[i : i + chunk_size]
            base64_chunk = base64.b64encode(chunk).decode("utf-8")
            await ws.send(
                json.dumps(
                    {
                        "type": "input_audio_buffer.append",
                        "audio": base64_chunk,
                    }
                )
            )

            # Simulate real-time streaming
            if i + chunk_size < len(audio_data):
                await asyncio.sleep(delay_per_chunk)

        # Commit the audio buffer
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Receive transcription results
        async for message in ws:
            data = json.loads(message)
            if (
                data["type"]
                == "conversation.item.input_audio_transcription.delta"
            ):
                print(f"Partial: {data['delta']}")
            elif (
                data["type"]
                == "conversation.item.input_audio_transcription.completed"
            ):
                print(f"Final: {data['transcript']}")
                break
            elif (
                data["type"]
                == "conversation.item.input_audio_transcription.failed"
            ):
                error = data.get("error", {})
                print(f"Error: {error.get('message')}")
                break


asyncio.run(transcribe_audio())
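The delta and completed semantics described above (each delta replaces the previous one, and a completed event finalizes the text) can be sketched as a small state holder. This is a minimal illustration, not part of the Together API: the `TranscriptAssembler` class and its method names are hypothetical, and joining finalized pieces with a single space is an assumption about how you want to display the result.

```python
import json


class TranscriptAssembler:
    """Hypothetical helper: tracks finalized text plus the latest partial."""

    def __init__(self):
        self.final_parts = []  # text from completed events, in arrival order
        self.partial = ""      # most recent delta; replaced, never appended

    def handle(self, raw_message):
        """Apply one server message (a JSON string) and return the current text."""
        event = json.loads(raw_message)
        kind = event.get("type", "")
        if kind == "conversation.item.input_audio_transcription.delta":
            # Each delta overrides the previous delta.
            self.partial = event.get("delta", "")
        elif kind == "conversation.item.input_audio_transcription.completed":
            # Final text: store it and clear the partial it supersedes.
            self.final_parts.append(event.get("transcript", ""))
            self.partial = ""
        return self.text()

    def text(self):
        """Finalized text followed by the in-progress partial, if any."""
        parts = self.final_parts + ([self.partial] if self.partial else [])
        # Assumption: pieces are joined with one space for display.
        return " ".join(p.strip() for p in parts if p.strip())
```

In a receive loop like the one in the example, you could call `assembler.handle(message)` for each incoming message and redraw the returned string, so revised deltas replace earlier partial text instead of accumulating.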
response = client.audio.translations.create(
    file="business_meeting_spanish.mp3",
    model="openai/whisper-large-v3",
    prompt="This is a business meeting discussing quarterly sales results.",
)
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="word",
)

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")

# Access individual words with timestamps
if response.words:
    for word in response.words:
        print(f"'{word.word}' [{word.start:.2f}s - {word.end:.2f}s]")
Example Output:
Text: It is certain that Jack Pumpkinhead might have had a much finer house to live in.
Language: en
Duration: 7.2562358276643995s
Task: None
'It' [0.00s - 0.36s]
'is' [0.42s - 0.47s]
'certain' [0.51s - 0.74s]
'that' [0.79s - 0.86s]
'Jack' [0.90s - 1.11s]
'Pumpkinhead' [1.15s - 1.66s]
'might' [1.81s - 2.00s]
'have' [2.04s - 2.13s]
'had' [2.16s - 2.26s]
'a' [2.30s - 2.32s]
'much' [2.36s - 2.48s]
'finer' [2.54s - 2.74s]
'house' [2.78s - 2.93s]
'to' [2.96s - 3.03s]
'live' [3.07s - 3.21s]
'in.' [3.26s - 7.27s]
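Word-level timestamps make it possible to measure the silence between words, for example to find natural break points. The sketch below is illustrative only: `find_pauses` and the `Word` tuple are hypothetical names, and `Word` is a stand-in that mirrors the `word`, `start`, and `end` attributes used in the example above rather than the SDK's actual response type.

```python
from collections import namedtuple

# Stand-in for the word objects in the example; the real response type may differ.
Word = namedtuple("Word", ["word", "start", "end"])


def find_pauses(words, min_gap=0.3):
    """Return (previous_word, next_word, gap_seconds) for gaps >= min_gap."""
    words = list(words)  # accept any iterable of word-like objects
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        # Gap between the end of one word and the start of the next.
        gap = nxt.start - prev.end
        if gap >= min_gap:
            pauses.append((prev.word, nxt.word, round(gap, 2)))
    return pauses


sample = [
    Word("Jack", 0.90, 1.11),
    Word("Pumpkinhead", 1.15, 1.66),
    Word("might", 1.81, 2.00),
    Word("have", 2.04, 2.13),
]
print(find_pauses(sample, min_gap=0.1))
```

With the four sample words taken from the output above and a 0.1 s threshold, only the gap between 'Pumpkinhead' and 'might' is reported. Assuming the response exposes the same attributes, `find_pauses(response.words)` would work the same way.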
JSON Format (Default)
Returns only the transcribed/translated text:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="json",
)

print(response.text)  # "Hello, this is a test recording."
Verbose JSON Format
Returns detailed information including timestamps:
response = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="segment",
)

# Access segments with timestamps
for segment in response.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")
Example Output:
[0.11s - 10.85s]: Call is now being recorded. Parker Scarves, how may I help you? Online for my wife, and it turns out they shipped the wrong... Oh, I am so sorry, sir. I got it for her birthday, which is tonight, and now I'm not 100% sure what I need to do. Okay, let me see if I can help. Do you have the item number of the Parker Scarves? I don't think so. Call the New Yorker, I... Excellent. What color do...
[10.88s - 21.73s]: Blue. The one they shipped was light blue. I wanted the darker one. What's the difference? The royal blue is a bit brighter. What zip code are you located in? One nine.
[22.04s - 32.62s]: Karen's Boutique, Termall. Is that close? I'm in my office. Okay, um, what is your name, sir? Charlie. Charlie Johnson. Is that J-O-H-N-S-O-N? And Mr. Johnson, do you have the Parker scarf in light blue with you now? I do. They shipped it to my office. It came in not that long ago. What I will do is make arrangements with Karen's Boutique for...
[32.62s - 41.03s]: you to Parker Scarf at no additional cost. And in addition, I was able to look up your order in our system, and I'm going to send out a special gift to you to make up for the inconvenience. Thank you. You're welcome. And thank you for calling Parker Scarf, and I hope your wife enjoys her birthday gift. Thank you. You're very welcome. Goodbye.
[43.50s - 44.20s]: you
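Segment timestamps map naturally onto subtitle cues. As a rough sketch, the helper below turns segments into SubRip (SRT) text. The names `srt_timestamp`, `segments_to_srt`, and the `Segment` tuple are hypothetical; `Segment` only mirrors the `start`, `end`, and `text` attributes used in the example above and is not the SDK's response type.

```python
from collections import namedtuple

# Stand-in for the segment objects in the example; the real type may differ.
Segment = namedtuple("Segment", ["start", "end", "text"])


def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (the SubRip timestamp layout)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text: numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)


print(segments_to_srt([Segment(43.50, 44.20, "you")]))
```

Assuming the response exposes the same attributes, `segments_to_srt(response.segments)` would produce one cue per segment. Long segments such as the ones above may need to be split further for readable subtitles.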