POST /audio/speech
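from together import Together
import os

# Read the API key from the environment rather than hard-coding it.
client = Together(
    api_key=os.environ.get("TOGETHER_API_KEY"),
)

# Generate speech for the input text with the chosen model and voice.
response = client.audio.speech.create(
    model="cartesia/sonic-2",
    input="The quick brown fox jumps over the lazy dog.",
    voice="laidback woman",
)

# Write the returned audio to a local file.
response.stream_to_file("audio.wav")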

Authorizations

Authorization
string
header
default:default
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
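As a minimal sketch of the header described above, the following builds the request headers for a raw HTTP call. The helper name `build_auth_headers` is hypothetical, and the `Content-Type` value is taken from the `application/json` body type documented for this endpoint.

```python
import os


def build_auth_headers(token: str) -> dict:
    """Return HTTP headers carrying the bearer token (hypothetical helper)."""
    if not token:
        raise ValueError("an auth token is required")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


# The fallback value is a placeholder for illustration, not a real token.
headers = build_auth_headers(os.environ.get("TOGETHER_API_KEY", "example-token"))
```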

Body

application/json
model
required

The name of the model to query. The currently supported TTS models are cartesia/sonic, hexgrad/Kokoro-82M, and canopylabs/orpheus-3b-0.1-ft.

Available options:
cartesia/sonic,
hexgrad/Kokoro-82M,
canopylabs/orpheus-3b-0.1-ft
Example:

"canopylabs/orpheus-3b-0.1-ft"

input
string
required

The input text to generate audio for.

voice
string
required

The voice to use for generating the audio. The supported voices differ by model. For example, canopylabs/orpheus-3b-0.1-ft supports tara, hexgrad/Kokoro-82M supports af_alloy, and cartesia/sonic supports "friendly sidekick". To list the voices supported by a model, call the /v1/voices endpoint with the model name as a query parameter.
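A rough sketch of building the voices lookup URL is shown here. The base URL `https://api.together.xyz` and the query parameter name `model` are assumptions not stated in this section, and `voices_url` is a hypothetical helper; check both against the voices endpoint reference before relying on them.

```python
from urllib.parse import urlencode

# Assumption: the API is served from this host.
BASE_URL = "https://api.together.xyz"


def voices_url(model: str) -> str:
    """Build the /v1/voices URL for one model (hypothetical helper).

    Assumes the query parameter is named "model".
    """
    return f"{BASE_URL}/v1/voices?{urlencode({'model': model})}"


# Model names contain a slash, which urlencode percent-encodes.
url = voices_url("hexgrad/Kokoro-82M")
```

The resulting URL would be requested with GET, using the same Authorization header as the speech endpoint.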

response_format
enum<string>
default:wav

The format of the audio output. When stream is false, the supported formats are mp3, wav, and raw. When stream is true, the only supported format is raw.

Available options:
mp3,
wav,
raw
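The interaction between response_format and stream can be checked on the client before sending a request. This is a minimal sketch; `check_response_format` is a hypothetical helper, not part of any SDK.

```python
SUPPORTED_FORMATS = {"mp3", "wav", "raw"}


def check_response_format(response_format: str = "wav", stream: bool = False) -> str:
    """Validate response_format against the documented rules (hypothetical helper)."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    # Streaming responses only support the raw format.
    if stream and response_format != "raw":
        raise ValueError("when stream is true, response_format must be 'raw'")
    return response_format
```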
language
enum<string>
default:en

Language of input text.

Available options:
en,
de,
fr,
es,
hi,
it,
ja,
ko,
nl,
pl,
pt,
ru,
sv,
tr,
zh
response_encoding
enum<string>
default:pcm_f32le

The audio encoding of the response.

Available options:
pcm_f32le,
pcm_s16le,
pcm_mulaw,
pcm_alaw
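For illustration, the two linear PCM encodings can be decoded with the standard library. This sketch assumes the bytes are headerless raw samples (for example, response_format set to raw); `decode_pcm` is a hypothetical helper, and the companded pcm_mulaw and pcm_alaw encodings are deliberately left out.

```python
import struct


def decode_pcm(data: bytes, encoding: str) -> list[float]:
    """Decode headerless PCM bytes into floats (hypothetical helper)."""
    if encoding == "pcm_f32le":
        # 32-bit little-endian floats, 4 bytes per sample.
        count = len(data) // 4
        return list(struct.unpack(f"<{count}f", data[: count * 4]))
    if encoding == "pcm_s16le":
        # 16-bit little-endian signed integers, scaled to [-1.0, 1.0).
        count = len(data) // 2
        samples = struct.unpack(f"<{count}h", data[: count * 2])
        return [s / 32768.0 for s in samples]
    raise ValueError(f"decoding not implemented for {encoding!r}")
```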
sample_rate
number
default:44100

The sampling rate, in Hz, to use for the output audio. The default is 24000 for canopylabs/orpheus-3b-0.1-ft and hexgrad/Kokoro-82M, and 44100 for cartesia/sonic.
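The per-model defaults make it possible to estimate how long a raw response plays for. The sketch below assumes mono, headerless audio, which this section does not confirm; `raw_duration_seconds` is a hypothetical helper.

```python
# Per-model defaults as documented for sample_rate.
DEFAULT_SAMPLE_RATES = {
    "canopylabs/orpheus-3b-0.1-ft": 24000,
    "hexgrad/Kokoro-82M": 24000,
    "cartesia/sonic": 44100,
}

# Size of one sample for each response_encoding option.
BYTES_PER_SAMPLE = {
    "pcm_f32le": 4,
    "pcm_s16le": 2,
    "pcm_mulaw": 1,
    "pcm_alaw": 1,
}


def raw_duration_seconds(num_bytes, model, encoding="pcm_f32le", sample_rate=None):
    """Estimate playback time of raw audio (assumes mono, no header)."""
    rate = sample_rate or DEFAULT_SAMPLE_RATES[model]
    return num_bytes / (BYTES_PER_SAMPLE[encoding] * rate)
```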

stream
boolean
default:false

If true, the output is streamed several characters at a time instead of being returned only once the full response is ready. The stream terminates with data: [DONE]. If false, the encoded audio is returned as an octet stream.
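A client reading the stream needs to stop at the data: [DONE] marker. The following is a minimal sketch that works on already-decoded text lines; the structure of each event payload is not described in this section, so payloads are passed through as opaque strings. `iter_stream_payloads` is a hypothetical helper.

```python
def iter_stream_payloads(lines):
    """Yield the payload of each "data:" line until the [DONE] marker."""
    for line in lines:
        line = line.strip()
        # Skip blank separators and any non-data lines.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield payload


# Illustrative input only; real payloads will differ.
example = ["data: chunk-1", "", "data: chunk-2", "data: [DONE]"]
payloads = list(iter_stream_payloads(example))
```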

Response

OK

The response is of type file.