For models that support audio-driven generation (such as Wan 2.7 T2V), pass an audio file via the media.audio_inputs field. The model synchronizes the generated video to the audio, which is useful for lip sync, beat-matched motion, or narration-driven scenes.
import time
from together import Together

client = Together()

job = client.videos.create(
    prompt="A cartoon kitten general in golden armor stands on a cliff, commanding an army",
    model="Wan-AI/wan2.7-t2v",
    resolution="720P",
    ratio="16:9",
    seconds="10",
    media={
        "audio_inputs": ["https://download.samplelib.com/mp3/sample-3s.mp3"]
    },
)

print(f"Job ID: {job.id}")

# Poll until completion
while True:
    status = client.videos.retrieve(job.id)
    print(f"Status: {status.status}")

    if status.status == "completed":
        print(f"Video URL: {status.outputs.video_url}")
        break
    elif status.status == "failed":
        print("Video generation failed")
        break

    time.sleep(60)
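In production, an unbounded `while True` loop can hang indefinitely if a job stalls. The polling above can be wrapped in a small helper with a deadline; this is a sketch, not part of the Together SDK, and it assumes only a callable that returns the job's status object (e.g. `lambda: client.videos.retrieve(job.id)`):

```python
import time

def wait_for_video(fetch_status, poll_interval=10, timeout=1800):
    """Poll fetch_status() until the job completes, fails, or the timeout expires.

    fetch_status: any zero-argument callable returning an object with a
    .status attribute ("completed", "failed", or an in-progress value).
    Returns the final status object on success.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.status == "completed":
            return status
        if status.status == "failed":
            raise RuntimeError("Video generation failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"Job did not finish within {timeout} seconds")
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted while the job runs.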
If no audio is provided, the model automatically generates matching background music or sound effects based on the video content.
Audio constraints: WAV or MP3 format, 3 to 30 seconds, up to 15 MB. Audio longer than the video is truncated; audio shorter than the video leaves the remaining portion silent.
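A local audio file can be pre-checked against these limits before submitting a job, which fails faster than waiting for the API to reject it. A minimal sketch using only the standard library: duration is verified for WAV via the `wave` module, while for MP3 only the extension and size are checked (decoding MP3 duration would need a third-party library).

```python
import os
import wave

MAX_BYTES = 15 * 1024 * 1024   # documented 15 MB limit
MIN_SECONDS, MAX_SECONDS = 3, 30  # documented duration range

def check_audio(path):
    """Validate a local WAV/MP3 file against the documented audio constraints."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".wav", ".mp3"):
        raise ValueError("Audio must be WAV or MP3")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("Audio exceeds 15 MB")
    if ext == ".wav":
        # Duration check is only possible for WAV with the stdlib.
        with wave.open(path, "rb") as w:
            seconds = w.getnframes() / w.getframerate()
        if not (MIN_SECONDS <= seconds <= MAX_SECONDS):
            raise ValueError(f"Audio is {seconds:.1f}s; must be 3-30s")
    return True
```

Remote URLs (like the sample MP3 above) would need to be downloaded first, or checked via a HEAD request's `Content-Length`, before this kind of validation applies.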