A high-level overview of video generation parameters and when to use them. For parameters tied to reference images, keyframes, audio input, or video editing, see Capability-specific parameters at the bottom. For the complete schema, including every supported field along with its types and ranges, see the video generation API reference.
Available parameters vary by model. Wan 2.7 models use resolution and ratio instead of width and height. Kling requires keyframe images via media.frame_images instead of a prompt. See the supported models table for per-model coverage.
Quick reference
Match the problem you’re solving to the parameter most likely to help.

- Video doesn’t match the prompt: Make the prompt more specific, add a negative_prompt for what to exclude, or raise guidance_scale toward 9-10.
- Output looks oversaturated or has weird motion: Lower guidance_scale to 6-7. Avoid values above 12.
- Poor visual quality: Raise steps to 30-40 for production runs. Diminishing returns past 50.
- Generation is too slow or expensive while iterating: Lower steps to 10 for quick previews, and shorten seconds.
- Need the same video every run (evals, regression tests): Set seed to a fixed integer.
- Wrong dimensions or aspect ratio: Set width and height explicitly. On Wan 2.7, set resolution and ratio instead.
- Output file is too large: Raise output_quality (higher number means more compression). Lower it for higher fidelity.
- Need consistent characters or style across the video: Pass media.reference_images. See Reference images and keyframes.
- Need to pin starting or ending frames: Pass media.frame_images. See Reference images and keyframes.
- Need lip sync or beat-matched motion: Pass media.audio_inputs. See Video audio input.

Prompting
prompt
A description of the video to generate. Required for every model except Kling. Maximum length is 32,000 characters. Be specific about subject, action, setting, camera movement, and pacing. Vague prompts produce generic motion. Include verbs and temporal cues (“slowly pans”, “the camera tracks left”) since video models are sensitive to motion language.
Typical default: required.
negative_prompt
A description of what to avoid in the generated video. Useful for excluding common artifacts. Set it when the model produces unwanted elements (extra limbs, flickering, watermarks). A reasonable starting point: "blurry, low quality, distorted, flickering".
Typical default: unset.
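As a rough illustration of the two prompt fields, the sketch below assembles them as a plain dictionary. The helper name build_prompt_fields and the example prompt text are hypothetical. Only the field names, the 32,000-character limit, and the suggested negative prompt come from this page. It assumes a model that requires a prompt, so it does not cover Kling.

```python
MAX_PROMPT_CHARS = 32_000  # maximum prompt length stated on this page


def build_prompt_fields(prompt, negative_prompt=None):
    """Return the prompt-related request fields (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    fields = {"prompt": prompt}
    # Leave negative_prompt out entirely when unset, matching its default.
    if negative_prompt:
        fields["negative_prompt"] = negative_prompt
    return fields


# Example text is illustrative: it names a subject, an action, and a camera move.
fields = build_prompt_fields(
    "A red kite rises over a windy beach at sunset, the camera slowly pans left",
    negative_prompt="blurry, low quality, distorted, flickering",
)
print(fields)
```

Checking the length locally is optional. It only avoids sending a request that is already known to exceed the documented limit.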
Output dimensions
width and height
The size of the generated video in pixels. Available combinations differ by model.
Typical default: 1366 x 768.
resolution
A resolution tier used by Wan 2.7 models in place of width and height. Accepts "720P" or "1080P".
Typical default: "1080P".
ratio
The aspect ratio used by Wan 2.7 models. Accepts "16:9", "9:16", "1:1", "4:3", or "3:4".
Typical default: "16:9".
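The split between the two sizing schemes can be sketched as a small helper that emits either width and height or resolution and ratio. The function name dimension_fields and the is_wan_27 flag are hypothetical. How you detect a Wan 2.7 model is left to the caller, and the accepted values are copied from the descriptions above.

```python
RESOLUTIONS = {"720P", "1080P"}
RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}


def dimension_fields(is_wan_27, width=1366, height=768,
                     resolution="1080P", ratio="16:9"):
    """Return the sizing fields for a request (hypothetical helper).

    is_wan_27 is supplied by the caller; this sketch does not inspect model names.
    """
    if is_wan_27:
        if resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
        if ratio not in RATIOS:
            raise ValueError(f"ratio must be one of {sorted(RATIOS)}")
        return {"resolution": resolution, "ratio": ratio}
    # Other models take explicit pixel dimensions; valid combinations differ by model.
    return {"width": width, "height": height}


print(dimension_fields(is_wan_27=False))
print(dimension_fields(is_wan_27=True, resolution="720P", ratio="9:16"))
```

Keeping the two schemes mutually exclusive in one place avoids sending width and height to a model that expects a resolution tier.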
Length and frame rate
seconds
Clip duration in seconds. Accepted range is "1" through "10". Passed as a string.
Longer clips cost more and take longer to generate. Use shorter clips while iterating on prompts and parameters.
Typical default: "6".
fps
Frames per second. Higher values produce smoother motion at the cost of generation time and file size.
Typical default: 24 (some models accept up to 60).
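Because seconds is passed as a string while fps is an integer, it is easy to mix the two up. The sketch below, with hypothetical helper names, converts an integer duration to the string form and gives a rough frame count as duration times frame rate. The exact number of frames a given model produces may differ from this estimate.

```python
def length_fields(seconds, fps=24):
    """Return the seconds and fps request fields (hypothetical helper)."""
    if not isinstance(seconds, int) or not 1 <= seconds <= 10:
        raise ValueError("seconds must be an integer from 1 to 10")
    # The API expects seconds as a string, for example "6".
    return {"seconds": str(seconds), "fps": fps}


def approx_frame_count(fields):
    """Rough estimate only: duration multiplied by frame rate."""
    return int(fields["seconds"]) * fields["fps"]


fields = length_fields(6)
print(fields, approx_frame_count(fields))
```

The estimate is mainly useful for comparing settings: halving seconds or fps roughly halves the number of frames to generate.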
Quality and speed
steps
The number of denoising steps. More steps generally improve visual quality and temporal consistency at a near-linear cost in latency. Past a model-specific point, additional steps stop helping. Lower it (10) for quick previews. Use 20 for a balanced default. Raise it (30-40) for production runs. Avoid values above 50. Range: 10-50.
Typical default: model-specific.
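The trade-off between step count and latency can be made concrete with a few named presets. The preset names, the choice of 35 as a midpoint of the 30-40 production range, and the strictly linear latency model are assumptions for illustration. The page only says the cost is near-linear.

```python
# Hypothetical presets based on the guidance above; 35 is an arbitrary pick in 30-40.
STEP_PRESETS = {"preview": 10, "balanced": 20, "production": 35}
MIN_STEPS, MAX_STEPS = 10, 50  # documented range


def clamp_steps(steps):
    """Keep a requested step count inside the documented 10-50 range."""
    return max(MIN_STEPS, min(MAX_STEPS, steps))


def relative_latency(steps, baseline=STEP_PRESETS["balanced"]):
    """Estimate latency relative to the baseline, assuming purely linear cost."""
    return steps / baseline


for stage, steps in STEP_PRESETS.items():
    print(f"{stage}: steps={steps}, ~{relative_latency(steps):.2f}x balanced latency")
```

Under this simplified model a preview run costs about half as much as a balanced run, which is why low step counts suit prompt iteration.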
guidance_scale
Controls how closely the video follows the prompt. Higher values make the model adhere more strictly to the text description. Lower values give the model more creative freedom. Affects both visual content and temporal consistency. Recommended range is 6.0-10.0. Values above 12 may cause over-guidance artifacts or unnatural motion.

- 6.0-7.0: More creative, less literal.
- 7.0-9.0: Sweet spot for most use cases.
- 9.0-10.0: Strict adherence to the prompt.

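The bands above can be expressed as a small lookup, sketched here with a hypothetical describe_guidance helper. The listed ranges share their endpoints, so this sketch assigns 7.0 and 9.0 to the middle band. That tie-break is an assumption, not something the page specifies.

```python
def describe_guidance(scale):
    """Map a guidance_scale value to the band described above (hypothetical helper)."""
    if scale > 12:
        return "above 12: risk of over-guidance artifacts or unnatural motion"
    if scale < 6.0 or scale > 10.0:
        return "outside the recommended 6.0-10.0 range"
    if scale < 7.0:
        return "more creative, less literal"
    if scale <= 9.0:
        # Assumption: shared endpoints 7.0 and 9.0 fall in the middle band.
        return "sweet spot for most use cases"
    return "strict adherence to the prompt"


for value in (6.5, 8.0, 9.5, 11.0, 13.0):
    print(value, "->", describe_guidance(value))
```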
Reproducibility
seed
An integer that fixes the random initialization. With the same seed, prompt, model, and parameters, the model returns the same video. Useful for reproducibility and for fair comparisons when tuning other parameters.
Typical default: unset (each call returns a new video).
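A fixed seed is most useful when exactly one other parameter changes between runs. The sketch below builds a set of request variants that share a seed and differ only in guidance_scale. The sweep helper, the base values, and the prompt text are hypothetical, and sending the requests is left out.

```python
# Hypothetical base request; only the field names come from this page.
BASE = {
    "prompt": "A paper boat drifts down a rain-filled gutter, the camera tracks left",
    "seed": 42,
    "steps": 20,
    "guidance_scale": 8.0,
}


def sweep(base, name, values):
    """Return copies of base that differ only in the parameter called name."""
    return [{**base, name: value} for value in values]


variants = sweep(BASE, "guidance_scale", [6.0, 8.0, 10.0])
for variant in variants:
    print(variant["seed"], variant["guidance_scale"])
```

Because every variant carries the same seed, differences between the resulting videos can be attributed to the swept parameter and not to random initialization.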
Output format
output_format
The encoded video format. Accepts "MP4" or "WEBM". MP4 is the most broadly compatible choice. WEBM produces smaller files but isn’t supported by every player.
Typical default: "MP4".
output_quality
Compression quality. Lower values produce higher fidelity and larger files. Higher values produce smaller files with more compression artifacts.
Typical default: 20.
Audio
generate_audio
Whether the model should generate audio for the video. Only applies to models that support audio generation.
Typical default: false.
Capability-specific parameters
| Parameter | Type | Description | Default or range |
|---|---|---|---|
| prompt | string | Text description of the video to generate. | Required |
| model | string | Model identifier. | Required |
| width | integer | Video width in pixels. | 1366 |
| height | integer | Video height in pixels. | 768 |
| seconds | string | Length of video (1-10). | "6" |
| fps | integer | Frames per second. | 15-60 |
| steps | integer | Diffusion steps (higher = better quality, slower). | 10-50 |
| guidance_scale | float | How closely to follow prompt. | 6.0-10.0 |
| seed | integer | Random seed for reproducibility. | any |
| output_format | string | Video format (MP4, WEBM). | MP4 |
| output_quality | integer | Bitrate/quality (lower = higher quality). | 20 |
| negative_prompt | string | What to avoid in generation. | - |
| frame_images | array | Keyframe images for video generation. If size 1, starting frame; if size 2, starting and ending frame; if more than 2, frame must be specified per image. Deprecated at the top level; use media.frame_images. | - |
| resolution | string | Video resolution tier (720P, 1080P). Used by Wan 2.7 models instead of width/height. | "1080P" |
| ratio | string | Aspect ratio (16:9, 9:16, 1:1, 4:3, 3:4). Used by Wan 2.7 models. | "16:9" |
| media | object | Media inputs for the request (see schema and compatibility below). | - |
The media object schema is documented in the next subsection.

- media.frame_images: Pin specific frames to known images (keyframes). See Reference images and keyframes.
- media.reference_images and media.reference_videos: Steer visual style with references that should appear consistently across the video. See Reference images and keyframes.
- media.audio_inputs: Drive generation with an audio file for lip sync, beat-matched motion, or narration. See Video audio input.
- media.source_video and media.frame_videos: Edit or extend an existing clip. Wan 2.7 specific. See the Wan 2.7 quickstart.

The top-level frame_images and reference_images parameters are deprecated. Use media.frame_images and media.reference_images instead.
media object schema
The media object is the unified way to pass images, videos, and audio into video generation requests.
| Field | Type | Description |
|---|---|---|
| frame_images | array | Keyframe images for I2V. Each item: {input_image, frame} where frame is "first" or "last". |
| frame_videos | array | Input video clips for video continuation (I2V). Each item: {video: "url"}. |
| reference_images | array | Reference images for character or object consistency (R2V) or visual guidance (Video Edit). |
| reference_videos | array | Reference videos for character or object consistency (R2V). Each item: {video: "url"}. |
| source_video | string | Source video URL to edit (Video Edit). |
| audio_inputs | array | Audio file URLs to drive generation (lip sync, beat-matched motion, etc.) for T2V and I2V. Each item: "url". WAV or MP3, 3-30s, up to 15 MB. |
Not all media fields are supported on every model. See the Wan 2.7 quickstart for field compatibility across Wan 2.7 models.
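To show how the schema fits together, the sketch below builds a media object with first and last keyframes plus one audio input, and checks the audio limits listed in the table. The helper names and the example.com URLs are hypothetical. It assumes that input_image accepts a URL string, that the audio format can be inferred from the file extension, and that 15 MB means 15 * 1024 * 1024 bytes. Whether a given model accepts this combination of fields depends on the model.

```python
from urllib.parse import urlparse

MAX_AUDIO_BYTES = 15 * 1024 * 1024  # assumption: "15 MB" read as 15 MiB
MIN_AUDIO_SECONDS, MAX_AUDIO_SECONDS = 3, 30


def keyframes(first_image, last_image=None):
    """Build media.frame_images items of the form {input_image, frame}."""
    items = [{"input_image": first_image, "frame": "first"}]
    if last_image is not None:
        items.append({"input_image": last_image, "frame": "last"})
    return items


def check_audio(url, duration_s, size_bytes):
    """Validate one audio input against the documented limits and return its URL."""
    path = urlparse(url).path.lower()
    # Heuristic: infer the format from the extension in the URL path.
    if not path.endswith((".wav", ".mp3")):
        raise ValueError("audio must be WAV or MP3")
    if not MIN_AUDIO_SECONDS <= duration_s <= MAX_AUDIO_SECONDS:
        raise ValueError("audio must be 3-30 seconds long")
    if size_bytes > MAX_AUDIO_BYTES:
        raise ValueError("audio must be at most 15 MB")
    return url


media = {
    "frame_images": keyframes(
        "https://example.com/first.png", "https://example.com/last.png"
    ),
    "audio_inputs": [
        check_audio("https://example.com/voice.wav", duration_s=12.0, size_bytes=2_000_000)
    ],
}
print(media)
```

The duration and file size have to be measured by the caller before the request is built, since a URL alone does not reveal them.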
See also
- Video generation overview: generate a video and poll for completion.
- Reference images and keyframes: guide visual style and pin specific frames.
- Video audio input: drive generation with an audio file.