This example uses torchrun for distributed inference across multiple GPUs and deploys a Wan 2.1 text-to-video model to Together’s managed infrastructure.
What You’ll Learn
- Deploying multi-GPU models with Sprocket and Jig
- Using `use_torchrun=True` for distributed inference
- Automatic file upload with `FileOutput`
- Submitting jobs via the Queue API and polling for results
Prerequisites
- Together API Key – Get one from together.ai
- Dedicated Containers access – Contact [email protected] to enable for your organization
- Docker – For building container images. Install Docker
- Together CLI – Install with `pip install together --upgrade` or `uv tool install together`
Overview
This example deploys a Wan 2.1 text-to-video model as a Dedicated Container with multi-GPU support. The Sprocket worker handles distributed inference across 2 GPUs, and Together manages provisioning, autoscaling, and observability.
Output specs:
- Resolution: 480×832
- Frames: 81 (5.4 seconds at 15fps)
- Format: MP4
Why two GPUs:
- Video generation requires significant VRAM for temporal attention
- Context parallelism splits the sequence dimension across GPUs
- 2x H100 allows comfortable generation without memory pressure
How It Works
- Build – Jig builds a Docker image from your `pyproject.toml` configuration
- Push – The image is pushed to Together’s private container registry
- Deploy – Together provisions 2x H100 GPUs and starts your container
- Torchrun – Sprocket’s `use_torchrun=True` launches child processes (one per GPU)
- Queue – Jobs are submitted to the managed queue, broadcast to all GPU ranks, and processed in parallel
Project Structure
Implementation
Sprocket Worker Code
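The full worker code is not reproduced here, so the following is a minimal sketch of its shape, assuming a Sprocket worker with the `setup()`/`predict()` hooks described under Key Concepts. The `sprocket` module name, the `Worker` base class, the `FileOutput` import path, and the `load_wan_pipeline` helper are assumptions for illustration, not the SDK's confirmed API.

```python
# Hedged sketch of a Sprocket worker; the sprocket imports, the Worker base
# class, and load_wan_pipeline are assumptions, not the SDK's confirmed API.
import os

import torch
import torch.distributed as dist
from diffusers.utils import export_to_video

import sprocket                      # assumed SDK module name
from sprocket import FileOutput      # assumed import path

from pipeline import load_wan_pipeline  # hypothetical helper that builds the
                                         # context-parallel Wan 2.1 pipeline


class WanVideoWorker(sprocket.Worker):
    def setup(self):
        # Runs once per rank at startup: join the process group that torchrun
        # set up, then load the model for this GPU.
        dist.init_process_group(backend="nccl")
        self.local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(self.local_rank)
        self.pipeline = load_wan_pipeline(device=f"cuda:{self.local_rank}")

    def predict(self, prompt: str, num_inference_steps: int = 30):
        # Every rank runs the forward pass; NCCL keeps the ranks in sync.
        frames = self.pipeline(
            prompt,
            num_inference_steps=num_inference_steps,
            height=480,
            width=832,
            num_frames=81,
        )
        # Only rank 0 encodes the MP4 and returns a result.
        if dist.get_rank() != 0:
            return None
        output_path = "/tmp/output.mp4"
        export_to_video(frames, output_path, fps=15)
        return {"url": FileOutput(output_path)}


if __name__ == "__main__":
    # use_torchrun=True makes Sprocket launch one child process per GPU.
    sprocket.run(WanVideoWorker, use_torchrun=True)
```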
Configuration
Key Concepts
How `use_torchrun=True` Works
When you call `sprocket.run(..., use_torchrun=True)`, Sprocket handles multi-GPU orchestration automatically.
Flow:
- Parent process receives a job from Together’s queue
- Job payload is broadcast to all child processes via Unix socket
- Each rank executes `setup()` once at startup, then `predict()` for each job
- Ranks synchronize via NCCL during the forward pass
- Only rank 0 saves output and returns the result
- Parent uploads the `FileOutput` and reports job completion
Distributed Process Initialization
Each worker process must initialize its distributed context before loading the model. The process group is initialized from the environment set up by torchrun, which sets RANK, LOCAL_RANK, WORLD_SIZE, and other environment variables.
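A minimal sketch of that initialization, using the standard torch.distributed env:// rendezvous (the helper name is ours, not part of the SDK):

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    """Join the process group using the environment variables set by torchrun."""
    # The default env:// init method reads RANK, WORLD_SIZE, MASTER_ADDR, and
    # MASTER_PORT from the environment that torchrun exports.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # pin this process to its own GPU
    return local_rank
```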
Rank 0 Output Pattern
In distributed inference, only rank 0 should handle I/O and return results (see the sketch after this list):
- Avoids duplicate file writes
- Reduces memory on non-rank-0 GPUs (tensor output vs PIL)
- Sprocket collects output from rank 0 only
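A short sketch of that pattern; the frames argument and default path are illustrative, and `export_to_video` is the diffusers helper for writing MP4s:

```python
import torch.distributed as dist
from diffusers.utils import export_to_video


def save_if_rank_zero(frames, output_path: str = "/tmp/output.mp4", fps: int = 15):
    """Encode and save the video on rank 0 only; other ranks return None."""
    if dist.get_rank() != 0:
        return None  # non-zero ranks skip decoding/encoding entirely
    export_to_video(frames, output_path, fps=fps)
    return output_path
```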
Automatic File Upload with FileOutput
Wrapping a path in `FileOutput` triggers automatic upload, as sketched below:
- Sprocket detects the `FileOutput` in the response
- Uploads the file to Together’s storage
- Replaces the `FileOutput` with the public URL in the final response
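For example, a `predict()` method might return the local path wrapped in `FileOutput` (the import path and the `generate` helper are assumptions):

```python
from sprocket import FileOutput  # assumed import path for the Sprocket SDK


def predict(self, prompt: str) -> dict:
    output_path = self.generate(prompt)  # hypothetical method returning a local file path
    # Wrapping the path tells Sprocket to upload the file and substitute a
    # public URL for this value in the final job response.
    return {"url": FileOutput(output_path)}
```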
Multi-GPU Configuration
For multi-GPU deployments, configure `gpu_count` in your deployment settings and use torchrun in your startup command. When you pass `use_torchrun=True` to `sprocket.run()`, Sprocket handles the coordination between the parent process and GPU workers automatically.
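In code terms this is the entry-point line from the worker sketch above; the number of child processes follows the `gpu_count` set in your deployment settings:

```python
# One worker process is launched and coordinated per provisioned GPU.
sprocket.run(WanVideoWorker, use_torchrun=True)
```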
Deployment
Deploy
Check Deployment Status
Wait until the deployment status is `running` and replicas are ready before submitting jobs.
Submit Jobs
Jobs are submitted to the managed queue and processed asynchronously. Video generation typically takes 30-75 seconds depending on settings.
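As a rough illustration of the submit-and-poll loop (the endpoint paths, payload shape, and response fields below are placeholders, not the documented Queue API; see the Deployments API Reference for the real routes):

```python
import os
import time

import requests

API_BASE = "https://api.together.xyz"   # base URL; the queue routes below are illustrative
DEPLOYMENT_ID = "your-deployment-id"    # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

# Submit a job to the managed queue.
job = requests.post(
    f"{API_BASE}/v1/deployments/{DEPLOYMENT_ID}/jobs",
    headers=HEADERS,
    json={"prompt": "A red fox running through a snowy forest",
          "num_inference_steps": 30},
).json()

# Poll until the job finishes; video generation takes roughly 30-75 seconds.
while True:
    status = requests.get(
        f"{API_BASE}/v1/deployments/{DEPLOYMENT_ID}/jobs/{job['id']}",
        headers=HEADERS,
    ).json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(5)

print(status.get("url"))  # public URL of the generated MP4 on success
```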
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | Required | Text description of the video to generate |
| `num_inference_steps` | int | 30 | Number of denoising steps (higher = better quality, slower) |
Output
When the job completes, the status response contains:
- `url`: Public URL to the generated MP4 video file (480×832, 81 frames, 15fps)
Scaling to More GPUs
To scale for higher throughput, increase `max_replicas` to add more workers. Setting `min_replicas = 0` scales the deployment to zero when idle (saves costs but adds cold start latency).
Cleanup
When you’re done, delete the deployment.
Next Steps
- Image Generation Example – Single-GPU inference with Flux2
- Quickstart – Deploy your first container in 20 minutes
- Sprocket SDK – Full SDK reference for workers
- Jig CLI Reference – CLI commands and configuration options
- Deployments API Reference – REST API for deployments, secrets, storage, and queues