- Deploy custom inference, data processing jobs, or long-running workers
- Scale workloads automatically based on demand, including down to zero
- Run queue-based or asynchronous jobs with built-in request handling
- Securely manage secrets, environment variables, and configuration
- Scale from a single replica to thousands of GPUs as traffic grows
Platform Components
Jig – Deployment CLI
A lightweight CLI for building, pushing, and deploying containers. Jig handles:

- Dockerfile generation from `pyproject.toml`
- Image building and pushing to Together’s registry
- Deployment creation and updates
- Secrets and volume management
- Log streaming and status monitoring
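A typical workflow looks roughly like the following; the commands are the ones referenced elsewhere on this page, though exact flags may differ in your version of the CLI:

```bash
# Build, push, and deploy from the directory containing your pyproject.toml
together beta jig deploy

# Check deployment and queue state
together beta jig status
together beta jig queue_status

# Stream worker logs
together beta jig logs --follow
```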
Sprocket – Worker SDK
A Python SDK for building inference workers that integrate with Together’s job queue:

- Implement `setup()` and `predict(args) -> dict`
- Automatic file download and upload handling
- Progress reporting for long-running jobs
- Health checks and metrics endpoints
- Graceful shutdown support
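As an illustration, a worker implementing that contract might look like the sketch below. The `setup()` / `predict()` method names come from this page; the class shape, the model used, and how Sprocket discovers and runs the worker are assumptions for illustration only.

```python
# Illustrative worker sketch -- not the official Sprocket template.
# setup()/predict() follow the contract described above; everything else is assumed.
import torch
from transformers import pipeline


class SentimentWorker:
    def setup(self):
        # Runs once at startup: load the model onto the GPU if one is available.
        device = 0 if torch.cuda.is_available() else -1
        self.pipe = pipeline("sentiment-analysis", device=device)

    def predict(self, args: dict) -> dict:
        # Runs once per queued job; the returned dict becomes the job result.
        result = self.pipe(args["text"])[0]
        return {"label": result["label"], "score": float(result["score"])}
```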
Container Registry
A Together-hosted Docker registry at `registry.together.xyz` for storing your container images. Images are private to your organization and referenced by digest for reproducible deployments.
Available Hardware
Choose from high-performance NVIDIA GPU configurations:

| GPU Type | `gpu_type` value | Memory | Use Case |
|---|---|---|---|
| NVIDIA H100 SXM | `h100-80gb` | 80GB | Large models, high throughput |
| CPU-only | `none` | — | Lightweight preprocessing or embedding models |

For multi-GPU deployments, set `gpu_count` in your deployment and use `torchrun` for distributed inference.
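For example, a container start command for a 2-GPU deployment might launch one process per GPU with `torchrun`; the `serve.py` entrypoint here is hypothetical:

```bash
# Hypothetical start command: spawn one inference process per GPU on this replica.
torchrun --nproc_per_node=2 serve.py
```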
When to Use Dedicated Containers
Dedicated Containers are appropriate when:

- You have a custom model or inference stack – Custom architectures, fine-tuned models, or proprietary inference code
- You’ve modified open-source engines – Customized vLLM, SGLang, or other serving frameworks
- You’re running media generation – Audio, image, or video models with variable execution times
- You need async or batch processing – Long-running jobs that don’t fit the request-response pattern
- You want full control – Specific library versions, custom preprocessing, or non-standard runtimes
How It Works
- Package your model as a Docker container – Create a container with your runtime, dependencies, and inference code. Use Sprocket for queue integration or bring your own HTTP server.
- Configure your deployment – Define GPU type, replica limits, autoscaling behavior, and environment variables in `pyproject.toml` (see the sketch after this list).
- Deploy to Together – Run `together beta jig deploy` to build, push, and create your deployment. Together provisions GPUs and starts your containers.
- Submit jobs – Use the Queue API to submit jobs. Workers pull jobs from the queue, execute inference, and report results.
- Monitor and scale – View logs, metrics, and job status. The autoscaler adjusts replica count based on queue depth.
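A minimal sketch of what such a configuration might look like is below. The field names `gpu_type`, `gpu_count`, `min_replicas`, and `max_replicas` appear elsewhere on this page, but the TOML table names and overall layout are assumptions; check the Jig reference for the real schema.

```toml
# Hypothetical deployment configuration in pyproject.toml -- table names are illustrative.
[tool.jig.deployment]
gpu_type = "h100-80gb"    # see the hardware table above
gpu_count = 1
min_replicas = 0          # allow scale-to-zero when idle
max_replicas = 8

[tool.jig.deployment.env]
MODEL_NAME = "my-model"   # example environment variable
```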
Monitoring and Observability
Metrics
Each Sprocket worker exposes a `/metrics` endpoint with Prometheus-compatible metrics.
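To see what a worker reports, you can scrape the endpoint directly; the local port below is an assumption, so substitute whatever port your worker actually serves on:

```python
# Quick check of a worker's Prometheus metrics from inside the container.
# The port is an assumption; use the one your worker listens on.
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
print(resp.text)  # Prometheus text exposition format
```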
Logging
Access deployment logs via the Jig CLI:
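These are the log commands referenced elsewhere on this page:

```bash
# Recent logs for the deployment
together beta jig logs

# Follow logs in real time
together beta jig logs --follow
```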
Health Checks
The platform monitors your deployment’s `/health` endpoint. Ensure it:
- Returns 200 when ready to accept jobs
- Returns 503 during startup or when unhealthy
- Responds within a reasonable timeout
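Sprocket workers expose a health endpoint for you; if you bring your own HTTP server instead, a minimal endpoint satisfying these rules might look like this sketch (FastAPI is just an illustrative choice):

```python
# Minimal /health endpoint sketch for a bring-your-own-server container.
from fastapi import FastAPI, Response

app = FastAPI()
model_ready = False  # set to True once model loading/setup has finished


@app.get("/health")
def health(response: Response):
    if model_ready:
        return {"status": "ok"}      # 200: ready to accept jobs
    response.status_code = 503       # 503: starting up or unhealthy
    return {"status": "starting"}
```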
Autoscaling
Configuration
Enable autoscaling in your `pyproject.toml`:
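A sketch of what this might look like, assuming a hypothetical `[tool.jig.autoscaling]` table; the profile name, `targetValue`, and replica fields are the ones documented on this page:

```toml
# Hypothetical autoscaling configuration -- the table name is illustrative.
[tool.jig.autoscaling]
profile = "QueueBacklogPerWorker"  # scale on queue depth per worker
targetValue = "1.05"               # recommended: ~5% headroom
min_replicas = 0                   # allow scale-to-zero when the queue is empty
max_replicas = 8
```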
Profiles
QueueBacklogPerWorker – Scales based on queue depth relative to worker count.

- `targetValue = "1.0"` – Exact match (queue_depth = workers)
- `targetValue = "1.05"` – 5% overprovisioning (recommended)
- `targetValue = "0.9"` – Aggressive scaling (more workers than needed)

The desired replica count is computed as `desired_replicas = queue_depth / targetValue`.
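For example, with `targetValue = "1.05"` and 21 jobs in the queue, the autoscaler targets 21 / 1.05 = 20 replicas.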
Scaling Behavior
- Scale Up: When queue backlog exceeds target, new replicas are added
- Scale Down: When workers are idle, replicas are removed (respecting `min_replicas`)
- Graceful Shutdown: Workers complete their current job before terminating
Troubleshooting
Common Issues
Container fails to start
Symptoms: Deployment status shows “failed” or “error”
Check:

- View logs: `together beta jig logs`
- Verify the health endpoint works locally
- Check for missing environment variables
- Ensure sufficient memory is allocated

Jobs are not being picked up from the queue
Check:

- Deployment status: `together beta jig status`
- Queue status: `together beta jig queue_status`
- Worker logs for errors: `together beta jig logs --follow`
- Verify the `--queue` flag in the startup command

Out-of-memory errors
Check:

- Increase `memory` in the deployment config
- Use `device_map="auto"` for large models (see the sketch after this list)
- Enable gradient checkpointing if training
- Reduce batch size

Slow startup or failed health checks
Check:

- Use volumes for model weights (faster than downloading)
- Pre-download models in the Dockerfile
- Increase the health check timeout

torch.cuda.is_available() returns False
Check:

- Verify `gpu_count >= 1` in the config
- Check CUDA compatibility with the base image
- Ensure PyTorch is installed with CUDA support
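As a sketch of the `device_map="auto"` suggestion above, loading a large model with Hugging Face Transformers and Accelerate looks roughly like this; the model name is only an example:

```python
# Shard a large model across available GPU memory (and CPU if needed)
# instead of loading everything onto a single device.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"  # example model; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let Accelerate place layers across GPUs/CPU
    torch_dtype="auto",  # use the checkpoint's native precision
)
```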
Debug Mode
Enable debug logging to get more detailed output from your workers when diagnosing issues.
Getting Help

- View deployment status: `together beta jig status`
- Check queue: `together beta jig queue_status`
- Stream logs: `together beta jig logs --follow`
- Contact support with your deployment name and request IDs
FAQs
General

Q: What’s the difference between Sprocket and a regular HTTP server?
A: Sprocket integrates with Together’s managed job queue, providing automatic job distribution, status reporting, file handling, and graceful shutdown. Use Sprocket for batch/async workloads; use a regular HTTP server for low-latency request-response APIs.

Q: Can I use my own Dockerfile?
A: Yes. Set `dockerfile = "Dockerfile"` in your config and Jig will use your custom Dockerfile instead of generating one.
Q: How do I handle large model weights?
A: Use volumes (together beta jig volumes create) to upload weights once, then mount them at runtime. This is faster than including weights in the container image.
Scaling
Q: How does autoscaling work?
A: The autoscaler monitors queue depth and worker utilization. When queue backlog grows, it adds replicas. When workers are idle, it removes them (down to min_replicas).
Q: What’s the maximum number of replicas?
A: Set max_replicas in your config. The actual limit depends on your Together organization’s quota.
Q: How long does scaling take?
A: New replicas typically start within 1-2 minutes, depending on image size and model loading time.
Jobs
Q: How long can a job run?
A: The default timeout is 5 minutes (`TERMINATION_GRACE_PERIOD_SECONDS`, 300 seconds). For longer jobs, increase this value in your deployment configuration.
Q: What happens if a job fails?
A: The job status is set to “failed” with error details. The worker remains healthy and continues processing other jobs.
Q: Can I retry failed jobs?
A: Resubmit the job with the same payload. Automatic retry is not currently supported.
Billing
Q: How am I billed?
A: You’re billed for GPU-hours while replicas are running. Scale to zero (min_replicas = 0) when not in use to minimize costs.
Q: Are there costs for the queue?
A: Queue usage is included. You’re only billed for compute (running replicas).