Dedicated Containers provide a flexible way to run your own Dockerized workloads on managed GPU infrastructure. You supply the container image, and Together manages everything else—handling compute provisioning, autoscaling, networking, and observability for you. The platform is designed for teams that need full control over their runtime environment while avoiding the operational complexity of managing GPU clusters directly.
Looking for full example templates?
See our end-to-end deployment examples: Image Generation with Flux2 and Video Generation with Wan 2.1.
With Together Deployments, you can:
  • Deploy custom inference, data processing jobs, or long-running workers
  • Scale workloads automatically based on demand, including down to zero
  • Run queue-based or asynchronous jobs with built-in request handling
  • Securely manage secrets, environment variables, and configuration
  • Scale from a single replica to thousands of GPUs as traffic grows

Platform Components

Dedicated Containers Architecture
Dedicated Containers include three core components:

Jig – Deployment CLI

A lightweight CLI for building, pushing, and deploying containers. Jig handles:
  • Dockerfile generation from pyproject.toml
  • Image building and pushing to Together’s registry
  • Deployment creation and updates
  • Secrets and volume management
  • Log streaming and status monitoring
together beta jig deploy
See the Jig CLI docs →

Sprocket – Worker SDK

A Python SDK for building inference workers that integrate with Together’s job queue:
  • Implement setup() and predict(args) -> dict
  • Automatic file download and upload handling
  • Progress reporting for long-running jobs
  • Health checks and metrics endpoints
  • Graceful shutdown support
import sprocket


class MyModel(sprocket.Sprocket):
    def setup(self):
        # Runs once at startup; load_model() stands in for your own loading code.
        self.model = load_model()

    def predict(self, args: dict) -> dict:
        # Runs for each job pulled from the queue.
        result = self.model(args["input"])
        return {"output": result}


if __name__ == "__main__":
    sprocket.run(MyModel(), "my-org/my-model")
See the Sprocket SDK docs →

Container Registry

A Together-hosted Docker registry at registry.together.xyz for storing your container images. Images are private to your organization and referenced by digest for reproducible deployments.

Available Hardware

Choose from high-performance NVIDIA GPU configurations:
| GPU Type | gpu_type value | Memory | Use Case |
| --- | --- | --- | --- |
| NVIDIA H100 SXM | h100-80gb | 80GB | Large models, high throughput |
| CPU-only | none | N/A | Lightweight preprocessing or embedding models |
For models requiring multiple GPUs, configure gpu_count in your deployment and use torchrun for distributed inference.
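As an illustration, a worker launched with torchrun (which exports RANK, LOCAL_RANK, and WORLD_SIZE for each process) can bind each process to its own GPU. The sketch below is a generic PyTorch pattern, not a Together-specific API:

import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    # torchrun sets LOCAL_RANK per process; pin each process to its own GPU.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    # NCCL is the usual backend for multi-GPU inference on NVIDIA hardware.
    dist.init_process_group(backend="nccl")
    return local_rank


if __name__ == "__main__":
    rank = init_distributed()
    print(f"process {dist.get_rank()} of {dist.get_world_size()} on cuda:{rank}")

Launch it with torchrun --nproc_per_node set to match your gpu_count so one process starts per GPU.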

When to Use Dedicated Containers

Dedicated Containers are appropriate when:
  • You have a custom model or inference stack – Custom architectures, fine-tuned models, or proprietary inference code
  • You’ve modified open-source engines – Customized vLLM, SGLang, or other serving frameworks
  • You’re running media generation – Audio, image, or video models with variable execution times
  • You need async or batch processing – Long-running jobs that don’t fit the request-response pattern
  • You want full control – Specific library versions, custom preprocessing, or non-standard runtimes

How It Works

  1. Package your model as a Docker container. Create a container with your runtime, dependencies, and inference code. Use Sprocket for queue integration or bring your own HTTP server.
  2. Configure your deployment. Define GPU type, replica limits, autoscaling behavior, and environment variables in pyproject.toml.
  3. Deploy to Together. Run together beta jig deploy to build, push, and create your deployment. Together provisions GPUs and starts your containers.
  4. Submit jobs. Use the Queue API to submit jobs. Workers pull jobs from the queue, execute inference, and report results.
  5. Monitor and scale. View logs, metrics, and job status. The autoscaler adjusts replica count based on queue depth.
Ready to deploy? Follow the Quickstart guide for a step-by-step walkthrough, or explore the Jig CLI, Sprocket SDK, and Queue API docs.

Monitoring and Observability

Metrics

Each Sprocket worker exposes a /metrics endpoint with Prometheus-compatible metrics:
requests_inflight 1.0
The autoscaler uses this metric combined with queue depth to make scaling decisions.
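To see what the autoscaler sees, you can scrape the endpoint yourself. The sketch below assumes the worker listens locally on port 8000 (a placeholder; use whatever port your container exposes) and parses the plain Prometheus text format:

import urllib.request

# Hypothetical local port; adjust to the port your worker actually listens on.
METRICS_URL = "http://localhost:8000/metrics"

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

# Prometheus text format: one "name value" pair per line; "#" lines are comments.
for line in body.splitlines():
    if line.startswith("requests_inflight"):
        name, value = line.split()[:2]
        print(f"{name} = {float(value)}")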

Logging

Access deployment logs via:
together beta jig logs
together beta jig logs --follow
Structured Logging in Your Application
Use Python’s logging module for structured output:
import logging
import sprocket

logging.basicConfig(
    level=logging.INFO,
    format="{levelname} {module}:{lineno}: {message}",
    style="{",
)
logger = logging.getLogger(__name__)


class MyModel(sprocket.Sprocket):
    def setup(self):
        logger.info("Loading model...")
        self.model = load_model()
        logger.info("Model loaded successfully")

    def predict(self, args):
        logger.info(
            f"Processing job with prompt: {args.get('prompt', '')[:50]}..."
        )
        # ...

Health Checks

The platform monitors your deployment’s /health endpoint; a minimal handler sketch follows the checklist below. Ensure it:
  • Returns 200 when ready to accept jobs
  • Returns 503 during startup or when unhealthy
  • Responds within a reasonable timeout
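If you use Sprocket, health checks are handled for you. If you bring your own HTTP server, a handler along these lines satisfies the contract above; this is a standard-library sketch, and the port and readiness flag are placeholders for your own server setup:

from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_READY = False  # flip to True once your model has finished loading


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        # 200 when ready to accept jobs, 503 during startup or when unhealthy.
        self.send_response(200 if MODEL_READY else 503)
        self.end_headers()


if __name__ == "__main__":
    # Port 8080 is an arbitrary example; expose whichever port your deployment uses.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()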

Autoscaling

Configuration

Enable autoscaling in your pyproject.toml:
[tool.jig.deploy]
min_replicas = 1
max_replicas = 20

[tool.jig.autoscaling]
profile = "QueueBacklogPerWorker"
targetValue = "1.05"

Profiles

QueueBacklogPerWorker
Scales based on queue depth relative to worker count.
  • targetValue = "1.0" - Exact match (queue_depth = workers)
  • targetValue = "1.05" - Tolerates a ~5% backlog per worker before scaling up (recommended)
  • targetValue = "0.9" - Aggressive scaling (provisions more workers than the current backlog)
Formula: desired_replicas = queue_depth / targetValue
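A quick worked example of the formula: with targetValue = "1.05", a backlog of 21 queued jobs yields 20 desired replicas, while targetValue = "0.9" would ask for 24. The rounding and clamping to min_replicas/max_replicas below are assumptions about how the formula is likely applied, not documented behavior:

import math


def desired_replicas(queue_depth: int, target_value: float,
                     min_replicas: int, max_replicas: int) -> int:
    # desired_replicas = queue_depth / targetValue, clamped to the configured bounds.
    desired = math.ceil(queue_depth / target_value)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(21, 1.05, min_replicas=1, max_replicas=20))  # -> 20
print(desired_replicas(21, 0.90, min_replicas=1, max_replicas=20))  # -> 20 (24 before capping at max_replicas)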

Scaling Behavior

  1. Scale Up: When queue backlog exceeds target, new replicas are added
  2. Scale Down: When workers are idle, replicas are removed (respecting min_replicas)
  3. Graceful Shutdown: Workers complete current job before terminating
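Sprocket handles graceful shutdown for you. If you run your own worker loop, a common pattern is to catch SIGTERM, finish the in-flight job, and then exit. This is a generic sketch; pull_next_job and process are placeholders for your own queue-polling and inference code:

import signal
import time

shutting_down = False


def handle_sigterm(signum, frame):
    # Stop pulling new work, but let the current job finish.
    global shutting_down
    shutting_down = True


signal.signal(signal.SIGTERM, handle_sigterm)


def pull_next_job():
    # Placeholder: replace with your own queue-polling logic.
    return None


def process(job):
    # Placeholder: replace with your own inference call.
    pass


while not shutting_down:
    job = pull_next_job()
    if job is None:
        time.sleep(1)
        continue
    process(job)  # completes even if SIGTERM arrives mid-run; the loop exits afterwards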

Troubleshooting

Common Issues

Container fails to start
Symptoms: Deployment status shows “failed” or “error”
Check:
  1. View logs: together beta jig logs
  2. Verify health endpoint works locally
  3. Check for missing environment variables
  4. Ensure sufficient memory is allocated
Jobs stuck in pending
Symptoms: Jobs submitted but never processed
Check:
  1. Deployment status: together beta jig status
  2. Queue status: together beta jig queue_status
  3. Worker logs for errors: together beta jig logs --follow
  4. Verify --queue flag in startup command
Out of memory errors
Symptoms: Container killed, OOM in logs
Solutions:
  1. Increase memory in deployment config
  2. Use device_map="auto" for large models (see the sketch after this list)
  3. Enable gradient checkpointing if training
  4. Reduce batch size
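For the device_map="auto" suggestion above, a typical Hugging Face Transformers loading call looks like the sketch below. It assumes the accelerate package is installed, and the model ID is just a placeholder:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-model"  # placeholder model repository

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",          # shard/offload across available GPUs (requires accelerate)
    torch_dtype=torch.float16,  # half precision roughly halves weight memory
)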
Slow model loading
Symptoms: Long startup time, health check timeouts
Solutions:
  1. Use volumes for model weights (faster than downloading)
  2. Pre-download models in Dockerfile (see the sketch after this list)
  3. Increase health check timeout
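For pre-downloading, one option is a small script invoked from a RUN step in your Dockerfile so the weights ship inside the image. The sketch below uses huggingface_hub; the repository name and target directory are placeholders:

# download_weights.py - run during the image build so weights are baked into the image
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/your-model",       # placeholder model repository
    local_dir="/opt/models/your-model",  # fixed path your worker reads from at runtime
)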
GPU not detected
Symptoms: torch.cuda.is_available() returns False
Check (a diagnostic snippet follows the list):
  1. Verify gpu_count >= 1 in config
  2. Check CUDA compatibility with base image
  3. Ensure PyTorch is installed with CUDA support
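A quick diagnostic you can run inside the container (for example, logged at startup) to confirm the CUDA stack:

import torch

print("CUDA available:", torch.cuda.is_available())
print("PyTorch built with CUDA:", torch.version.cuda)  # None indicates a CPU-only build
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("Device 0:", torch.cuda.get_device_name(0))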

Debug Mode

Enable debug logging:
export TOGETHER_DEBUG=1
together beta jig deploy

Getting Help

  • View deployment status: together beta jig status
  • Check queue: together beta jig queue_status
  • Stream logs: together beta jig logs --follow
  • Contact support with your deployment name and request IDs

FAQs

General

Q: What’s the difference between Sprocket and a regular HTTP server?
A: Sprocket integrates with Together’s managed job queue, providing automatic job distribution, status reporting, file handling, and graceful shutdown. Use Sprocket for batch/async workloads; use a regular HTTP server for low-latency request-response APIs.

Q: Can I use my own Dockerfile?
A: Yes. Set dockerfile = "Dockerfile" in your config and jig will use your custom Dockerfile instead of generating one.

Q: How do I handle large model weights?
A: Use volumes (together beta jig volumes create) to upload weights once, then mount them at runtime. This is faster than including weights in the container image.

Scaling

Q: How does autoscaling work?
A: The autoscaler monitors queue depth and worker utilization. When the queue backlog grows, it adds replicas. When workers are idle, it removes them (down to min_replicas).

Q: What’s the maximum number of replicas?
A: Set max_replicas in your config. The actual limit depends on your Together organization’s quota.

Q: How long does scaling take?
A: New replicas typically start within 1-2 minutes, depending on image size and model loading time.

Jobs

Q: How long can a job run?
A: The default timeout is 5 minutes (TERMINATION_GRACE_PERIOD_SECONDS, default 300s). For longer jobs, increase this value in your deployment configuration.

Q: What happens if a job fails?
A: The job status is set to “failed” with error details. The worker remains healthy and continues processing other jobs.

Q: Can I retry failed jobs?
A: Resubmit the job with the same payload. Automatic retry is not currently supported.

Billing

Q: How am I billed?
A: You’re billed for GPU-hours while replicas are running. Scale to zero (min_replicas = 0) when not in use to minimize costs.

Q: Are there costs for the queue?
A: Queue usage is included. You’re only billed for compute (running replicas).