What are GPU Clusters?
Together GPU Clusters provide on-demand access to high-performance GPU infrastructure for training, fine-tuning, and running large-scale AI workloads. Create clusters in minutes with features like real-time scaling, persistent storage, and support for both Kubernetes and Slurm workload managers.
Concepts
Kubernetes Cluster Architecture
Each GPU cluster is built on Kubernetes, providing a robust container orchestration platform. The architecture includes:
Control Plane – Manages cluster state, scheduling, and API access
Worker Nodes – GPU-equipped nodes that run your workloads
Networking – High-speed InfiniBand for multi-node communication
Storage Layer – Persistent volumes, local NVMe, and shared storage
You interact with the cluster using standard Kubernetes tools like kubectl, or through higher-level abstractions like Slurm.
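For example, once your kubeconfig is in place, you can inspect the cluster with standard kubectl commands (a quick sketch; output columns depend on your node configuration):

```bash
# List worker nodes and their readiness
kubectl get nodes

# Show each node's allocatable GPU count (the nvidia.com/gpu resource)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# See what is currently running across all namespaces
kubectl get pods --all-namespaces
```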
Slurm on Kubernetes via Slinky
For users who prefer HPC-style workflows, Together runs Slurm on top of Kubernetes using Slinky, an integration layer that bridges traditional HPC scheduling with cloud-native infrastructure:
Slurm Controller – Runs as Kubernetes pods, managing job queues and scheduling
Login Nodes – SSH-accessible entry points for job submission
Compute Nodes – GPU workers registered with both Kubernetes and Slurm
Unified Resource Management – Slinky translates Slurm jobs into Kubernetes pods while maintaining Slurm’s familiar interface
This architecture gives you the simplicity of sbatch and srun commands while leveraging Kubernetes’ reliability, scalability, and ecosystem.
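As a quick illustration (partition layout and GPU counts below are assumptions; run sinfo on your cluster to see the actual configuration), submitting work feels like any other Slurm cluster:

```bash
# Inspect partitions and node states
sinfo

# Launch an interactive single-node job with 8 GPUs and verify they are visible
srun --nodes=1 --gres=gpu:8 nvidia-smi

# Check your queued and running jobs
squeue -u $USER
```

Behind the scenes, Slinky schedules these jobs as Kubernetes pods on the GPU worker nodes.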
Key Features
Fast provisioning – Clusters ready in minutes, not hours or days
Flexible scaling – Scale up or down in real time to match workload demands
Persistent storage – Long-lived, resizable shared storage with high throughput
Multiple workload managers – Choose between Kubernetes or Slurm-on-Kubernetes
Full API access – Manage clusters via REST API, CLI, or Terraform
Enterprise integration – Works with SkyPilot and other orchestration tools
Available Hardware
Choose from the latest NVIDIA GPU configurations:
NVIDIA HGX B200 – Latest generation for maximum performance
NVIDIA HGX H200 – Enhanced memory for large models
NVIDIA HGX H100 SXM – High-bandwidth training and inference
NVIDIA HGX H100 SXM - Inference – Optimized for single-node inference workloads
All configurations except the inference-optimized variant include high-speed InfiniBand networking for multi-node training.
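Once you have a shell on a node (or a pod scheduled onto one), standard tools confirm the GPU model and fabric (a quick sketch; exact device names vary by configuration):

```bash
# GPU model, driver version, and memory per device
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv

# Interconnect topology between GPUs (NVLink/PCIe paths)
nvidia-smi topo -m

# InfiniBand adapter state (absent on inference-optimized variants)
ibstat
```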
Capacity Options
GPU Clusters offer two billing modes to match different workload patterns and budget requirements. You can choose Reserved capacity for predictable, sustained workloads with cost savings, or On-demand capacity for flexible, pay-as-you-go usage.
Reserved Capacity
Reserve GPU capacity upfront for a commitment period of 1–90 days at discounted rates.
How It Works:
Upfront payment – Payment is charged (or credits deducted) when the cluster is provisioned
Fixed duration – Reserve capacity for 1 to 90 days
Discounted pricing – Lower rates compared to on-demand
Automatic decommission – Clusters are decommissioned when the reservation expires
When to Use Reserved:
Predictable workloads where you know the duration
Multi-day training runs or experiments
Cost optimization with discounted rates
Planned workloads with specific commitments
Extending Reservations: To extend a reservation beyond its initial period, contact your account team.
On-demand Capacity
Pay only for what you use with hourly billing and no upfront commitment.
How It Works:
Hourly billing – Pay per hour of cluster runtime
No commitment – Terminate anytime without penalty
Flexible – Scale up and down as needed
Standard pricing – Higher per-hour rates than reserved capacity
When to Use On-demand:
Variable or unpredictable resource needs
Short-term experiments or development work
Exploratory testing before committing to longer runs
Temporary capacity needs beyond reserved baseline
Mixing Capacity Types
You can combine reserved and on-demand capacity in the same cluster for optimal cost and flexibility:
Start with reserved capacity for your baseline workload (e.g., reserve 8xH100 for 30 days)
Add on-demand capacity during peak periods (e.g., scale to 16xH100 temporarily)
Scale back down when the burst period ends – on-demand capacity is removed, reserved capacity remains
Any usage beyond your reserved capacity is automatically billed at on-demand rates.
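To make the billing split concrete, continuing the hypothetical scenario above: reserving 8xH100 for 30 days bills those 8 GPUs at the discounted reserved rate for the full period, while bursting to 16xH100 for 48 hours adds 8 × 48 = 384 GPU-hours at the on-demand rate. When the burst ends, only the reserved baseline continues to accrue charges.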
Choosing the Right Type
Choose Reserved if:
✓ You know the duration of your workload
✓ You’re running multi-day training or experiments
✓ Cost optimization is important
✓ You can commit to a specific period
Choose On-demand if:
✓ Your resource needs are unpredictable
✓ You’re running short experiments
✓ You need maximum flexibility
✓ You’re in development/testing phase
Mix Both if:
✓ You have a predictable baseline with occasional bursts
✓ You want cost savings on steady-state workload
✓ You need flexibility for peak periods
Storage
Clusters include multiple storage tiers:
Shared volumes – Multi-NIC bare-metal storage with high throughput, persistent across the cluster lifecycle
Local NVMe – Fast local disks on each node for temporary data
Shared /home – NFS-mounted home directories for code and configs
Storage can be dynamically resized as your data grows.
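For example, if a shared volume is backed by a PersistentVolumeClaim whose storage class supports expansion, it can be grown in place (the claim name and target size here are illustrative):

```bash
# Request a larger size on an existing claim; the storage class must allow expansion
kubectl patch pvc shared-data -p '{"spec":{"resources":{"requests":{"storage":"2Ti"}}}}'

# Watch the resize complete
kubectl get pvc shared-data -w
```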
Learn more about storage →
Workload Management
Kubernetes
Use standard Kubernetes workflows with kubectl to:
Deploy pods and jobs (see the sketch after this list)
Manage persistent volumes
Access the Kubernetes Dashboard
Integrate with existing K8s tooling
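For example, a minimal single-GPU job could look like the following sketch (the job name, container image, and GPU count are illustrative; adjust them to your workload):

```bash
# Submit a one-off GPU job that runs nvidia-smi and exits
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1
EOF

# Follow the job's output
kubectl logs -f job/gpu-smoke-test
```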
Slurm
For HPC-style workflows, use Slurm with:
Direct SSH access to login nodes
Familiar commands (sbatch, srun, squeue)
Job arrays for distributed processing (see the sketch after this list)
Traditional batch scheduling
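A sketch of a job-array submission (the script name, paths, and GPU counts are assumptions for illustration):

```bash
#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --array=0-7            # 8 independent array tasks
#SBATCH --nodes=1
#SBATCH --gres=gpu:1           # one GPU per task
#SBATCH --output=logs/%x_%a.out   # logs/ must exist before submission

# Each array task processes its own shard, indexed by the task ID
srun python preprocess.py --shard "$SLURM_ARRAY_TASK_ID"
```

Submit the script with sbatch and monitor it with squeue; each array task is scheduled independently as GPUs become available.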
Learn more about Slurm →
Getting Started
Ready to create your first cluster?
Follow the Quickstart guide for step-by-step instructions
Review the Capacity Options above to choose the right billing mode
Check Pricing for current rates
Support
Capacity unavailable? Use the “Notify Me” option to get alerts when capacity comes online
Questions or custom requirements? Contact [email protected]