Kubernetes Usage
Use `kubectl` to interact with Kubernetes clusters for containerized workloads.
Deploy Pods with Storage
Create PersistentVolumeClaims to access shared and local storage. We provide a static PersistentVolume with the same name as your shared volume. As long as you use the static PV, your data will persist.
Shared Storage PVC:
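A minimal sketch of a claim that binds to the static PV. The name `my-shared-volume`, the size, and the access mode are placeholders; use your shared volume's actual name and capacity.

```bash
# Sketch only: "my-shared-volume" is a placeholder for your shared volume's name;
# size and access mode may differ for your volume.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-shared-volume
spec:
  volumeName: my-shared-volume   # must match the static PV's name
  storageClassName: ""           # empty string binds to the static PV (no dynamic provisioning)
  accessModes:
    - ReadWriteMany              # access mode may differ for your volume
  resources:
    requests:
      storage: 1Ti               # illustrative size
EOF
```

Reference the claim from a pod through `persistentVolumeClaim.claimName` in the pod's `volumes` section.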
Kubernetes Dashboard
Access the Kubernetes Dashboard for visual cluster management:
- From the cluster UI, click the K8s Dashboard URL
- Retrieve your access token (see the example after this list)
- Paste the token into the dashboard login
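The exact token command depends on how the dashboard's service account is configured on your cluster; on Kubernetes 1.24+ a short-lived token can typically be minted with `kubectl create token`. The namespace and service-account name below are placeholders.

```bash
# Placeholder namespace and service-account name; adjust to your dashboard's setup
kubectl -n kubernetes-dashboard create token <dashboard-service-account>
```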
Slurm Direct SSH
For HPC workflows, Slurm clusters provide direct SSH access to login nodes.
Prerequisites
- Your SSH key must be added to your account at api.together.ai/settings/ssh-key
- Keys must be added before cluster creation so they are registered in the LDAP server
SSH to Login Pod
The cluster UI shows copy-ready Slurm commands tailored to your cluster. Use these to quickly verify connectivity and submit jobs (see the example below).
Hostnames:
- Worker pods: `<node-name>.slurm.pod` (e.g., `gpu-dp-hmqnh-nwlnj.slurm.pod`)
- Login pod: always `slurm-login` (where you’ll start most jobs)
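For example, assuming your SSH key is registered (`<user>` is a placeholder; the cluster UI shows the exact command for your cluster):

```bash
# Connect to the login pod
ssh <user>@slurm-login

# Verify connectivity: list partitions and node states
sinfo

# Submit a trivial test job
sbatch --wrap="hostname"
```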
Cluster Scaling
Clusters can scale flexibly in real time. Add on-demand compute to temporarily scale up when workload demand spikes, then scale back down as demand decreases.
Scaling operations can be performed via:
- Together Cloud UI
- tcloud CLI
- REST API
Cluster Autoscaling
Cluster Autoscaling automatically adjusts the number of nodes in your cluster based on workload demand using the Kubernetes Cluster Autoscaler.
How It Works
The Kubernetes Cluster Autoscaler monitors your cluster and:
- Scales up when pods are pending due to insufficient resources (see the commands after this list)
- Scales down when nodes are underutilized for an extended period
- Respects constraints like minimum/maximum node counts and resource limits
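For instance, you can inspect the pending pods that trigger a scale-up, and recent scheduling events, with standard kubectl queries:

```bash
# Pods stuck in Pending (e.g., waiting on GPU capacity) are what trigger a scale-up
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Recent events usually explain why a pod is pending or a node was removed
kubectl get events --sort-by=.lastTimestamp | tail -n 20
```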
To enable autoscaling when creating a cluster:
- Navigate to GPU Clusters in the Together Cloud UI
- Click Create Cluster
- In the cluster configuration, toggle Enable Autoscaling
- Configure your maximum GPUs
- Create the cluster
Autoscaling works with both reserved and on-demand capacity. Scaling beyond reserved capacity will provision on-demand nodes at standard hourly rates.
Targeted Scale-down
To control which nodes are removed during scale-down:
- Cordon the node(s) to prevent new workloads (see the commands after this list)
- Add the annotation `node.together.ai/delete-node-on-scale-down: "true"`
- Trigger scale-down via UI, CLI, or API
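For example, with standard kubectl commands (`<node-name>` is a placeholder):

```bash
# Keep new workloads off the node
kubectl cordon <node-name>

# Mark the node for removal on the next scale-down
kubectl annotate node <node-name> node.together.ai/delete-node-on-scale-down="true"
```

Then trigger the scale-down from the UI, tcloud CLI, or REST API as usual.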
Storage Management
Clusters support long-lived, resizable shared storage with persistent data.
Storage Tiers
All clusters include:
- Shared volumes – Multi-NIC bare metal paths for high throughput
- Local NVMe disks – Fast local storage on each node
- Shared `/home` – NFS-mounted from the head node for code and configs
Upload Data
For small datasets:
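For example, copy files from your workstation with `rsync` or `scp`. The username, login address (shown in your cluster UI), and destination path below are placeholders; use a directory on your shared volume so all nodes can see the data.

```bash
# Run from your workstation; <user>, <login-address>, and /shared/datasets are placeholders
rsync -avP ./my-dataset/ <user>@<login-address>:/shared/datasets/my-dataset/
# or, with scp:
scp -r ./my-dataset <user>@<login-address>:/shared/datasets/
```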
Resize Storage
Storage volumes can be dynamically resized as your data grows. Use the UI, CLI, or API to increase volume size. Learn more about storage options →
Monitoring and Status
Check Cluster Health
From the UI:
- View cluster status (Provisioning, Ready, Error)
- Monitor resource utilization
- Check node health indicators
Best Practices
Resource Management
- Use PersistentVolumes for shared data that persists across pod restarts
- Use local storage for temporary scratch space
- Set resource requests and limits in pod specs (see the sketch after this list)
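A minimal sketch tying these practices together: a shared-volume PVC for persistent data, an `emptyDir` for scratch space, and explicit requests and limits. The image, resource figures, and the claim name `my-shared-volume` are placeholders, and `nvidia.com/gpu` assumes the standard NVIDIA device plugin.

```bash
# Illustrative pod spec; image, sizes, and claim name are placeholders
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example image
      resources:
        requests:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: 1
        limits:
          cpu: "16"
          memory: 128Gi
          nvidia.com/gpu: 1
      volumeMounts:
        - name: shared-data        # persistent data on the shared volume
          mountPath: /data
        - name: scratch            # temporary scratch space
          mountPath: /scratch
  volumes:
    - name: shared-data
      persistentVolumeClaim:
        claimName: my-shared-volume
    - name: scratch
      emptyDir: {}
EOF
```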
Job Scheduling
- Use Kubernetes Jobs for batch processing
- Use Slurm job arrays for embarrassingly parallel workloads (see the sketch after this list)
- Set appropriate timeouts and retry policies
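A minimal Slurm job-array sketch; the array range, GPU request, time limit, and `process_shard.py` are placeholders.

```bash
#!/bin/bash
# submit_array.sbatch -- minimal job-array sketch; values and script are placeholders
#SBATCH --job-name=array-demo
#SBATCH --array=0-15                  # 16 independent tasks
#SBATCH --gres=gpu:1                  # one GPU per task, if the partition has GPUs
#SBATCH --time=01:00:00               # explicit timeout
#SBATCH --output=%x_%A_%a.out         # jobname_jobid_taskid

# Each array task processes a different shard of the input
python process_shard.py --shard "${SLURM_ARRAY_TASK_ID}"
```

Submit it from the login pod with `sbatch submit_array.sbatch`; `SLURM_ARRAY_TASK_ID` gives each task its shard index.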
Data Management
- Download large datasets directly on the cluster, not via your local machine (see the example after this list)
- Use shared storage for training data and checkpoints
- Use local NVMe for temporary files during training
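For example, run the download on the login pod or inside a job so the data lands directly on shared storage. The URL and the `/shared/datasets` path are placeholders.

```bash
# Run on the cluster (e.g., the slurm-login pod), not on your laptop.
# /shared/datasets is a placeholder for a directory on your shared volume.
cd /shared/datasets
wget https://example.com/my-dataset.tar.gz
tar -xzf my-dataset.tar.gz
```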
Scaling Strategy
- Start with reserved capacity for baseline workload
- Add on-demand capacity for burst periods
- Use targeted scale-down to control costs
Troubleshooting
Pods not scheduling
- Check node status: `kubectl get nodes`
- Verify resource requests don’t exceed available resources
- Check for taints on nodes: `kubectl describe node <node-name>`
Storage mount issues
- Verify PVC is bound: `kubectl get pvc`
- Check volume name matches your shared volume
- Ensure storage class exists for local storage
Slurm jobs not running
- Check node status: `sinfo`
- Verify partition is available
- Check job status: `scontrol show job <jobid>`