Kubernetes Usage
Use kubectl to interact with Kubernetes clusters for containerized workloads.
Deploy Pods with Storage
New to Kubernetes? A PersistentVolumeClaim (PVC) is a request for storage that your pods can use. Think of it like requesting a disk that persists even when pods restart.
Understanding Storage in Kubernetes
Kubernetes uses a three-part model for storage:
- PersistentVolume (PV) - The actual storage resource (managed by Together AI)
- PersistentVolumeClaim (PVC) - Your request to use that storage (you create this)
- Pod with volumeMounts - Mounts the PVC into your container at a specific path (you create this)
Step 1: Create a PersistentVolumeClaim
Shared Storage PVC (Multi-Pod Access):
- accessModes: ReadWriteMany - Allows multiple pods across different nodes to mount this volume simultaneously
- volumeName - Must match the exact name of your shared volume shown in the cluster UI
- storage: 10Gi - The amount of storage you're requesting

Local Storage PVC (Single-Pod Access):
- accessModes: ReadWriteOnce - Only one pod can mount this volume (typically for fast local NVMe storage)
- storageClassName - Specifies the type of storage to provision
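The two claims might look like the following sketch; the volume name, storage class name, and sizes are assumptions, so substitute the values shown in your cluster UI:

```yaml
# shared-pvc.yaml - claim bound to the pre-provisioned shared volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""           # empty when binding to a specific volume by name
  volumeName: my-shared-volume   # replace with the volume name from the cluster UI
  resources:
    requests:
      storage: 10Gi
---
# local-pvc.yaml - claim for fast local NVMe storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-nvme   # assumed name; use the storage class your cluster provides
  resources:
    requests:
      storage: 10Gi
```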
Save each claim to a file (e.g., shared-pvc.yaml, local-pvc.yaml) and apply:
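A minimal apply-and-check sequence, assuming the file names above:

```shell
kubectl apply -f shared-pvc.yaml
kubectl apply -f local-pvc.yaml
kubectl get pvc   # both claims should report STATUS: Bound
```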
Both PVCs should show STATUS: Bound.
Step 2: Create a Pod with Mounted Volumes
Now create a pod that mounts these volumes:
- volumeMounts.mountPath - The directory path inside your container where the volume will appear
- volumes[].name - An internal identifier that connects the volume definition to the volumeMount
- persistentVolumeClaim.claimName - Must exactly match the PVC name you created in Step 1
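A pod manifest wiring both claims together might look like this sketch; the image and mount paths are illustrative:

```yaml
# pod-with-storage.yaml - mounts the shared and local claims (names illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-storage
spec:
  containers:
    - name: app
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: shared-storage
          mountPath: /shared     # shared volume appears here inside the container
        - name: local-storage
          mountPath: /scratch    # local NVMe appears here
  volumes:
    - name: shared-storage
      persistentVolumeClaim:
        claimName: shared-pvc    # must match the PVC name from Step 1
    - name: local-storage
      persistentVolumeClaim:
        claimName: local-pvc
```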
Step 3: Deploy and Access Your Pod
Save the pod definition to a file (e.g., pod-with-storage.yaml) and deploy:
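Assuming the file name above, a minimal deploy-and-enter sequence:

```shell
kubectl apply -f pod-with-storage.yaml
kubectl get pod pod-with-storage            # wait for STATUS: Running
kubectl exec -it pod-with-storage -- bash
# once inside, check the mounts, e.g. df -h /shared /scratch (illustrative paths)
```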
Step 4: Verify Mounted Volumes
Once inside the pod, verify your volumes are mounted.

Accessing Volumes from Multiple Pods
Because the shared storage uses ReadWriteMany, multiple pods can access it simultaneously:
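For example, a second pod (sketch; names illustrative) can mount the same shared-pvc claim:

```yaml
# second-pod.yaml - a second pod mounting the same shared claim
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-storage-2
spec:
  containers:
    - name: app
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: shared-storage
          mountPath: /shared
  volumes:
    - name: shared-storage
      persistentVolumeClaim:
        claimName: shared-pvc    # same claim as the first pod
```

A file written under the shared mount from one pod is then visible from the other.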
Understanding GPU Access in Containers for Kubernetes Clusters
Our Kubernetes runtime exposes all GPU devices to all containers on the host. However, whether you can use tools like nvidia-smi inside your container depends on your container image.
Two scenarios:
1. Container with CUDA drivers (e.g., nvidia/cuda, pytorch/pytorch):
- ✓ GPU devices are accessible
- ✓ nvidia-smi works
- ✓ CUDA libraries available
- Recommended for GPU workloads
2. Container without CUDA drivers (e.g., debian, ubuntu base images):
- ✓ GPU devices are still exposed by the runtime
- ✗ nvidia-smi command not found (CUDA drivers not installed in container)
- ✗ Cannot run GPU workloads without installing CUDA
- GPU hardware is accessible, but you need CUDA software to use it
Key Concept: The container runtime makes GPU devices available, but the container image must include CUDA drivers and tools to interact with them. Think of it like having a GPU plugged in (runtime provides this) but needing drivers installed (image must provide this).
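As an illustration of the first scenario, a throwaway pod built on a CUDA base image can run nvidia-smi directly (the image tag is an assumption; pick any current nvidia/cuda tag):

```yaml
# gpu-check.yaml - sketch of a pod whose image bundles the CUDA user-space tools
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # CUDA tools included in image
      command: ["nvidia-smi"]                     # prints GPU inventory, then exits
```

kubectl logs gpu-check should then show the familiar nvidia-smi table; the same manifest with a bare ubuntu image would fail with "command not found".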
Kubernetes Dashboard
Access the Kubernetes Dashboard for visual cluster management:
- From the cluster UI, click the K8s Dashboard URL
- Retrieve your access token:
- Paste the token into the dashboard login
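The exact token command depends on how the dashboard's service account is named in your cluster; a common pattern (account name and namespace are assumptions) is:

```shell
kubectl -n kubernetes-dashboard create token admin-user
```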
Direct SSH Access
Prerequisites
- SSH key must be added to your account at api.together.ai/settings/ssh-key
SSH to GPU Worker Nodes (Kubernetes) and Compute Nodes (Slurm)
You can SSH directly into any GPU worker node or Slurm compute node from the cluster UI:
- Navigate to your cluster in the Together Cloud UI
- Go to the Worker Nodes section
- Find the node you want to access
- Click the Copy icon under the host column next to the node
- Paste and run the command in your terminal
Once connected, you can:
- Check GPU utilization across all GPUs on the node with nvidia-smi
- Monitor node-level performance metrics (CPU, memory, disk, network)
- Inspect system logs (journalctl, /var/log)
- Debug node-level networking or storage issues
- Check Kubernetes kubelet status and logs
- View all processes running on the node
- On Slurm clusters, run GPU workloads directly on the compute nodes via SSH
SSH to Slurm Login Nodes
For HPC workflows, Slurm clusters provide SSH access to login nodes for job submission. The cluster UI shows copy-ready Slurm commands tailored to your cluster. Use these to quickly verify connectivity and submit jobs. Hostnames:
- Worker nodes: <node-name>.slurm.pod (e.g., gpu-dp-hmqnh-nwlnj.slurm.pod)
- Login node: always slurm-login (where you'll start most jobs)
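Putting the hostnames together, a typical session might look like this (the job script name is hypothetical):

```shell
ssh slurm-login                      # start on the login node
sinfo                                # check partitions and node states
sbatch train.sh                      # submit a job (train.sh is hypothetical)
ssh gpu-dp-hmqnh-nwlnj.slurm.pod     # or hop directly onto a worker node
```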
Cluster Scaling
Clusters can scale flexibly in real time. Add on-demand compute to temporarily scale up when workload demand spikes, then scale back down as demand decreases. Scaling operations can be performed via:
- Together Cloud UI
- tcloud CLI
- REST API
Cluster Autoscaling
Cluster Autoscaling automatically adjusts the number of nodes in your cluster based on workload demand using the Kubernetes Cluster Autoscaler.

How It Works: The Kubernetes Cluster Autoscaler monitors your cluster and:
- Scales up when pods are pending due to insufficient resources
- Scales down when nodes are underutilized for an extended period
- Respects constraints like minimum/maximum node counts and resource limits
To enable autoscaling:
- Navigate to GPU Clusters in the Together Cloud UI
- Click Create Cluster
- In the cluster configuration, toggle Enable Autoscaling
- Configure your maximum GPUs
- Create the cluster
Autoscaling works with both reserved and on-demand capacity. Scaling beyond reserved capacity will provision on-demand nodes at standard hourly rates.
Targeted Scale-down
To control which nodes are removed during scale-down:
- Cordon the node(s) to prevent new workloads
  - For Kubernetes: kubectl cordon <node_to_cordon>
  - For Slurm: sudo scontrol update NodeName=<node_name> State=drain Reason="<reason_for_cordoning>"
- Trigger scale-down via UI, CLI, or API
Storage Management
Clusters support long-lived, resizable shared storage with persistent data.

Storage Tiers
All clusters include:
- Shared volumes – Multi-NIC bare metal paths for high throughput
- Local NVMe disks – Fast local storage on each node
- Shared /home – NFS-mounted from head node for code and configs
Upload Data
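A hedged example of a small-dataset upload over SSH (the host name is an assumption; copy the real SSH target from the cluster UI):

```shell
# scp copies a local directory to shared /home on the cluster
scp -r ./my-dataset slurm-login:/home/$USER/data/
# rsync is resumable and incremental, useful for repeated uploads
rsync -avP ./my-dataset/ slurm-login:/home/$USER/data/my-dataset/
```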
For small datasets, use standard SSH tooling such as scp or rsync.

Resize Storage
Storage volumes can be dynamically resized as your data grows. Use the UI, CLI, or API to increase volume size. Learn more about storage options →

Monitoring and Status
Check Cluster Health
From the UI:
- View cluster status (Provisioning, Ready, Error)
- Monitor resource utilization
- Check node health indicators
Best Practices
Resource Management
- Use PersistentVolumes for shared data that persists across pod restarts
- Use local storage for temporary scratch space
- Set resource requests and limits in pod specs
Job Scheduling
- Use Kubernetes Jobs for batch processing
- Use Slurm job arrays for embarrassingly parallel workloads
- Set appropriate timeouts and retry policies
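For the batch-processing case, a minimal Kubernetes Job sketch with a retry policy, timeout, and resource requests (image, command, and numbers are placeholders):

```yaml
# batch-job.yaml - illustrative Job with retries, a timeout, and resource limits
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-train
spec:
  backoffLimit: 3                # retry failed pods up to 3 times
  activeDeadlineSeconds: 86400   # hard timeout: kill the job after 24h
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: pytorch/pytorch:latest   # placeholder image
          command: ["python", "train.py"] # placeholder entrypoint
          resources:
            requests:
              cpu: "8"
              memory: 32Gi
            limits:
              memory: 32Gi
```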
Data Management
- Download large datasets directly on the cluster (not via local machine)
- Use shared storage for training data and checkpoints
- Use local NVMe for temporary files during training
Scaling Strategy
- Start with reserved capacity for baseline workload
- Add on-demand capacity for burst periods
- Use targeted scale-down to control costs
GPU capacity not available
If you do not see GPU capacity of the type you require in the api.together.ai cloud console, you can request it from the create cluster view: select your region, GPU type, and capacity, choose the date from which you need the GPUs, and click the Request button. We use these requests as input for our demand planning, and our team will reach out to you if and when capacity becomes available. Submitting a request for capacity does not guarantee fulfillment due to very high demand; we do our best to fulfill requests based on available GPU capacity. If you need guaranteed GPU capacity for a fixed period of time, please reach out to our team.
Troubleshooting
Pods not scheduling
- Check node status: kubectl get nodes
- Verify resource requests don't exceed available resources
- Check for taints on nodes: kubectl describe node <node-name>
Storage mount issues
- Verify PVC is bound: kubectl get pvc
- Check volume name matches your shared volume
- Ensure storage class exists for local storage
Slurm jobs not running
- Check node status: sinfo
- Verify partition is available
- Check job status: scontrol show job <jobid>