
Kubernetes Usage

Use kubectl to interact with Kubernetes clusters for containerized workloads.

Deploy Pods with Storage

Create PersistentVolumeClaims to access shared and local storage. We provide a static PersistentVolume with the same name as your shared volume. As long as you use the static PV, your data will persist.
Shared Storage PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-pvc
spec:
  accessModes:
    - ReadWriteMany   # Multiple pods can read/write
  resources:
    requests:
      storage: 10Gi   # Requested size
  volumeName: <shared volume name>
Local Storage PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc
spec:
  accessModes:
    - ReadWriteOnce   # Only one pod/node can mount at a time
  resources:
    requests:
      storage: 50Gi
  storageClassName: local-storage-class
Mount volumes in a pod:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: Never
  containers:
    - name: debian
      image: debian:stable-slim
      command: ["/bin/sh", "-c", "sleep infinity"]
      volumeMounts:
        - name: shared-pvc
          mountPath: /mnt/shared
        - name: local-pvc
          mountPath: /mnt/local
  volumes:
    - name: shared-pvc
      persistentVolumeClaim:
        claimName: shared-pvc
    - name: local-pvc
      persistentVolumeClaim:
        claimName: local-pvc
Apply and connect:
kubectl apply -f manifest.yaml
kubectl exec -it test-pod -- bash
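If the pod does not start or a mount looks empty, a few quick checks (run from your workstation):
kubectl get pvc                     # both claims should report STATUS Bound
kubectl get pod test-pod            # wait for STATUS Running
kubectl exec test-pod -- df -h /mnt/shared /mnt/local   # confirm both mounts are present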

Kubernetes Dashboard

Access the Kubernetes Dashboard for visual cluster management:
  1. From the cluster UI, click the K8s Dashboard URL
  2. Retrieve your access token:
kubectl -n kubernetes-dashboard get secret \
  $(kubectl -n kubernetes-dashboard get secret | grep admin-user-token | awk '{print $1}') \
  -o jsonpath='{.data.token}' | base64 -d | pbcopy   # pbcopy copies to the clipboard on macOS; omit it to print the token
  3. Paste the token into the dashboard login

Slurm Direct SSH

For HPC workflows, Slurm clusters provide direct SSH access to login nodes.

Prerequisites

SSH to Login Pod

The cluster UI shows copy-ready Slurm commands tailored to your cluster. Use these to quickly verify connectivity and submit jobs.
Hostnames:
  • Worker pods: <node-name>.slurm.pod (e.g., gpu-dp-hmqnh-nwlnj.slurm.pod)
  • Login pod: Always slurm-login (where you’ll start most jobs)
Common Slurm commands:
sinfo          # View node and partition status
squeue         # View job queue
srun           # Run interactive jobs
sbatch         # Submit batch jobs
scancel        # Cancel jobs
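For example, a minimal batch script might look like the following sketch; the GPU, partition, and time options are placeholders to adapt to your cluster (check sinfo for the names it actually uses):
#!/bin/bash
#SBATCH --job-name=gpu-smoke-test   # name shown in squeue
#SBATCH --nodes=1                   # request a single node
#SBATCH --gres=gpu:1                # request one GPU (GRES name is an assumption)
#SBATCH --time=00:10:00             # wall-clock limit
#SBATCH --output=%x-%j.out          # log file: <job-name>-<job-id>.out

srun hostname
srun nvidia-smi
Submit it from the slurm-login pod with sbatch gpu-smoke-test.sbatch and watch it with squeue.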

VS Code Remote SSH Setup

To use VS Code with your Slurm cluster, configure SSH with a proxy jump host in your ~/.ssh/config:
# Keep connections alive
Host *
  ServerAliveInterval 60

# Together AI jump host (if applicable)
Host together-jump
  HostName <your-jump-host>
  User <your-username>

# Your Slurm login node
Host slurm-cluster
  HostName slurm-login
  ProxyJump together-jump
  User <your-username>
Then in VS Code’s Remote SSH extension, connect to slurm-cluster. The connection will automatically route through the jump host.
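To sanity-check the config before opening VS Code (host aliases as defined above):
ssh slurm-cluster          # should land on the Slurm login pod via the jump host
sinfo                      # run on the login pod to confirm Slurm responds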
Learn more about Slurm →

Cluster Scaling

Clusters can scale flexibly in real time. Add on-demand compute to temporarily scale up when workload demand spikes, then scale back down as demand decreases. Scaling operations can be performed via:
  • Together Cloud UI
  • tcloud CLI
  • REST API

Cluster Autoscaling

Cluster Autoscaling automatically adjusts the number of nodes in your cluster based on workload demand, using the Kubernetes Cluster Autoscaler.
How It Works: The autoscaler monitors your cluster and:
  • Scales up when pods are pending due to insufficient resources
  • Scales down when nodes are underutilized for an extended period
  • Respects constraints like minimum/maximum node counts and resource limits
When pods cannot be scheduled due to a lack of resources, the autoscaler provisions additional nodes automatically. When nodes remain idle below a utilization threshold, they are safely drained and removed (see the example Job at the end of this section).
Enabling Autoscaling:
  1. Navigate to GPU Clusters in the Together Cloud UI
  2. Click Create Cluster
  3. In the cluster configuration, toggle Enable Autoscaling
  4. Configure your maximum GPUs
  5. Create the cluster
Once enabled, the autoscaler runs continuously in the background, responding to workload changes without manual intervention.
Autoscaling works with both reserved and on-demand capacity. Scaling beyond reserved capacity will provision on-demand nodes at standard hourly rates.
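For illustration, a pending workload like the sketch below is what triggers a scale-up when no existing node can satisfy it; the image, GPU count, and the nvidia.com/gpu resource name are assumptions to adapt to your cluster:
apiVersion: batch/v1
kind: Job
metadata:
  name: autoscale-demo
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gpu-check
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            requests:
              nvidia.com/gpu: 8    # stays Pending until a node with 8 free GPUs is available
            limits:
              nvidia.com/gpu: 8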

Targeted Scale-down

To control which nodes are removed during scale-down:
  1. Cordon the node(s) to prevent new workloads
  2. Add the annotation: node.together.ai/delete-node-on-scale-down: "true"
  3. Trigger scale-down via UI, CLI, or API
Cordoned and annotated nodes are prioritized for deletion ahead of all other nodes.
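A minimal sketch of the first two steps with kubectl (the node name is a placeholder):
kubectl cordon <node-name>
kubectl annotate node <node-name> node.together.ai/delete-node-on-scale-down="true"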

Storage Management

Clusters support long-lived, resizable shared storage with persistent data.

Storage Tiers

All clusters include:
  • Shared volumes – Multi-NIC bare metal paths for high throughput
  • Local NVMe disks – Fast local storage on each node
  • Shared /home – NFS-mounted from head node for code and configs

Upload Data

For small datasets:
# Create a PVC and pod with your shared volume mounted (e.g. test-pod above, mounted at /mnt/shared)
kubectl cp LOCAL_FILENAME POD_NAME:/mnt/shared/
For large datasets:
Schedule a pod on the cluster that downloads directly from S3 or your data source:
apiVersion: v1
kind: Pod
metadata:
  name: data-loader
spec:
  restartPolicy: Never   # run the download once; don't restart after completion
  containers:
    - name: downloader
      image: amazon/aws-cli
      command: ["aws", "s3", "cp", "s3://bucket/data", "/mnt/shared/", "--recursive"]
      volumeMounts:
        - name: shared-storage
          mountPath: /mnt/shared
  volumes:
    - name: shared-storage
      persistentVolumeClaim:
        claimName: shared-pvc
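Apply the manifest and follow the transfer (the manifest filename is an assumption):
kubectl apply -f data-loader.yaml
kubectl logs -f data-loader         # stream the aws s3 cp output
kubectl delete pod data-loader      # clean up once the copy finishes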

Resize Storage

Storage volumes can be dynamically resized as your data grows. Use the UI, CLI, or API to increase volume size.
Learn more about storage options →

Monitoring and Status

Check Cluster Health

From the UI:
  • View cluster status (Provisioning, Ready, Error)
  • Monitor resource utilization
  • Check node health indicators
From kubectl:
kubectl get nodes           # Node status
kubectl top nodes           # Resource usage
kubectl get pods --all-namespaces  # All running workloads
From Slurm:
sinfo                       # Node and partition status
squeue                      # Job queue
scontrol show node          # Detailed node info

Best Practices

Resource Management

  • Use PersistentVolumes for shared data that persists across pod restarts
  • Use local storage for temporary scratch space
  • Set resource requests and limits in pod specs (see the sketch below)
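A container spec with explicit requests and limits might look like this sketch; the values and the nvidia.com/gpu resource name are placeholders to size for your workload:
apiVersion: v1
kind: Pod
metadata:
  name: sized-pod
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: debian:stable-slim
      command: ["/bin/sh", "-c", "sleep infinity"]
      resources:
        requests:
          cpu: "8"              # scheduler reserves 8 CPU cores
          memory: 32Gi          # and 32 GiB of RAM
        limits:
          memory: 32Gi          # container is OOM-killed above this
          nvidia.com/gpu: 1     # GPUs are requested via limits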

Job Scheduling

  • Use Kubernetes Jobs for batch processing
  • Use Slurm job arrays for embarrassingly parallel workloads
  • Set appropriate timeouts and retry policies (example Job below)
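For instance, a Kubernetes Job with retries and a hard timeout might look like this sketch (names and values are placeholders); a Slurm equivalent for a parallel sweep would add an #SBATCH --array directive to a batch script like the one shown earlier:
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-example
spec:
  backoffLimit: 3                  # retry a failed pod up to 3 times
  activeDeadlineSeconds: 3600      # fail the job if it runs longer than 1 hour
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: debian:stable-slim
          command: ["/bin/sh", "-c", "echo processing && sleep 30"]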

Data Management

  • Download large datasets directly on the cluster (not via local machine)
  • Use shared storage for training data and checkpoints
  • Use local NVMe for temporary files during training

Scaling Strategy

  • Start with reserved capacity for baseline workload
  • Add on-demand capacity for burst periods
  • Use targeted scale-down to control costs

Troubleshooting

Pods not scheduling

  • Check node status: kubectl get nodes
  • Verify resource requests don’t exceed available resources
  • Check for taints on nodes: kubectl describe node <node-name>
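The scheduler's reason for a pending pod appears in its events:
kubectl describe pod <pod-name>            # look at the Events section at the bottom
kubectl get events --sort-by=.lastTimestamp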

Storage mount issues

  • Verify PVC is bound: kubectl get pvc
  • Check volume name matches your shared volume
  • Ensure storage class exists for local storage
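For more detail on a claim that stays Pending:
kubectl get pv                             # confirm the static PV exists and note its name
kubectl describe pvc shared-pvc            # events explain why binding failed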

Slurm jobs not running

  • Check node status: sinfo
  • Verify partition is available
  • Check job status: scontrol show job <jobid>
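Two more commands that often explain a stuck job:
scontrol show partition                    # confirm the partition is up and has idle nodes
squeue -j <jobid> --start                  # estimated start time for a pending job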

What’s Next?