Overview

Instant Clusters allows you to create high-performance GPU clusters in minutes. With features like on-demand scaling, long-lived, resizable, high-bandwidth shared storage local to the data center, Kubernetes and Slurm cluster flavors, a REST API, and Terraform support, you can run workloads flexibly without complex infrastructure management.

Quickstart: Create an Instant Cluster

  1. Log in to api.together.ai.
  2. Click GPU Clusters in the top navigation menu.
  3. Click Create Cluster.
  4. Choose whether you want Reserved capacity or On-demand, based on your needs.
  5. Select the cluster size, for example 8xH100.
  6. Enter a cluster name.
  7. Choose a cluster type: either Kubernetes or Slurm.
  8. Select a region.
  9. Choose the reservation duration for your cluster.
  10. Create and name your shared volume (minimum size 1 TiB).
  11. Optional: Select your NVIDIA driver and CUDA versions.
  12. Click Proceed.
Your cluster will now be ready for you to use.

Capacity Types

  • Reserved: You pay upfront to reserve GPU capacity for a duration of 1 to 90 days.
  • On-demand: You pay as you go for GPU capacity on an hourly basis. No pre-payment or reservation needed, and you can terminate your cluster at any time.

Node Types

The following node types are available in Instant Clusters:
  • NVIDIA HGX B200
  • NVIDIA HGX H200
  • NVIDIA HGX H100 SXM
  • NVIDIA HGX H100 SXM - Inference (lower InfiniBand multi-node GPU-to-GPU bandwidth; suitable for single-node inference)
If you don’t see an available node type, select the “Notify Me” option to get notified when capacity comes online. You can also contact us with your request via [email protected].

Pricing

Pricing information for different GPU node types can be found here.

Cluster Status

  • From the UI, verify that your cluster transitions to Ready.
  • Monitor progress and health indicators directly from the cluster list.

Start Training with Kubernetes

Install kubectl

Install kubectl in your environment, for example on macOS with Homebrew:
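brew install kubectl       # install the Kubernetes CLI via Homebrew
kubectl version --client   # confirm the client is on your PATH
On other platforms, follow the official Kubernetes installation instructions for your operating system.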

Download kubeconfig

From the Instant Clusters UI, download the kubeconfig and copy it to your local machine:
~/.kube/together_instant.kubeconfig
export KUBECONFIG=$HOME/.kube/together_instant.kubeconfig
kubectl get nodes
You can rename the file to config, but back up your existing config first.

Verify Connectivity

kubectl get nodes
You should see all worker and control plane nodes listed.
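For example (node names, ages, and versions below are illustrative; the names echo the node names used later in this guide):
NAME                 STATUS   ROLES           AGE   VERSION
cp-8ct86             Ready    control-plane   3h    v1.29.4
cp-fjqbt             Ready    control-plane   3h    v1.29.4
cp-hst5f             Ready    control-plane   3h    v1.29.4
gpu-dp-gsd6b-k4m4x   Ready    <none>          3h    v1.29.4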

Deploy a Pod with Storage

  • Create a PersistentVolumeClaim for shared storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-pvc
spec:
  accessModes:
    - ReadWriteMany   # Multiple pods can read/write
  resources:
    requests:
      storage: 10Gi   # Requested size
  storageClassName: shared-storage-class
  • Create a PersistentVolumeClaim for local storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc
spec:
  accessModes:
    - ReadWriteOnce   # Only one pod/node can mount at a time
  resources:
    requests:
      storage: 50Gi
  storageClassName: local-storage-class
  • Mount them into a pod:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: Never
  containers:
    - name: debian
      image: debian:stable-slim
      command: ["/bin/sh", "-c", "sleep infinity"]
      volumeMounts:
        - name: shared-pvc
          mountPath: /mnt/shared
        - name: local-pvc
          mountPath: /mnt/local
  volumes:
    - name: shared-pvc
      persistentVolumeClaim:
        claimName: shared-pvc
    - name: local-pvc
      persistentVolumeClaim:
        claimName: local-pvc
Apply and connect:
kubectl apply -f manifest.yaml
kubectl exec -it test-pod -- bash
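To confirm both claims are bound and the volumes are mounted, you can check the PVCs and inspect the mounts from inside the pod:
kubectl get pvc
kubectl exec -it test-pod -- df -h /mnt/shared /mnt/local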

Kubernetes Dashboard Access

  • From the cluster UI, click the K8s Dashboard URL.
  • Retrieve your access token using the following command (pbcopy puts it on the clipboard on macOS; on Linux, drop the final | pbcopy and copy the printed token instead):
kubectl -n kubernetes-dashboard get secret $(kubectl -n kubernetes-dashboard get secret | grep admin-user-token | awk '{print $1}') -o jsonpath='{.data.token}' | base64 -d | pbcopy

Cluster Scaling

Clusters can scale flexibly in real time. By adding on-demand compute to your cluster, you can temporarily scale up to more GPUs when workload demand spikes and then scale back down as it wanes. Scaling up or down can be performed using the UI, tcloud CLI, or REST API.

Targeted Scale-down

If you wish to mark which node or nodes should be targeted for scale-down, you can:
  • Either cordon the k8s node(s) or add the node.together.ai/delete-node-on-scale-down: "true" annotation to the k8s node(s) (example commands after this list).
  • Then trigger scale-down via the cloud console UI (or CLI, REST API).
  • Instant Clusters will ensure that cordoned + annotated nodes are prioritized for deletion above all others.
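For example, with kubectl (substitute your node name for <NODE_NAME>):
kubectl cordon <NODE_NAME>
kubectl annotate node <NODE_NAME> node.together.ai/delete-node-on-scale-down="true"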

Storage Management

Instant Clusters supports long-lived, resizable, in-DC shared storage. You can dynamically create and attach volumes to your cluster at creation time and resize them as your data grows. All shared storage is backed by multi-NIC bare-metal paths, ensuring high throughput and low latency.

Upload Your Data

To upload data to the cluster from your local machine, follow these steps:
  • Create a PVC using the shared-rdma storage class, along with a pod that mounts the volume (see the sketch after this list)
  • Run kubectl cp LOCAL_FILENAME YOUR_POD_NAME:/data/
  • Note: This method is suitable for smaller datasets; for larger datasets, we recommend scheduling a pod on the cluster that downloads from S3.
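A minimal sketch of such a PVC, assuming the shared-rdma storage class named above (the claim name and size are illustrative; reuse the pod pattern from Deploy a Pod with Storage to mount it at /data):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: upload-pvc        # illustrative name
spec:
  accessModes:
    - ReadWriteMany       # shared volumes support multiple readers/writers
  resources:
    requests:
      storage: 100Gi      # illustrative size
  storageClassName: shared-rdma
With a pod mounting upload-pvc at /data, the kubectl cp command above copies local files into the shared volume.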

Compute Access

You can run workloads on Instant Clusters using Kubernetes or Slurm-on-Kubernetes.

Kubernetes

Use kubectl to submit jobs, manage pods, and interact with your cluster. See Quickstart for setup details.

Slurm Direct SSH

For HPC workflows, you can enable Slurm-on-Kubernetes:
  • Directly SSH into a Slurm node.
  • Use familiar Slurm commands (sbatch, srun, etc.) to manage distributed training jobs.
This provides the flexibility of traditional HPC job scheduling alongside Kubernetes.
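An illustrative sbatch script for a two-node job (node counts, resource values, and the training entrypoint are assumptions; adapt them to your workload):
#!/bin/bash
# Illustrative two-node job: one launcher task per node, 8 GPUs each
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# srun fans the command out to every allocated node; train.py is a placeholder
srun python train.py
Submit it with sbatch and monitor it with squeue.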

APIs and Integrations

tcloud CLI

Download the tcloud CLI, then authenticate via Google SSO:
tcloud sso login
Create a cluster:
tcloud cluster create my-cluster \
  --num-gpus 8 \
  --reservation-duration 1 \
  --instance-type H100-SXM \
  --region us-central-8 \
  --shared-volume-name my-volume \
  --size-tib 1
Optionally, you can specify whether to provision reserved or on-demand capacity with the --billing-type flag, setting its value to either prepaid (i.e., a reservation) or on_demand.
tcloud cluster create my-cluster \
  --num-gpus 8 \
  --billing-type prepaid \
  --reservation-duration 1 \
  --instance-type H100-SXM \
  --region us-central-8 \
  --shared-volume-name my-volume \
  --size-tib 1
Delete a cluster:
tcloud cluster delete <CLUSTER_UUID>

REST API

All cluster management actions (create, scale, delete, storage, etc.) are available via REST API endpoints for programmatic control. The API documentation can be found here.

Terraform Provider

Use the Together Terraform Provider to define clusters, storage, and scaling policies as code. This allows reproducible infrastructure management integrated with existing Terraform workflows.

SkyPilot

You can orchestrate AI workloads on Instant Clusters using SkyPilot. The following example shows how to connect SkyPilot to a Together Instant Cluster and orchestrate gpt-oss-20b finetuning on it.

Use Together Instant Cluster with SkyPilot

  1. Install SkyPilot with Kubernetes support (quote the extra so your shell doesn't expand the brackets):
    uv pip install "skypilot[kubernetes]"
    
  2. Launch a Together Instant Cluster with the cluster type set to Kubernetes.
  • Get the Kubernetes config for the cluster.
  • Save the kubeconfig to a file, say ./together.kubeconfig.
  • Copy the kubeconfig to ~/.kube/config, or merge it with your existing kubeconfig file:
    mkdir -p ~/.kube
    cp ./together.kubeconfig ~/.kube/config
    
    or
    KUBECONFIG=./together.kubeconfig:~/.kube/config kubectl config view --flatten > /tmp/merged_kubeconfig && mv /tmp/merged_kubeconfig ~/.kube/config
    
    SkyPilot automatically picks up your credentials for the Together Instant Cluster.
  3. Check that SkyPilot can access the Together Instant Cluster:
    $ sky check k8s
    Checking credentials to enable infra for SkyPilot.
      Kubernetes: enabled [compute]
        Allowed contexts:
        └── t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6-admin@t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6: enabled.
    
    🎉 Enabled infra 🎉
      Kubernetes [compute]
        Allowed contexts:
        └── t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6-admin@t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6
    
    To enable a cloud, follow the hints above and rerun: sky check
    If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html
    
    Your Together cluster is now accessible with SkyPilot.
  4. Check the available GPUs on the cluster:
    $ sky show-gpus --infra k8s
    Kubernetes GPUs
    Context: t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6-admin@t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6
    GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION  
    H100  1, 2, 4, 8                8 of 8 free  
    Kubernetes per-node GPU availability
    CONTEXT                                                                              NODE                GPU   UTILIZATION  
    t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6-admin@t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6  cp-8ct86            -     0 of 0 free  
    t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6-admin@t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6  cp-fjqbt            -     0 of 0 free  
    t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6-admin@t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6  cp-hst5f            -     0 of 0 free  
    t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6-admin@t-51326e6b-25ec-42dd-8077-6f3c9b9a34c6  gpu-dp-gsd6b-k4m4x  H100  8 of 8 free  
    

Example: Finetuning gpt-oss-20b on the Together Instant Cluster

Launching a gpt-oss finetuning job on the Together cluster is now as simple as a single command:
sky launch -c gpt-together gpt-oss-20b.yaml
You can download the yaml file here.
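If you want a sense of the task's shape before downloading the file, a SkyPilot task YAML generally looks like the sketch below (the resource values, file names, and commands are illustrative assumptions, not the contents of the linked file):
resources:
  accelerators: H100:8   # request all 8 GPUs on a node
num_nodes: 1
setup: |
  # placeholder: install your finetuning dependencies
  pip install -r requirements.txt
run: |
  # placeholder entrypoint for the gpt-oss-20b finetuning job
  python finetune.py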

Billing

Compute Billing

Instant Clusters offers two compute billing options: reserved and on-demand.
  • Reservations – Credits for the full reserved duration are charged upfront (deducted from your credit balance) once the cluster is provisioned. Any usage beyond the reserved capacity is billed at on-demand rates.
  • On-Demand – Pay only for the time your cluster is running, with no upfront commitment.
See our pricing page for current rates.

Storage Billing

Storage is billed on a pay-as-you-go basis, as detailed on our pricing page. You can freely increase or decrease your storage volume size, with all usage billed at the same rate.

Viewing Usage and Invoices

You can view your current usage anytime on the Billing page in Settings. Each invoice includes a detailed breakdown of reservation, burst, and on-demand usage for compute and storage.

Cluster and Storage Lifecycles

Clusters and storage volumes follow different lifecycle policies:
  • Compute Clusters – Clusters are automatically decommissioned when their reservation period ends. To extend a reservation, please contact your account team.
  • Storage Volumes – Storage volumes are persistent and remain available as long as your billing account is in good standing. They are not automatically deleted.

Running Out of Credits

When your credits are exhausted, resources behave differently depending on their type:
  • Reserved Compute – Existing reservations remain active until their scheduled end date. Any additional on-demand capacity used to scale beyond the reservation is decommissioned.
  • Fully On-Demand Compute – Clusters are first paused and then decommissioned if credits are not restored.
  • Storage Volumes – Access is revoked first, and the data is later decommissioned.
You will receive alerts before these actions take place. For questions or assistance, please contact your billing team.