Instant Clusters

This document explains how to create an Instant Cluster and how to start training on it with Kubernetes.

V.1.4

Create your Instant Cluster

  1. Create a Cluster
    1. Click on the cluster size, for example 8xH100
    2. Enter a cluster name
    3. Choose a cluster type
    4. Select a region
    5. Select the required duration for your cluster
    6. Create and name your shared volume. The minimum size is 1 TB
    7. Optional: select your NVIDIA driver and CUDA versions
    8. Click Proceed
  2. Check the status of your cluster
  3. Increase your cluster size: click on the … in the cluster line, click Edit Cluster, set “Number of GPUs” to the desired amount, and click Update

Start training with Kubernetes

  1. Prerequisites: install kubectl in your environment. For example, on macOS follow: https://kubernetes.io/docs/tasks/tools/install-kubectl-macos/

  2. Get the cluster Kubeconfig
    To schedule Kubernetes jobs on your cluster, download the kubeconfig file from the Instant Clusters UI page and copy it to your local machine, for example to:

~/.kube/config_k8s_together_instant

Then either export it for your shell session:

export KUBECONFIG=$HOME/.kube/config_k8s_together_instant

or pass it explicitly on each command:

kubectl --kubeconfig=$HOME/.kube/config_k8s_together_instant get nodes

Note: You can also save the file under the default name “config”. If you do, back up your existing config file first.
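If you choose to replace the default config, the backup can be done as in the sketch below (paths assume the standard ~/.kube layout):

```shell
# Back up the current default kubeconfig (if one exists), then install the
# downloaded Instant Cluster kubeconfig as the default.
if [ -f "$HOME/.kube/config" ]; then
  cp "$HOME/.kube/config" "$HOME/.kube/config.backup"
fi
cp "$HOME/.kube/config_k8s_together_instant" "$HOME/.kube/config"
```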

  3. Verify you can connect to your K8s cluster
kubectl get nodes
NAME                             STATUS   ROLES                       AGE   VERSION
5fa43eae-01.cloud.together.ai    Ready    <none>                      21h   v1.31.4+k3s1
5fa43eae-02.cloud.together.ai    Ready    <none>                      21h   v1.31.4+k3s1
5fa43eae-hn1.cloud.together.ai   Ready    control-plane,etcd,master   22h   v1.31.4+k3s1
5fa43eae-hn2.cloud.together.ai   Ready    control-plane,etcd,master   8h    v1.31.4+k3s1
5fa43eae-hn3.cloud.together.ai   Ready    control-plane,etcd,master   22h   v1.31.4+k3s1

  4. How to deploy a pod from a Docker image
    1. Create a manifest yaml for storage to mount on your container
    2. Apply the manifest: kubectl apply -f pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: shared-rdma

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: scratch-storage-gpu



    3. Create a manifest yaml file with your Docker image and mount the volumes created above, for example a general-purpose test pod running Ubuntu that lets you inspect files on the data volume.
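A minimal sketch of such a manifest is shown below. The pod name test-pod matches the kubectl exec command in the later step, and the claim names shared-pvc and local-pvc come from the PVC manifests above; the image tag and the mount paths /data and /scratch are assumptions you should adapt.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: test
      image: ubuntu:22.04        # assumed image tag; use any image you need
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: shared-volume
          mountPath: /data       # hypothetical mount path for the shared volume
        - name: scratch-volume
          mountPath: /scratch    # hypothetical mount path for local scratch
  volumes:
    - name: shared-volume
      persistentVolumeClaim:
        claimName: shared-pvc
    - name: scratch-volume
      persistentVolumeClaim:
        claimName: local-pvc
```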

    4. Create the pod:

kubectl apply -f manifest.yaml

    5. Get a shell into the pod:

kubectl exec -it test-pod -- bash

  5. Access the Kubernetes Dashboard
    You can access the k8s dashboard by clicking on your cluster’s name, then clicking on the k8s dashboard URL. You will be prompted for a password, which can be obtained as follows:

kubectl --kubeconfig=$HOME/.kube/config get secret admin-user-token -n kubernetes-dashboard -o jsonpath='{.data.token}' | base64 -d
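The secret stores the token base64-encoded, which is why the command pipes the output through base64 -d. A quick illustration with a dummy value (not a real token):

```shell
# Decode a base64 string the same way the command above decodes the token.
# "dG9rZW4tdmFsdWU=" is base64 for the dummy string "token-value".
echo "dG9rZW4tdmFsdWU=" | base64 -d
```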



Using the tcloud CLI

You can also create and manage your GPU clusters within Together’s cloud infrastructure via the tcloud CLI tool. Download it for your platform:

Authenticate with Together Cloud via Google SSO

You can authenticate with Together Cloud using Google Single Sign-On (SSO) via the tcloud CLI. Run the following command:

Callout: You must be part of an approved Google Workspace organization to authenticate.

Open https://www.google.com/device in your browser and enter the verification code to complete authentication.

Create a cluster

Callout: Cluster creation requires a valid payment method to be set up in your account.

You can add a payment method at https://api.together.ai/settings/billing.

tcloud cluster create my-cluster \
    --num-gpus 8 \
    --reservation-duration 1 \
    --instance-type H100-SXM \
    --region us-central-8 \
    --shared-volume-name my-volume \
    --size-tib 1

Deleting a cluster

tcloud cluster delete <CLUSTER_UUID>