Instant Clusters
This document explains how to create an Instant Cluster and how to start training on it with Kubernetes.
V.1.4
Create your Instant Cluster
- Create a Cluster
1. Click on the Cluster size, for example 8xH100
2. Enter a cluster name
3. Choose a cluster type
4. Select a Region
5. Select the required duration for your cluster
6. Create and name your shared volume. The minimum size is 1 TB
7. Optional: Select your NVIDIA driver and CUDA versions
8. Click on Proceed
- Check Status of your Cluster
- Increase your cluster size: click the … in the cluster's row, click Edit Cluster, click “Number of GPUs”, select the desired amount, and click Update
Start training with Kubernetes
- Prerequisites: install kubectl in your environment. For example, on macOS follow https://kubernetes.io/docs/tasks/tools/install-kubectl-macos/
- Get the cluster kubeconfig
To schedule Kubernetes jobs on your cluster, download the kubeconfig file from the Instant Clusters UI page and copy it to your local machine, for example to:
~/.kube/config_k8s_together_instant
Then either point kubectl at it for the current shell session:
export KUBECONFIG=$HOME/.kube/config_k8s_together_instant
or pass it explicitly on each command:
kubectl --kubeconfig=$HOME/.kube/config_k8s_together_instant get nodes
Note: You can also save the file under the default name, ~/.kube/config. If you do, back up your existing config file first.
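The kubeconfig steps above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: the function name install_kubeconfig is hypothetical, and the path you pass in is wherever your browser saved the downloaded file. It copies the file under ~/.kube with the name used in this guide, so an existing default config is left untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a downloaded kubeconfig into ~/.kube under the
# name used in this guide, without touching an existing ~/.kube/config.
install_kubeconfig() {
  local src="$1"
  local dest="$HOME/.kube/config_k8s_together_instant"

  if [ ! -f "$src" ]; then
    echo "kubeconfig not found: $src" >&2
    return 1
  fi

  mkdir -p "$HOME/.kube"
  cp "$src" "$dest"
  chmod 600 "$dest"          # kubeconfigs hold credentials; keep them private
  export KUBECONFIG="$dest"  # applies to the current shell session only
  echo "$dest"
}
```

Call it directly in your shell (not inside a command substitution) so that the exported KUBECONFIG stays set, then run kubectl get nodes as described in the next step.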
- Verify you can connect to your K8s cluster
kubectl get nodes
NAME                             STATUS   ROLES                       AGE   VERSION
5fa43eae-01.cloud.together.ai    Ready    <none>                      21h   v1.31.4+k3s1
5fa43eae-02.cloud.together.ai    Ready    <none>                      21h   v1.31.4+k3s1
5fa43eae-hn1.cloud.together.ai   Ready    control-plane,etcd,master   22h   v1.31.4+k3s1
5fa43eae-hn2.cloud.together.ai   Ready    control-plane,etcd,master   8h    v1.31.4+k3s1
5fa43eae-hn3.cloud.together.ai   Ready    control-plane,etcd,master   22h   v1.31.4+k3s1
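On larger clusters it can be easier to list only the nodes that are not ready than to scan the full table. The sketch below filters output in the format shown above; the function name not_ready_nodes is hypothetical. It treats any STATUS other than exactly "Ready" as a problem, so a cordoned node (for example "Ready,SchedulingDisabled") is also reported.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` style output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}
```

Usage: kubectl get nodes --no-headers | not_ready_nodes. Empty output means every node reported Ready.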
- How to deploy a pod from a Docker image
- Create a manifest file, pvc.yaml, for the storage to mount in your container:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: shared-rdma
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: scratch-storage-gpu
- Apply the manifest:
kubectl apply -f pvc.yaml
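After applying the PVC manifest, you may want to confirm that the claims were accepted before mounting them. The following is a minimal sketch with a hypothetical function name, pvcs_bound, that reads each claim's phase through kubectl's jsonpath output. Some storage classes bind a claim only when the first pod uses it, so a Pending phase at this point is not necessarily an error. Whether that applies to the storage classes above is not stated in this guide.

```shell
#!/usr/bin/env bash
# Succeed only if every named PVC reports phase "Bound".
pvcs_bound() {
  local name phase
  for name in "$@"; do
    phase="$(kubectl get pvc "$name" -o jsonpath='{.status.phase}')"
    if [ "$phase" != "Bound" ]; then
      echo "$name is ${phase:-unknown}" >&2
      return 1
    fi
  done
}
```

Usage: pvcs_bound shared-pvc local-pvc && echo "storage ready"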
a. Create a manifest YAML file (manifest.yaml) with your Docker image, mounting the volumes created above. A general-purpose shell test pod running Ubuntu, for example, lets you inspect files on the data volume.
b. Create the pod by running kubectl apply -f manifest.yaml
c. Get a shell into the pod by running kubectl exec -it test-pod -- bash
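This guide does not include the pod manifest itself, so here is an illustrative sketch of what manifest.yaml could look like. It reuses the claim names shared-pvc and local-pvc from the storage manifest and the pod name test-pod from step c. The ubuntu:22.04 image tag, the /data and /scratch mount paths, and the container name are assumptions chosen for the example, not values from this guide.

```shell
#!/usr/bin/env bash
# Write an illustrative manifest.yaml for a shell test pod.
# Assumptions (not from the original guide): the ubuntu:22.04 image,
# the /data and /scratch mount paths, and the container name.
cat > manifest.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: Never
  containers:
    - name: ubuntu
      image: ubuntu:22.04
      # Keep the container alive so you can exec into it.
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: shared
          mountPath: /data
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-pvc
    - name: scratch
      persistentVolumeClaim:
        claimName: local-pvc
EOF
```

With this file in place, steps b and c apply unchanged. Inside the pod, ls /data would list the contents of the shared volume under the assumed mount path.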
- Access the Kubernetes Dashboard
You can access the Kubernetes dashboard by clicking your cluster’s name and then the k8s dashboard URL. You will be prompted for a password, which you can obtain as follows (adjust the --kubeconfig path if you saved the file under a different name, such as config_k8s_together_instant):
kubectl --kubeconfig=$HOME/.kube/config get secret admin-user-token -n kubernetes-dashboard -o jsonpath='{.data.token}' | base64 -d
Using the tcloud CLI
You can also create and manage your GPU clusters within Together’s cloud infrastructure via the tcloud CLI tool. Download it for your platform:
Authenticate with Together Cloud via Google SSO
You can authenticate with Together Cloud using Google Single Sign-On (SSO) via the tcloud CLI. Run the following command:
Callout: You must be part of an approved Google Workspace organization to authenticate.
Open https://www.google.com/device in your browser and enter the verification code to complete authentication.
Create a cluster
Callout: Cluster creation requires a valid payment method to be set up in your account.
You can add a payment method at https://api.together.ai/settings/billing.
tcloud cluster create my-cluster \
  --num-gpus 8 \
  --reservation-duration 1 \
  --instance-type H100-SXM \
  --region us-central-8 \
  --shared-volume-name my-volume \
  --size-tib 1
Deleting a cluster
tcloud cluster delete <CLUSTER_UUID>