Instant Clusters
This document explains how to create an Instant Cluster and how to start training on it with Kubernetes.
V.1.5
Create your Instant Cluster
- Create a Cluster
1. Click on the Cluster size, for example 8xH100
2. Enter a cluster name
3. Choose a cluster type
4. Select a Region
5. Select the required duration for your cluster
6. Create and name your shared volume. The minimum size is 1TB
7. Optional: Select your NVIDIA driver and CUDA versions
8. Click on Proceed
- Check Status of your Cluster
- Increase your cluster size: click on the … in the cluster line, click on Edit Cluster, select the desired amount under “Number of GPUs”, and click Update
Start training with Kubernetes
- Prerequisites: install kubectl in your environment. For example, on macOS follow: https://kubernetes.io/docs/tasks/tools/install-kubectl-macos/
- Get the cluster kubeconfig
To schedule Kubernetes jobs on your cluster, download the kubeconfig file from the Instant Clusters UI page and save it on your local machine, for example as:
~/.kube/config_k8s_together_instant
Then either point kubectl at it for the whole shell session:
export KUBECONFIG=$HOME/.kube/config_k8s_together_instant
or pass it on each command:
kubectl --kubeconfig=$HOME/.kube/config_k8s_together_instant get nodes
Note: You can also save the file under the default name, ~/.kube/config. If you do, back up your current config file first.
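The backup step in the note above can be scripted. The following is a minimal sketch, assuming a POSIX-style shell such as bash; `install_kubeconfig` is a hypothetical helper name (not part of kubectl), and the download path in the usage comment is only an example.

```shell
# Back up an existing kubeconfig, then install a newly downloaded one.
# install_kubeconfig is a hypothetical helper, not a kubectl command.
install_kubeconfig() {
  local src="$1"                         # downloaded kubeconfig file
  local dest="${2:-$HOME/.kube/config}"  # default location read by kubectl
  mkdir -p "$(dirname "$dest")"
  if [ -f "$dest" ]; then
    # Keep the previous config beside the new one, with a timestamp suffix.
    cp -p "$dest" "$dest.bak.$(date +%Y%m%d%H%M%S)"
  fi
  cp "$src" "$dest"
  chmod 600 "$dest"                      # the file contains credentials
}

# Usage (example path, adjust to where you saved the download):
# install_kubeconfig "$HOME/Downloads/config_k8s_together_instant"
```

Keeping the cluster config in its own file and selecting it with KUBECONFIG, as described above, avoids touching the default file at all.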
- Verify you can connect to your K8s cluster
kubectl get nodes
NAME                             STATUS   ROLES                       AGE   VERSION
5fa43eae-01.cloud.together.ai    Ready    <none>                      21h   v1.31.4+k3s1
5fa43eae-02.cloud.together.ai    Ready    <none>                      21h   v1.31.4+k3s1
5fa43eae-hn1.cloud.together.ai   Ready    control-plane,etcd,master   22h   v1.31.4+k3s1
5fa43eae-hn2.cloud.together.ai   Ready    control-plane,etcd,master   8h    v1.31.4+k3s1
5fa43eae-hn3.cloud.together.ai   Ready    control-plane,etcd,master   22h   v1.31.4+k3s1
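On larger clusters it can be easier to filter the node list than to read it. The sketch below assumes the column layout shown above (name first, status second); `not_ready_nodes` is a hypothetical helper name. It treats any status other than exactly "Ready" as a problem, which also flags cordoned nodes (for example "Ready,SchedulingDisabled").

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk 'NF >= 2 && $2 != "Ready" { print $1 }'
}

# Usage against a live cluster:
# kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reported Ready.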
- How to deploy a pod from a Docker image
- Create a manifest YAML file, pvc.yaml, that defines the storage to mount in your container. Its contents are listed after the next step.
- Apply the manifest:
kubectl apply -f pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: shared-rdma
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: scratch-storage-gpu
iii. Create a manifest YAML file that references your Docker image and mounts the volumes created above. The template that follows defines a general-purpose Ubuntu test pod that sleeps indefinitely, so you can open a shell in it and, for example, inspect files on the data volume.
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: test-pod
      image: [registry/]repository/ubuntu[:tag]
      command: ["/bin/sh", "-c"]
      args: ["sleep infinity"]
      volumeMounts:
        - name: shared-pvc
          mountPath: /<path-for-shared>
        - name: local-pvc
          mountPath: /<path-for-local>
  volumes:
    - name: shared-pvc
      persistentVolumeClaim:
        claimName: shared-pvc
    - name: local-pvc
      persistentVolumeClaim:
        claimName: local-pvc
A concrete example with the placeholders filled in, saved as manifest.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: Never
  containers:
    - name: ubuntu
      image: debian:stable-slim
      command: ["/bin/sh", "-c", "sleep infinity"]
      volumeMounts:
        - name: shared-pvc
          mountPath: /mnt/shared
        - name: local-pvc
          mountPath: /mnt/local
  volumes:
    - name: shared-pvc
      persistentVolumeClaim:
        claimName: shared-pvc
    - name: local-pvc
      persistentVolumeClaim:
        claimName: local-pvc
b. Create the pod by running kubectl apply -f manifest.yaml
c. Get a shell into the pod by running kubectl exec -it test-pod -- bash
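Before relying on the pod for training data, it can help to confirm that the claims and mounts from the steps above are in place. The sketch below is one way to do that, assuming the resource names and mount paths from the concrete example (shared-pvc, local-pvc, test-pod, /mnt/shared, /mnt/local). The function names are hypothetical helpers, and the 120s timeout is an arbitrary choice. Depending on how a storage class is configured, a claim may remain Pending until a pod that uses it is scheduled, so a Pending phase before the pod exists is not necessarily an error.

```shell
# Print the phase (Pending, Bound, or Lost) of one PersistentVolumeClaim.
pvc_phase() {
  kubectl get pvc "$1" -o jsonpath='{.status.phase}'
}

# Succeed only when every named claim reports Bound.
all_pvcs_bound() {
  local name
  for name in "$@"; do
    [ "$(pvc_phase "$name")" = "Bound" ] || return 1
  done
}

# Wait for a pod to become Ready, then show disk usage for the given
# mount paths as seen from inside the pod.
check_pod_mounts() {
  local pod="$1"
  shift
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=120s || return 1
  kubectl exec "$pod" -- df -h "$@"
}

# Usage against a live cluster:
# all_pvcs_bound shared-pvc local-pvc && echo "storage ready"
# check_pod_mounts test-pod /mnt/shared /mnt/local
```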
- How to access the Kubernetes Dashboard
You can access the Kubernetes dashboard by clicking on your cluster’s name and then on the k8s dashboard URL. You will be prompted for a password, which you can obtain as follows:
kubectl --kubeconfig=$HOME/.kube/config get secret admin-user-token -n kubernetes-dashboard -o jsonpath='{.data.token}' | base64 -d
Using the tcloud CLI
You can also create and manage your GPU clusters within Together’s cloud infrastructure via the tcloud CLI tool. Download it for your platform:
Authenticate with Together Cloud via Google SSO
You can authenticate with Together Cloud using Google Single Sign-On (SSO) via the tcloud CLI. Run the following command:
tcloud sso login
Your verification code is: ABC-DEFG-HIJ
Opening browser to https://www.google.com/device
Waiting for device authorization...
Open https://www.google.com/device in your browser and enter the verification code to complete authentication. Note: You must be part of an approved Google Workspace organization to authenticate.
Create a cluster
Note: Cluster creation requires a valid payment method to be set up in your account. You can add a payment method at https://api.together.ai/settings/billing.
tcloud cluster create my-cluster \
  --num-gpus 8 \
  --reservation-duration 1 \
  --instance-type H100-SXM \
  --region us-central-8 \
  --shared-volume-name my-volume \
  --size-tib 1
Deleting a cluster
tcloud cluster delete <CLUSTER_UUID>