Instant Clusters

This document explains how to create an Instant Cluster and how to start training on it with Kubernetes.

V.1.4

Create your Instant Cluster

  1. Create a Cluster

1. Click the cluster size, for example 8xH100.
2. Enter a cluster name.
3. Choose a cluster type.
4. Select a region.
5. Select the required duration for your cluster.
6. Create and name your shared volume. The minimum size is 1 TB.
7. Optional: Select your NVIDIA driver and CUDA versions.
8. Click Proceed.
  2. Check the status of your cluster
  3. Increase your cluster size: click the … in the cluster's row, click Edit Cluster, click “Number of GPUs”, select the desired amount, and click Update.

Figure: Edit your Cluster Selection

Figure: Update your Cluster

Start training with Kubernetes

  1. Prerequisites: install kubectl in your environment. For example, on macOS follow: https://kubernetes.io/docs/tasks/tools/install-kubectl-macos/

  2. Get the cluster kubeconfig

    1. To schedule Kubernetes jobs on your cluster, download the kubectl context (kubeconfig file) from the Instant Clusters UI page and copy it to your local machine at ~/.kube/config_k8s_together_instant
    2. Either set it for the whole shell session with export KUBECONFIG=$HOME/.kube/config_k8s_together_instant, or pass it per command: kubectl --kubeconfig=$HOME/.kube/config_k8s_together_instant get nodes

    Note: You can also save the file under the default name “config”. If you do, back up your current config file first.

    3. Verify you can connect to your K8s cluster, for example by running kubectl get nodes
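The kubeconfig steps above can be sketched as a short shell session. This is a minimal sketch: it assumes the kubeconfig has already been downloaded from the Instant Clusters UI to the path used in this guide, and it skips the connectivity check when `kubectl` or the file is missing.

```shell
# Path used in this guide for the downloaded kubeconfig.
KUBECONFIG_FILE="$HOME/.kube/config_k8s_together_instant"

# Point kubectl at the Instant Cluster for the rest of this shell session.
export KUBECONFIG="$KUBECONFIG_FILE"

# Verify connectivity by listing the cluster nodes.
if command -v kubectl >/dev/null 2>&1 && [ -f "$KUBECONFIG_FILE" ]; then
  kubectl get nodes || echo "could not reach the cluster"
else
  echo "kubectl or $KUBECONFIG_FILE not found; skipping connectivity check"
fi
```

Setting `KUBECONFIG` only affects the current shell, so your default `~/.kube/config` is left untouched.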

  3. How to deploy a pod from a Docker image

  4. Create a YAML manifest (pvc.yaml) for the storage to mount in your container

  5. Apply the manifest: kubectl apply -f pvc.yaml
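As an illustration of the storage manifest, the sketch below writes a generic PersistentVolumeClaim to pvc.yaml and applies it. The claim name `shared-pvc`, the access mode, and the size are placeholders chosen for this example, not values taken from the Instant Clusters UI; the manifest for your cluster may also need a storage class or volume name that is not shown here.

```shell
# Write a PersistentVolumeClaim manifest to pvc.yaml.
# Name, access mode, and size are placeholders; adjust them to your shared volume.
cat > pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-pvc          # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany         # assumed: the shared volume is mounted by several pods
  resources:
    requests:
      storage: 100Gi        # placeholder size
EOF

# Apply the manifest and show the claim status (needs kubectl and cluster access).
if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f pvc.yaml && kubectl get pvc shared-pvc || echo "kubectl apply failed"
fi
```

A claim that reports `Bound` in the `kubectl get pvc` output is ready to be mounted by a pod.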

    1. Create a YAML manifest (manifest.yaml) with your Docker image that mounts the volumes created above. For example, a general-purpose Ubuntu shell test pod lets you inspect the files on the data volume.

    2. Create the pod by running kubectl apply -f manifest.yaml

    3. Get a shell into the pod by running kubectl exec -it test-pod -- bash
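A pod manifest along these lines might look like the sketch below. The pod name `test-pod` matches the exec command above; the image tag, the mount path `/data`, and the claim name `shared-pvc` are assumptions made for this example, and the claim must already exist in the cluster.

```shell
# Write a general-purpose Ubuntu test pod manifest to manifest.yaml.
cat > manifest.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: Never
  containers:
    - name: ubuntu
      image: ubuntu:22.04             # assumed image tag
      command: ["sleep", "infinity"]  # keep the pod running so you can exec into it
      volumeMounts:
        - name: shared-data
          mountPath: /data            # hypothetical mount point
  volumes:
    - name: shared-data
      persistentVolumeClaim:
        claimName: shared-pvc         # hypothetical name of an existing claim
EOF

# Create the pod (needs kubectl and cluster access).
if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f manifest.yaml || echo "kubectl apply failed"
fi
```

Once the pod is running, `kubectl exec -it test-pod -- bash` opens a shell in which the volume contents appear under the mount path.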

  6. How to start training

  7. How to create storage

    1. How to see how much storage there is:
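One generic way to inspect storage uses standard kubectl commands, sketched below as a small helper. This is not necessarily the procedure the Instant Clusters UI intends; the function name `show_storage`, the default pod name, and the default mount path are hypothetical.

```shell
# Show storage capacity and usage.
# Arguments (both optional, hypothetical defaults): pod name, mount path.
show_storage() {
  pod="${1:-test-pod}"
  mount_path="${2:-/data}"
  # Requested capacity and binding status of each claim in the current namespace.
  kubectl get pvc
  # Capacity of the persistent volumes in the cluster.
  kubectl get pv
  # Used and free space as seen from inside a pod that mounts the volume.
  kubectl exec "$pod" -- df -h "$mount_path"
}

# Usage (requires kubectl and cluster access):
#   show_storage test-pod /data
```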
  8. Access the Kubernetes Dashboard


    You can access the K8s dashboard by clicking your cluster’s name and then the K8s dashboard URL. You will be prompted for a password, which can be obtained as follows:

  9. Operators

    1. MPI
    2. TorchX
  10. Configuring Ingress: how to expose services on the cluster's external IP address

    1. Traefik is installed by default in your Instant Cluster.

      Example: HTTPS traffic to mydomain.com is forwarded to the service ‘myservice’ on port 4000.

    2. Others
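The Traefik example (HTTPS to mydomain.com forwarded to ‘myservice’ on port 4000) could be expressed with a standard Kubernetes Ingress resource, as sketched below. This is an assumption about one workable form, not necessarily the manifest this guide intended: the resource name, the ingress class name `traefik`, and the TLS secret `mydomain-tls` are hypothetical and should be checked against your cluster.

```shell
# Write an Ingress manifest that routes mydomain.com to myservice:4000.
cat > ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myservice-ingress        # hypothetical resource name
spec:
  ingressClassName: traefik      # assumed class name of the bundled Traefik
  tls:
    - hosts:
        - mydomain.com
      secretName: mydomain-tls   # hypothetical secret holding the TLS certificate
  rules:
    - host: mydomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myservice
                port:
                  number: 4000
EOF

# Apply the manifest (needs kubectl and cluster access).
if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f ingress.yaml || echo "kubectl apply failed"
fi
```

Running `kubectl get ingressclass` lists the class names actually available in the cluster.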

  11. Performance testing

    1. NCCL
    2. Others

To run performance tests:

  1. Install the MPI Operator
  2. Apply the nccl-config YAML
  3. Start the MPI job
  4. Look at the NCCL results in the launcher logs (run kubectl get pods to find the launcher ID XXXXX)

kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml

kubectl apply -f nccl-config.yaml

kubectl apply -f mpijob.yaml

kubectl get pods   # note the nccl-test-launcher ID

kubectl logs nccl-test-launcher-XXXX
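Looking up the launcher pod and reading its logs can be combined in a small script. The sketch below assumes the launcher pod name starts with `nccl-test-launcher-`, as in the command above; the helper name `launcher_pod` is hypothetical.

```shell
# Print the name of the first pod whose name starts with nccl-test-launcher-.
# Reads `kubectl get pods` style output (pod name in the first column) on stdin.
launcher_pod() {
  awk '$1 ~ /^nccl-test-launcher-/ { print $1; exit }'
}

# Fetch the NCCL results from the launcher logs (needs kubectl and cluster access).
if command -v kubectl >/dev/null 2>&1; then
  POD="$(kubectl get pods --no-headers 2>/dev/null | launcher_pod)"
  if [ -n "$POD" ]; then
    kubectl logs "$POD"
  else
    echo "no nccl-test-launcher pod found"
  fi
fi
```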

nccl-config.yaml

mpijob.yaml

  1. FIO test (takes about 10 minutes)

fio_test.yaml

  1. kubectl apply -f fio_test.yaml
  2. kubectl logs fio-storage-test-pod

Expected results:

  1. iPerf (to be updated once more stable iPerf servers are available)

Where to get the output?

Using the tcloud CLI

You can also create and manage your GPU clusters within Together’s cloud infrastructure via the tcloud CLI tool. Download it for your platform:

Authenticate with Together Cloud via Google SSO

You can authenticate with Together Cloud using Google Single Sign-On (SSO) via the tcloud CLI. Run the following command:

Callout: You must be part of an approved Google Workspace organization to authenticate.

Open https://www.google.com/device in your browser and enter the verification code to complete authentication.

Create a cluster

Callout: Cluster creation requires a valid payment method to be set up in your account.

You can add a payment method at https://api.together.ai/settings/billing.

Delete a cluster