Cluster setup

Large-scale pre-training and custom fine-tuning service.

Together GPU Clusters provides a complete cloud and software solution that enables you to pre-train your own large models quickly, reliably, and efficiently. We also offer custom fine-tuning. Users can reserve segments of the Together Cloud under short-term leases to train their own models.

Built on Together Cloud, Together GPU Clusters are composed of state-of-the-art clusters with H100 and A100 GPUs, connected over high-speed Ethernet or InfiniBand networks and optimized for distributed training, with options to use tools like Slurm. Users submit their computing jobs to the Slurm head node, where the scheduler assigns tasks to compute nodes based on resource availability.

Request access to a cluster here.

If you already have a cluster, read this article to learn more about logging into a cluster head node to submit jobs, monitor progress, and get the results.

Connecting to your Together GPU Cluster

To connect to the cluster, follow these steps:

  1. Open a terminal or SSH client on your local machine.
  2. Use the SSH command to connect to the cluster's remote head node:
ssh <username>@<hostname>

Replace <username> with your username and <hostname> with the hostname of the head node you were assigned. An optional SSH config shortcut is sketched after these steps.

  3. Once connected, you should be in the home directory of the head node on the cluster.
    You can verify the number of nodes in the cluster using the sinfo command on the head node:
username@<hostname>:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up   infinite      8   idle <hostname>[1-8]
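If you connect often, you can optionally add an entry to the ~/.ssh/config file on your local machine so you don't have to retype the username and hostname each time. This is a minimal sketch; the together-cluster alias and the key path are assumptions, not values assigned by Together:

# ~/.ssh/config on your local machine (hypothetical alias and key path)
Host together-cluster
    HostName <hostname>
    User <username>
    IdentityFile ~/.ssh/id_ed25519

With this entry in place, ssh together-cluster opens the same session as the full ssh command above.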

Submitting a job to the cluster

The next step is to write and save an sbatch script to submit your job to the cluster. This sample script pulls a Docker image and then runs it on the cluster. It is saved as myscript:

#!/bin/bash

# Pull the container image from the registry
docker pull dockeruser/nanogpt

# Run the container with access to all GPUs on the node
docker run --rm -t --gpus all dockeruser/nanogpt
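The script above works as-is because the resource request is passed on the sbatch command line below, but you can also record it with #SBATCH directives at the top of the script. The following is a minimal sketch; the job name, output file, and GPU count are assumptions, and the exact GRES syntax depends on how your cluster is configured:

#!/bin/bash
#SBATCH --job-name=nanogpt        # name shown in squeue (assumed)
#SBATCH --output=slurm-%j.out     # stdout/stderr file; %j expands to the job ID
#SBATCH --nodes=1                 # request a single node
#SBATCH --gres=gpu:8              # GPUs per node; adjust to your cluster's GRES setup (assumed)

docker pull dockeruser/nanogpt
docker run --rm -t --gpus all dockeruser/nanogpt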

Now we are ready to submit our job to the cluster using sbatch. This command submits the job script to the scheduler for execution:

username@<hostname>:~$ sbatch -N1 myscript
Submitted batch job 1

Once the job is submitted, Slurm assigns a job ID to the job and prints a confirmation message. Make a note of the job ID for future reference.

Monitoring job status

To monitor the status of your submitted job, you can use the squeue command as follows:

squeue: List all currently running and pending jobs in the cluster.

squeue -u <username>: List only the jobs submitted by a specific user.

squeue -j <job_id>: Display detailed information about a specific job using its job ID.

These commands will provide information about the job's status, such as its state (running, pending, completed), allocated resources, and job dependencies. For example:

username@<hostname>:~$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1     batch myscript username  R       0:08      1 <hostname>
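While the job runs, its standard output and error are written to a file in the directory you submitted from; by default sbatch names it slurm-<job_id>.out. You can follow it live, for example:

username@<hostname>:~$ tail -f slurm-1.out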

See the Slurm squeue documentation for more details on how to use this command.

Additionally, you can use the scontrol command to query and control job and job step properties.

username@<hostname>:~$ scontrol show job 1
JobId=1 JobName=myscript
   UserId=<username> GroupId=<groupname> MCS_label=N/A
   Priority=4294901747 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:56 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2023-06-01T20:31:48 EligibleTime=2023-06-01T20:31:48
   AccrueTime=2023-06-01T20:31:48
   StartTime=2023-06-01T20:31:48 EndTime=2023-06-01T20:32:48 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-01T20:31:48 Scheduler=Main
   Partition=batch AllocNode:Sid=<hostname>:88275
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=<hostname>
   BatchHost=<hostname>
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/<username>/myscript
   WorkDir=/home/<username>
   StdErr=/home/<username>/slurm-1.out
   StdIn=/dev/null
   StdOut=/home/<username>/slurm-1.out
   Power=

Read the Slurm scontrol documentation for more details.
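For example, scontrol can hold a pending job or release it again, and scancel stops a job you no longer need. Using the job ID from the example above:

username@<hostname>:~$ scontrol hold 1      # prevent a pending job from starting
username@<hostname>:~$ scontrol release 1   # allow it to be scheduled again
username@<hostname>:~$ scancel 1            # cancel the job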