Cluster setup
Large-scale pre-training and custom fine-tuning service.
Together GPU Clusters provides a complete cloud and software solution that enables you to pre-train your own large models quickly, reliably, and efficiently. We also offer custom fine-tuning. Users can reserve segments of the Together Cloud under short-term leases to train their own models.
Built on Together Cloud, Together GPU Clusters consist of state-of-the-art clusters with H100 and A100 GPUs, connected over fast Ethernet or InfiniBand networks and optimized for distributed training, with options to use tools like Slurm. Users can submit their computing jobs to the Slurm head node, where the scheduler assigns tasks to available compute nodes based on resource availability.
Request access to a cluster here.
If you already have a cluster, read this article to learn more about logging into a cluster head node to submit jobs, monitor progress, and get the results.
Connecting to your Together GPU Cluster
To connect to the cluster, follow these steps:
- Open a terminal or SSH client on your local machine.
- Use the SSH command to connect to the cluster's remote head node:
ssh <username>@<hostname>
Replace <username> with your username and <hostname> with the hostname of the head node you were assigned. (You can also set up an SSH config alias, as sketched after these steps.)
- Once connected, you should be in the home directory of the head node on the cluster.
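Optionally, you can add a host alias to your local SSH configuration so you don't have to type the full hostname every time. This is a minimal sketch; the alias name and identity file path are placeholders to adapt to your own setup.
# Append a host alias to your local SSH config (values are placeholders).
cat >> ~/.ssh/config <<'EOF'
Host together-head
    HostName <hostname>
    User <username>
    IdentityFile ~/.ssh/id_ed25519
EOF
# Afterwards you can connect with: ssh together-head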
You can verify the number of nodes in the cluster using the sinfo command on the head node:
username@<hostname>:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 8 idle <hostname>[1-8]
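If you also want to confirm that GPUs are visible to Slurm, the following commands are one way to check. This assumes GPU GRES is configured on your cluster; flag support can vary by Slurm version.
# Show the generic resources (e.g. GPUs) each node advertises.
sinfo -N -o "%N %G"

# Launch nvidia-smi on one node to list the GPUs it can see.
srun -N1 --gpus-per-node=1 nvidia-smi -L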
Submitting a job to the cluster
The next step is to write and save an sbatch script to submit your job to the cluster. This sample script pulls a Docker image and then runs it on the cluster. It is saved as myscript:
#!/bin/bash
docker pull dockeruser/nanogpt
docker run --rm -t --gpus all dockeruser/nanogpt
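If you prefer to encode the resource request in the script itself rather than on the sbatch command line, you can add #SBATCH directives at the top. The sketch below uses placeholder values (job name, GPU count, time limit) that you should adapt to your cluster; --gpus-per-node requires GPU GRES to be configured.
#!/bin/bash
#SBATCH --job-name=nanogpt
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8        # placeholder: GPUs available per node
#SBATCH --time=01:00:00          # placeholder wall-clock limit
#SBATCH --output=%x-%j.out       # %x = job name, %j = job ID

# Pull the container image and run it with access to all allocated GPUs.
docker pull dockeruser/nanogpt
docker run --rm -t --gpus all dockeruser/nanogpt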
Now we are ready to submit our job to the cluster using sbatch. This command submits the job script to the scheduler for execution:
username@<hostname>:~$ sbatch -N1 myscript
Once the job is submitted, Slurm will assign a job ID to the job and provide a confirmation message. Make a note of the job ID for future reference.
username@<hostname>:~$ sbatch -N1 myscript
Submitted batch job 1
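If you script your submissions, recent Slurm versions can print just the job ID via sbatch --parsable, which makes it easy to capture for later queries. A minimal sketch:
# Submit the job and keep the numeric job ID for later use.
JOB_ID=$(sbatch --parsable -N1 myscript)
echo "Submitted job ${JOB_ID}"
squeue -j "${JOB_ID}"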
Monitoring job status
To monitor the status of your submitted job, you can use the squeue command as follows:
squeue: List all currently running and pending jobs in the cluster.
squeue -u <username>: List only the jobs submitted by a specific user.
squeue -j <job_id>: Display detailed information about a specific job using its job ID.
These commands will provide information about the job's status, such as its state (running, pending, completed), allocated resources, and job dependencies. For example:
username@<hostname>:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 batch myscript username R 0:08 1 <hostname>
The Slurm squeue documentation provides more details on how to use this command.
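Two common ways to follow a running job are to refresh the queue view periodically and to tail the job's output file. By default the output is written to slurm-<job_id>.out in the submission directory, unless your script sets --output.
# Refresh your view of the queue every 10 seconds (Ctrl+C to stop).
watch -n 10 squeue -u "$USER"

# Follow the job's output as it is written.
tail -f slurm-1.out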
Additionally, you can use the scontrol command to query and control job and job step properties:
username@<hostname>:~$ scontrol show job 1
JobId=1 JobName=myscript
UserId=<username> GroupId=<groupname> MCS_label=N/A
Priority=4294901747 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:56 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2023-06-01T20:31:48 EligibleTime=2023-06-01T20:31:48
AccrueTime=2023-06-01T20:31:48
StartTime=2023-06-01T20:31:48 EndTime=2023-06-01T20:32:48 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-01T20:31:48 Scheduler=Main
Partition=batch AllocNode:Sid=<hostname>:88275
ReqNodeList=(null) ExcNodeList=(null)
NodeList=<hostname>
BatchHost=<hostname>
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/<username>/myscript
WorkDir=/home/<username>
StdErr=/home/<username>/slurm-1.out
StdIn=/dev/null
StdOut=/home/<username>/slurm-1.out
Power=
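Beyond inspecting jobs, scontrol can also modify them, and scancel (a separate Slurm command) cancels them. A few common operations, shown here with the job ID from the example above:
# Hold a pending job so the scheduler will not start it.
scontrol hold 1

# Release the hold so the job can be scheduled again.
scontrol release 1

# Lower the job's time limit (raising it usually requires admin rights).
scontrol update JobId=1 TimeLimit=00:30:00

# Cancel the job entirely.
scancel 1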
Read the Slurm scontrol documentation for more details.