Together Compute is a complete cloud and software solution that enables you to pre-train your own large models quickly, reliably, and efficiently. We also offer custom fine-tuning. Users can reserve segments of the Together Cloud clusters under short-term leases to train their own models.
Built on Together Cloud, Compute consists of state-of-the-art clusters of H100 and A100 GPUs, connected over fast Ethernet or InfiniBand networks and optimized for distributed training, with options to use tools like Slurm. Users submit computing jobs to the Slurm head node, where the scheduler assigns tasks to compute nodes based on resource availability.
If you already have a cluster, read this article to learn more about logging into a Compute cluster head node to submit jobs, monitor their progress, and retrieve results.
To connect to the cluster, follow these steps:
- Open a terminal or SSH client on your local machine.
- Use the `ssh` command to connect to the cluster's remote head node, replacing `<username>` with your username and `<hostname>` with the hostname of the head node you were assigned (see the example after this list).
- Once connected, you should be in the home directory of the head node on the cluster.
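A minimal connection example, assuming your SSH key has already been configured for the cluster:

```
# Connect to the cluster head node; replace both placeholders
# with the credentials provided for your reservation.
ssh <username>@<hostname>
```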
You can verify the number of nodes in the cluster using the `sinfo` command on the head node:
```
username@<hostname>:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up   infinite      8   idle <hostname>[1-8]
```
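If you want per-node detail rather than the partition summary, `sinfo` can also list each node individually. A quick sketch using standard flags (see `man sinfo` on your cluster for the full set):

```
# -N lists one node per line; -l adds CPU, memory, and state detail
sinfo -N -l
```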
The next step is to write and save a batch script that you will submit to the cluster with `sbatch`. This sample script pulls a Docker image and then runs it on the cluster. It is saved as `myscript`:
```
#!/bin/bash
docker pull dockeruser/nanogpt
docker run --rm -t --gpus all dockeruser/nanogpt
```
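You can also embed scheduler options in the script itself with `#SBATCH` directives instead of passing them on the command line. A minimal sketch, assuming a single node and a hypothetical one-hour time limit (the exact directives available depend on your Slurm version and cluster configuration):

```
#!/bin/bash
#SBATCH --job-name=nanogpt        # name shown in squeue
#SBATCH --nodes=1                 # equivalent to sbatch -N1
#SBATCH --output=nanogpt-%j.out   # %j expands to the job ID
#SBATCH --time=01:00:00           # hypothetical one-hour time limit

docker pull dockeruser/nanogpt
docker run --rm -t --gpus all dockeruser/nanogpt
```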
Now we are ready to submit our job to Forge using `sbatch`. This command submits the job script to the scheduler for execution:
```
username@<hostname>:~$ sbatch -N1 myscript
```
Once the job is submitted, Slurm assigns it a job ID and prints a confirmation message. Make a note of the job ID for future reference.
```
username@<hostname>:~$ sbatch -N1 myscript
Submitted batch job 1
```
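If you want to capture the job ID in a script rather than reading it off the confirmation message, `sbatch --parsable` prints just the ID. A small sketch (the flag is standard, but check `man sbatch` on your cluster):

```
# --parsable suppresses the "Submitted batch job" text and prints
# only the job ID (plus the cluster name, where applicable)
JOBID=$(sbatch --parsable -N1 myscript)
echo "Submitted job ${JOBID}"
```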
To monitor the status of your submitted job, use the `squeue` command as follows:
- `squeue`: List all currently running and pending jobs in the cluster.
- `squeue -u <username>`: List only the jobs submitted by a specific user.
- `squeue -j <job_id>`: Display detailed information about a specific job using its job ID.
These commands will provide information about the job's status, such as its state (running, pending, completed), allocated resources, and job dependencies. For example:
```
username@<hostname>:~$ squeue
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
    1     batch myscript username  R  0:08      1 <hostname>
```
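For a long-running job, it can be convenient to poll `squeue` automatically rather than re-running it by hand. One simple approach uses the standard `watch` utility (assuming it is installed on the head node):

```
# Re-run squeue every 10 seconds, filtered to your own jobs
watch -n 10 squeue -u <username>
```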
The Slurm `squeue` documentation provides more details on how to use this command.
Additionally, you can use the `scontrol` command to query and control job and job step properties:
```
username@<hostname>:~$ scontrol show job 1
JobId=1 JobName=myscript
   UserId=<username> GroupId=<groupname> MCS_label=N/A
   Priority=4294901747 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:56 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2023-06-01T20:31:48 EligibleTime=2023-06-01T20:31:48
   AccrueTime=2023-06-01T20:31:48
   StartTime=2023-06-01T20:31:48 EndTime=2023-06-01T20:32:48 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-06-01T20:31:48 Scheduler=Main
   Partition=batch AllocNode:Sid=<hostname>:88275
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=<hostname>
   BatchHost=<hostname>
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/<username>/myscript
   WorkDir=/home/<username>
   StdErr=/home/<username>/slurm-1.out
   StdIn=/dev/null
   StdOut=/home/<username>/slurm-1.out
   Power=
```
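Beyond inspection, `scontrol` can also modify a pending or running job. A couple of illustrative commands, assuming job ID 1 from the example above (whether a given update is permitted depends on your cluster's policy):

```
# Hold a pending job so the scheduler will not start it, then release it
scontrol hold 1
scontrol release 1

# Update a job property, e.g. its time limit
scontrol update JobId=1 TimeLimit=00:30:00
```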
Read the Slurm `scontrol` documentation for more details.