Startup scripts let you customize Slurm worker nodes, login nodes, and the controller by running shell scripts at specific lifecycle events. Use them to install packages, prepare job environments, clean up after jobs, and append custom Slurm configuration.
This feature is available on Slurm Slinky v1.0 clusters only. It works for both new cluster creation and editing existing v1.0 clusters.
Script types at a glance
Scripts fall into three categories based on when they run.
Node init scripts run once when a node starts up:
- Worker init script: Runs on each Slurm worker at boot. Install system packages, configure drivers, or pull container images.
- Login init script: Runs on the login node at startup. Install CLI tools and packages for interactive SSH sessions.
Job lifecycle scripts run before or after each job:
- Worker prolog: Runs on each worker node before a job starts. Set up directories, load datasets into local scratch, or export environment variables.
- Worker epilog: Runs on each worker node after a job ends. Clean up scratch files, flush logs, or reset node state.
- Controller prolog: Runs on the Slurm controller (slurmctld) at job allocation. Validate job parameters, send notifications, or log job metadata centrally.
- Controller epilog: Runs on the Slurm controller (slurmctld) at job completion. Log results, trigger downstream pipelines, or send completion notifications.
Extra configuration is appended to the Slurm configuration:
- Extra slurm.conf: Custom lines appended to slurm.conf on all nodes. Use this for scheduler tuning, prolog flags, or partition overrides.
Node init scripts
Worker init script
Runs on each Slurm worker node at boot, before the node accepts jobs. Use it to install packages or configure the environment that all jobs on this node depend on.
Common use cases:
- Install system packages (apt-get install -y sox ffmpeg).
- Install the AWS CLI or other data-transfer tools.
- Pull container images or download shared assets.
- Configure environment variables that apply to every job.
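
The package-install use case can be sketched as follows. This is a minimal illustration, not a prescribed layout: it assumes a Debian/Ubuntu worker image (as the apt-get example above suggests), and the `retry` helper, the `install_packages` function, and the `RETRY_DELAY` variable are hypothetical names introduced here. The block only defines functions, so the call on the last line is shown as a comment.

```shell
#!/bin/bash
# Worker init sketch: retry flaky install steps and stop at the first real failure.
set -e

# retry <attempts> <command...>: rerun the command until it succeeds or attempts run out.
# RETRY_DELAY (hypothetical) is the pause between attempts, in seconds.
retry() {
  local attempts="$1"; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "init: giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}

# install_packages <package...>: refresh the package index, then install without prompts.
install_packages() {
  export DEBIAN_FRONTEND=noninteractive
  retry 3 apt-get update
  retry 3 apt-get install -y "$@"
}

# As the last line of the real init script:
# install_packages sox ffmpeg
```

With `set -e` in place, a step that still fails after its retries stops the script instead of leaving the node half-configured.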
Login init script
Runs on the login node at startup. The login node is where users SSH in to submit jobs and inspect results, so this script installs tools for interactive use.
Common use cases:
- Install CLI utilities for data exploration.
- Set up shared Python environments.
- Configure shell defaults for all users.
Job lifecycle scripts
Worker prolog
Runs on each allocated worker node before the job starts. The script runs as root and executes before any user processes launch.
Common use cases:
- Create per-job scratch directories on local NVMe.
- Stage input data from shared storage to local disk.
- Export environment variables for the job.
- Verify GPU health before the job runs.
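
As an illustration of the scratch-directory use case, a worker prolog might look like the sketch below. `SCRATCH_ROOT` and the `job-<id>` directory layout are hypothetical and should point at the node's local NVMe mount; the sketch relies on Slurm setting `SLURM_JOB_ID` and `SLURM_JOB_USER` for the prolog, and does nothing when they are absent.

```shell
#!/bin/bash
# Worker prolog sketch: create a per-job scratch directory owned by the job's user.
# SCRATCH_ROOT is a hypothetical path; point it at the node's local NVMe mount.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp/slurm-scratch}"

# Print the scratch path for the current job.
job_scratch_dir() {
  echo "${SCRATCH_ROOT}/job-${SLURM_JOB_ID}"
}

# Create the directory with owner-only permissions and hand it to the job's user.
create_job_scratch() {
  local dir
  dir="$(job_scratch_dir)"
  mkdir -p "$dir" && chmod 700 "$dir" && chown "${SLURM_JOB_USER}" "$dir"
}

# Only act when Slurm has provided the job context.
if [ -n "${SLURM_JOB_ID:-}" ] && [ -n "${SLURM_JOB_USER:-}" ]; then
  # A non-zero exit from a worker prolog drains the node, so fail only
  # when the directory really cannot be prepared.
  create_job_scratch || exit 1
fi
```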
By default, the worker prolog runs when the first job step starts on the node, not at allocation time. To run it immediately at allocation, add PrologFlags=Alloc to the Extra slurm.conf field.
Worker epilog
Runs on each allocated worker node after the job ends. The script runs as root and executes after all user processes have terminated.
Common use cases:
- Remove per-job scratch directories.
- Flush job logs to shared storage.
- Reset GPU state or clear shared memory.
- Kill orphaned processes.
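
A scratch cleanup step could be sketched like this, assuming per-job directories follow a hypothetical `SCRATCH_ROOT/job-<id>` layout. Because the epilog runs as root, the sketch refuses any job ID that is not purely numeric before it calls `rm -rf`.

```shell
#!/bin/bash
# Worker epilog sketch: remove the scratch directory that belongs to the finished job.
# SCRATCH_ROOT and the job-<id> layout are hypothetical; match them to your prolog.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp/slurm-scratch}"

# remove_job_scratch <job_id>: delete only that job's directory.
remove_job_scratch() {
  local job_id="$1"
  # Refuse empty or non-numeric IDs so a bad value can never widen the rm.
  case "$job_id" in
    ''|*[!0-9]*)
      echo "epilog: refusing to clean up for job ID '$job_id'" >&2
      return 1
      ;;
  esac
  rm -rf -- "${SCRATCH_ROOT}/job-${job_id}"
}

# Only act when Slurm has provided the job ID.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  remove_job_scratch "$SLURM_JOB_ID"
fi
```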
Controller prolog
Runs on the Slurm controller (slurmctld) at job allocation, before the job reaches the worker nodes. Use this for centralized setup that does not need to run on every node.
Common use cases:
- Log job metadata to a central database.
- Send a Slack or webhook notification when a job starts.
- Validate job parameters before workers are assigned.
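
Central logging of job metadata might be sketched as below. `JOB_LOG` is a hypothetical path that must be writable by the controller, and the availability of `SLURM_JOB_PARTITION` and `SLURM_JOB_NODELIST` in the controller prolog is an assumption to verify on your cluster; missing values fall back to "unknown".

```shell
#!/bin/bash
# Controller prolog sketch: append one line of job metadata to a central log.
# JOB_LOG is a hypothetical path; choose one the controller can write to.
JOB_LOG="${JOB_LOG:-/var/log/slurm/job-events.log}"

# log_job_event <event>: write a timestamped key=value line for the current job.
log_job_event() {
  local event="$1"
  printf '%s event=%s job=%s user=%s partition=%s nodes=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$event" \
    "${SLURM_JOB_ID:-unknown}" "${SLURM_JOB_USER:-unknown}" \
    "${SLURM_JOB_PARTITION:-unknown}" "${SLURM_JOB_NODELIST:-unknown}" >> "$JOB_LOG"
}

# Only act when Slurm has provided the job ID.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  # Logging is best effort here: a failed write should not fail the prolog.
  log_job_event start || true
fi
```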
Controller epilog
Runs on the Slurm controller (slurmctld) at job completion. Use this for centralized teardown or post-job automation.
Common use cases:
- Log job completion and exit status.
- Trigger a downstream pipeline (fine-tuning, evaluation, deployment).
- Send a completion notification.
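
Logging the exit status could look like the following sketch. It assumes the controller epilog receives `SLURM_JOB_EXIT_CODE` as a wait()-style status, as Slurm's prolog and epilog guide describes; confirm that on your cluster before relying on it. `JOB_LOG` and `decode_wait_status` are hypothetical names.

```shell
#!/bin/bash
# Controller epilog sketch: record the job's exit code and terminating signal.
# Assumes SLURM_JOB_EXIT_CODE holds a wait()-style status; JOB_LOG is a hypothetical path.
JOB_LOG="${JOB_LOG:-/var/log/slurm/job-events.log}"

# decode_wait_status <status>: print "<exit_code> <signal>".
# The exit code lives in the high byte, the signal number in the low 7 bits.
decode_wait_status() {
  local status="$1"
  echo "$(( (status >> 8) & 255 )) $(( status & 127 ))"
}

# Only act when Slurm has provided the job ID.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  decoded="$(decode_wait_status "${SLURM_JOB_EXIT_CODE:-0}")"
  echo "event=end job=${SLURM_JOB_ID} exit_code=${decoded% *} signal=${decoded#* }" >> "$JOB_LOG"
fi
```

For example, a raw status of 256 decodes to exit code 1 with no signal, while a raw status of 9 decodes to exit code 0 terminated by signal 9.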
Extra slurm.conf
Append custom Slurm configuration to slurm.conf on all nodes. Lines entered here are added verbatim after the default configuration.
Common use cases:
- Set PrologFlags=Alloc to run the worker prolog at allocation time instead of first job step.
- Tune scheduler parameters.
- Configure accounting or job completion plugins.
Changes to Extra slurm.conf take effect after the cluster applies the updated configuration. Running jobs are not affected until they complete or the nodes restart.
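
As a minimal illustration, the Extra slurm.conf field could contain just the setting this page describes. The comment line is optional and is included here only for readability.

```text
# Run the worker prolog at allocation time instead of at the first job step.
PrologFlags=Alloc
```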
Configure startup scripts
- Open the Together Cloud console and navigate to your cluster.
- In the cluster details sidebar, select Specs and configuration.
- Select Slurm configuration.
- Select Edit.
- Enter your script in the corresponding text box (Worker Prolog, Worker Epilog, Login Init Script, etc.).
- Select Save.
Start each script with a shebang line (#!/bin/bash). The cluster applies the updated scripts to the relevant nodes automatically.
For prolog and epilog updates, the underlying ConfigMaps update immediately. However, existing worker nodes may cache the previous scripts via Slurm’s configless mechanism. New jobs on those workers continue using the old scripts until the workers restart and pick up the updated versions.
Failure handling
Script failures have different consequences depending on which script fails.
Worker prolog failure (non-zero exit):
- The node is set to DRAIN state.
- The job is requeued (batch jobs only). Interactive jobs (salloc, srun) are cancelled.
Worker epilog failure (non-zero exit):
- The node is set to DRAIN state.
- A drained node does not accept new jobs until an admin resumes it.
Controller prolog failure (non-zero exit):
- The job is requeued (batch jobs) or cancelled (interactive jobs).
- The node is not affected.
Controller epilog failure (non-zero exit):
- The failure is logged but has no other effect on the job or node.
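
Since a non-zero exit from a worker prolog drains the node, one possible pattern is to separate checks that should drain the node from steps that should only log a warning. The sketch below is one way to do that, not a required structure: `critical` and `best_effort` are hypothetical helper names, the log directory is a placeholder, and `nvidia-smi -L` is used on the assumption that the workers have NVIDIA GPUs.

```shell
#!/bin/bash
# Worker prolog sketch: only checks wrapped in `critical` can drain the node.

# critical <label> <command...>: exit non-zero (draining the node) if the command fails.
critical() {
  local label="$1"; shift
  if ! "$@" >/dev/null 2>&1; then
    echo "prolog: critical check failed: $label" >&2
    exit 1
  fi
}

# best_effort <label> <command...>: log a failure but keep going.
best_effort() {
  local label="$1"; shift
  if ! "$@" >/dev/null 2>&1; then
    echo "prolog: non-critical step failed: $label" >&2
  fi
  return 0
}

# Only act when Slurm has provided the job ID.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  critical "GPUs visible to the driver" nvidia-smi -L
  # /var/log/job-logs is a hypothetical directory used for illustration.
  best_effort "create job log directory" mkdir -p /var/log/job-logs
fi
```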
Best practices
- Keep scripts short and fast. Long-running scripts delay job scheduling.
- Use set -e in init scripts so failures surface immediately instead of silently continuing.
- Do not call Slurm commands (squeue, scontrol, sacctmgr) inside prolog or epilog scripts. This can cause deadlocks and degrade scheduler performance.
- Use the SLURM_JOB_ID and SLURM_JOB_USER environment variables to scope cleanup and logging to the correct job.
- Test scripts on a development cluster before applying them to production.
- Use Extra slurm.conf with PrologFlags=Alloc if your worker prolog must run at allocation time rather than at first job step.
Troubleshooting
Node stuck in DRAIN state after script failure
A worker prolog or epilog that returns a non-zero exit code drains the node.
Fix:
- SSH into the login node and check the script output in /var/log/slurm/.
- Fix the script, then resume the node:
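
Resuming a drained node is typically done with scontrol. The sketch below wraps the two standard commands in hypothetical helper functions, and assumes it is run interactively from a node where scontrol is available, by an account allowed to change node state.

```shell
#!/bin/bash
# Sketch: inspect and resume a drained node from an interactive shell.
# Do not call these from prolog or epilog scripts.

# show_drain_reason <node>: print the node record, which includes its State and Reason.
show_drain_reason() {
  scontrol show node "$1"
}

# resume_node <node>: clear the DRAIN state so the node accepts jobs again.
resume_node() {
  scontrol update NodeName="$1" State=RESUME
}
```

For example, `show_drain_reason worker-0` followed by `resume_node worker-0`, where worker-0 is a hypothetical node name. Running `sinfo -R` lists drained nodes together with their reasons.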
Init script packages not available in jobs
The worker init script runs at node boot, not at job start. If a package install fails silently, jobs will not have the expected tools.
Fix:
- Add set -e to your init script to catch failures.
- SSH into a worker node and verify the package is installed.
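
The verification step can be sketched as a small check run on the worker node. `check_commands` and the script name check-tools.sh are hypothetical; the script takes the expected command names as arguments and exits non-zero if any of them is missing.

```shell
#!/bin/bash
# Sketch: report which expected commands are missing on this node.
# Usage: bash check-tools.sh sox ffmpeg

# check_commands <command...>: print ok/missing per command; return 1 if any is missing.
check_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "ok: $cmd"
    else
      echo "missing: $cmd"
      missing=1
    fi
  done
  return "$missing"
}

check_commands "$@"
```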
Worker prolog not running at allocation time
By default, the worker prolog runs at first job step, not at allocation. If your prolog must run immediately when the job is allocated, add PrologFlags=Alloc to the Extra slurm.conf field.