Modify Slurm configuration files to optimize scheduling, resource allocation, and job management for your GPU cluster.

Prerequisites

  • kubectl CLI installed and configured
  • Kubeconfig downloaded from your cluster
  • Access to your cluster’s Slurm namespace

Configuration Files

Your Slurm cluster configuration is stored in a Kubernetes ConfigMap with four main files:
File            Purpose
slurm.conf      Main cluster configuration (nodes, partitions, scheduling)
gres.conf       GPU and generic resource definitions
cgroup.conf     Control group resource management
plugstack.conf  SPANK plugin configuration
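
To confirm which files your ConfigMap actually carries, you can list its data keys. This sketch assumes the ConfigMap is named slurm in the slurm namespace, as in the examples below:
# List the file names stored in the ConfigMap
kubectl get configmap slurm -n slurm -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'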

Edit Configuration

Update ConfigMap

Edit the ConfigMap directly:
kubectl edit configmap slurm -n slurm
This opens the ConfigMap in your default editor; make your changes and save.

Alternatively, export the ConfigMap, edit it locally, and re-apply it:
# Export to local file
kubectl get configmap slurm -n slurm -o yaml > slurm-config.yaml

# Edit locally
# ... make your changes ...

# Apply changes
kubectl apply -f slurm-config.yaml
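
Before editing, it can also help to keep a dated backup of the current ConfigMap so you can roll back quickly; the file name here is just an example:
# Save a timestamped backup of the current configuration
kubectl get configmap slurm -n slurm -o yaml > slurm-config-backup-$(date +%Y%m%d-%H%M%S).yaml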

Restart Components

After editing the ConfigMap, restart the appropriate components.

For slurm.conf changes:
# Restart controller
kubectl rollout restart statefulset slurm-controller -n slurm

# Restart compute node pods
kubectl delete pods -n slurm -l app=slurm-compute-production

For gres.conf or plugstack.conf changes:
# Restart compute node pods only
kubectl delete pods -n slurm -l app=slurm-compute-production
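
After deleting the compute pods, you can wait for their replacements to become Ready before submitting jobs again; this sketch assumes the same app=slurm-compute-production label used above:
# Wait for replacement compute pods to become Ready (up to 5 minutes)
kubectl wait --for=condition=Ready pods -n slurm -l app=slurm-compute-production --timeout=300s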

Verify Changes

# Check rollout status
kubectl rollout status statefulset slurm-controller -n slurm

# Verify configuration in pod
kubectl exec -it slurm-controller-0 -n slurm -- cat /etc/slurm/slurm.conf

# Test Slurm functionality
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config
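
To spot-check a single setting rather than reading the full output, you can filter scontrol show config; SchedulerParameters here is just an example key:
# Confirm a specific setting took effect
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config | grep -i SchedulerParameters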

Configuration Examples

Configure GPU Resources

Edit gres.conf to define GPU resources:
Name=gpu Type=a100 File=/dev/nvidia[0-7]
Name=gpu Type=h100 File=/dev/nvidia[8-15]
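
For these GRES definitions to be schedulable, slurm.conf also needs a GresTypes entry and matching Gres= values on the node definitions; a minimal sketch, with hypothetical node names and counts:
# In slurm.conf (node names and GPU counts are illustrative)
GresTypes=gpu
NodeName=gpu-node-[001-004] Gres=gpu:a100:8,gpu:h100:8 State=UNKNOWN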

Modify Partitions

Edit the partition section in slurm.conf:
PartitionName=gpu Nodes=gpu-nodes State=UP Default=NO MaxTime=24:00:00
PartitionName=cpu Nodes=cpu-nodes State=UP Default=YES
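
Once the partitions exist, users can target them explicitly at submission time; a hypothetical sbatch invocation requesting two A100 GPUs on the gpu partition:
# Submit a job to the gpu partition with two A100 GPUs (job.sh is a placeholder)
sbatch --partition=gpu --gres=gpu:a100:2 --time=02:00:00 job.sh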

Tune Scheduler

Adjust scheduler parameters in slurm.conf:
SchedulerParameters=batch_sched_delay=10,bf_interval=180,sched_max_job_start=500
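
After tuning, scheduler and backfill statistics can be checked with sdiag to see whether the new parameters behave as expected:
# Inspect scheduler and backfill cycle statistics
kubectl exec -it slurm-controller-0 -n slurm -- sdiag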

Update Resource Allocation

Modify resource allocation settings:
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=4096  # in MB; 4096 MB = 4 GB per CPU
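
Note that CR_Core_Memory only takes effect when a consumable-resource select plugin is in use; a minimal sketch, assuming select/cons_tres:
# CR_Core_Memory requires a consumable-resource plugin such as cons_tres
SelectType=select/cons_tres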

Enable Cgroup Limits

Edit cgroup.conf to enforce resource limits:
CgroupPlugin=cgroup/v1
ConstrainCores=yes
ConstrainRAMSpace=yes
Then update slurm.conf:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
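
After restarting the affected components, you can confirm the cgroup-related plugins are active in the running configuration:
# Verify the cgroup-related settings were picked up
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config | grep -Ei 'ProctrackType|TaskPlugin'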

Troubleshooting

Configuration Not Applied

# Verify ConfigMap was updated
kubectl get configmap slurm -n slurm -o yaml

# Check pod age (should be recent after restart)
kubectl get pods -n slurm

# View controller logs
kubectl logs slurm-controller-0 -n slurm
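
If the pod is running but behaves as though nothing changed, it can help to diff the ConfigMap contents against the file actually mounted in the pod (ConfigMap volume updates can take a short time to propagate):
# Compare the ConfigMap with the file mounted in the controller pod
diff <(kubectl get configmap slurm -n slurm -o jsonpath='{.data.slurm\.conf}') \
     <(kubectl exec slurm-controller-0 -n slurm -- cat /etc/slurm/slurm.conf)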

Syntax Errors

# Check controller logs for errors
kubectl logs slurm-controller-0 -n slurm | grep -i error

# View recent events
kubectl get events -n slurm --sort-by='.lastTimestamp'

Pods Not Restarting

# Check rollout status
kubectl rollout status statefulset slurm-controller -n slurm

# Force delete and recreate pod
kubectl delete pod slurm-controller-0 -n slurm

Jobs Failing After Changes

# Check node status
kubectl exec -it slurm-controller-0 -n slurm -- sinfo

# Check specific node details
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show node <nodename>

# View job errors
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show job <jobid>
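
If nodes end up in a DOWN or DRAIN state after the change, they can be returned to service once the underlying issue is fixed; <nodename> here is a placeholder, as above:
# Return a drained or downed node to service
kubectl exec -it slurm-controller-0 -n slurm -- scontrol update NodeName=<nodename> State=RESUME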

Quick Reference

View Configurations

# View all Slurm configmaps
kubectl get configmaps -n slurm | grep slurm

# View slurm.conf content
kubectl get configmap slurm -n slurm -o jsonpath='{.data.slurm\.conf}'

# View gres.conf content
kubectl get configmap slurm -n slurm -o jsonpath='{.data.gres\.conf}'

Restart Components

# Restart controller
kubectl rollout restart statefulset slurm-controller -n slurm

# Restart accounting daemon
kubectl rollout restart statefulset slurm-accounting -n slurm

# Restart compute node pods
kubectl delete pods -n slurm -l app=slurm-compute-production

Monitor Cluster

# Watch pod status
kubectl get pods -n slurm -w

# View logs (follow mode)
kubectl logs -f slurm-controller-0 -n slurm

# Check Slurm cluster status
kubectl exec -it slurm-controller-0 -n slurm -- sinfo

Best Practices

  • Back up configurations before making changes
  • Test in development before applying to production
  • Make incremental changes to isolate issues
  • Document your changes for future reference
  • Monitor logs and jobs after applying changes
  • Use version control to track configuration changes

Slurm compute nodes run as pods, not DaemonSets. When you delete compute node pods, they restart automatically with the new configuration; running jobs may be affected during the restart.

Additional Resources