Modify Slurm configuration files to optimize scheduling, resource allocation, and job management for your GPU cluster.

Prerequisites

  • kubectl CLI installed and configured
  • Kubeconfig downloaded from your cluster
  • Access to your cluster’s Slurm namespace

Configuration Files

Your Slurm cluster configuration is stored in a Kubernetes ConfigMap with four main files:
File            Purpose
slurm.conf      Main cluster configuration (nodes, partitions, scheduling)
gres.conf       GPU and generic resource definitions
cgroup.conf     Control group resource management
plugstack.conf  SPANK plugin configuration
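To confirm which files are present, you can list the data keys of the ConfigMap (this assumes the ConfigMap is named slurm and lives in the slurm namespace, as in the rest of this guide):
# List the configuration files stored in the slurm ConfigMap
kubectl get configmap slurm -n slurm -o go-template='{{range $key, $value := .data}}{{$key}}{{"\n"}}{{end}}'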

Edit Configuration

Update ConfigMap

Edit the ConfigMap directly:
kubectl edit configmap slurm -n slurm
This opens the ConfigMap in your default editor. Make your changes and save.

Alternatively, export the ConfigMap to a file, edit it locally, and apply the changes:
# Export to local file
kubectl get configmap slurm -n slurm -o yaml > slurm-config.yaml

# Edit locally
# ... make your changes ...

# Apply changes
kubectl apply -f slurm-config.yaml
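Whichever method you use, it is worth saving a timestamped backup first so you can roll back if needed:
# Back up the current ConfigMap before editing
kubectl get configmap slurm -n slurm -o yaml > slurm-config-backup-$(date +%Y%m%d-%H%M%S).yaml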

Restart Components

After editing the ConfigMap, restart the appropriate components.

For slurm.conf changes:
# Restart controller
kubectl rollout restart statefulset slurm-controller -n slurm

# Restart compute nodes
kubectl rollout restart daemonset slurm-node -n slurm
For gres.conf, cgroup.conf, or plugstack.conf changes:
# Restart compute nodes only
kubectl rollout restart daemonset slurm-node -n slurm

Verify Changes

# Check rollout status
kubectl rollout status statefulset slurm-controller -n slurm

# Verify configuration in pod
kubectl exec -it slurm-controller-0 -n slurm -- cat /etc/slurm/slurm.conf

# Test Slurm functionality
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config

Configuration Examples

Configure GPU Resources

Edit gres.conf to define GPU resources:
Name=gpu Type=a100 File=/dev/nvidia[0-7]
Name=gpu Type=h100 File=/dev/nvidia[8-15]
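Note that gres.conf only describes the devices; for Slurm to schedule them, slurm.conf also needs a matching GresTypes entry and per-node Gres counts. A minimal sketch, assuming hypothetical node names gpu-node-a100 and gpu-node-h100 with eight GPUs each:
GresTypes=gpu
NodeName=gpu-node-a100 Gres=gpu:a100:8
NodeName=gpu-node-h100 Gres=gpu:h100:8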

Modify Partitions

Edit the partition section in slurm.conf:
PartitionName=gpu Nodes=gpu-nodes State=UP Default=NO MaxTime=24:00:00
PartitionName=cpu Nodes=cpu-nodes State=UP Default=YES
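After restarting the controller, confirm that the partitions were picked up:
# Show the updated partition definitions
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show partition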

Tune Scheduler

Adjust scheduler parameters in slurm.conf:
SchedulerParameters=batch_sched_delay=10,bf_interval=180,sched_max_job_start=500
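These values are illustrative; tune them to your workload. To confirm the scheduler is running with the new parameters:
# Verify the active scheduler parameters
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config | grep -i SchedulerParameters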

Update Resource Allocation

Modify resource allocation settings:
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=4096  # 4 GB per CPU (value is in MB)
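SelectTypeParameters only takes effect with a consumable-resource select plugin. A sketch of the related line, assuming the cons_tres plugin (common for GPU clusters; check what your cluster already uses before changing it):
SelectType=select/cons_tres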

Enable Cgroup Limits

Edit cgroup.conf to enforce resource limits:
CgroupPlugin=cgroup/v1
ConstrainCores=yes
ConstrainRAMSpace=yes
Then update slurm.conf:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
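After restarting both the controller and the compute nodes, confirm that the cgroup plugins are active:
# Verify cgroup-based process tracking and task plugins
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config | grep -iE 'ProctrackType|TaskPlugin'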

Troubleshooting

Configuration Not Applied

# Verify ConfigMap was updated
kubectl get configmap slurm -n slurm -o yaml

# Check pod age (should be recent after restart)
kubectl get pods -n slurm

# View controller logs
kubectl logs slurm-controller-0 -n slurm
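If the pod restarted but still shows old values, comparing the ConfigMap with the file mounted in the pod can pinpoint the problem (this assumes slurm.conf is mounted at /etc/slurm/slurm.conf, as shown above):
# Diff the ConfigMap against the file mounted in the controller pod
diff <(kubectl get configmap slurm -n slurm -o jsonpath='{.data.slurm\.conf}') \
     <(kubectl exec slurm-controller-0 -n slurm -- cat /etc/slurm/slurm.conf)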

Syntax Errors

# Check controller logs for errors
kubectl logs slurm-controller-0 -n slurm | grep -i error

# View recent events
kubectl get events -n slurm --sort-by='.lastTimestamp'

Pods Not Restarting

# Check rollout status
kubectl rollout status statefulset slurm-controller -n slurm

# Force delete and recreate pod
kubectl delete pod slurm-controller-0 -n slurm

Jobs Failing After Changes

# Check node status
kubectl exec -it slurm-controller-0 -n slurm -- sinfo

# Check specific node details
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show node <nodename>

# View job errors
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show job <jobid>
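If nodes went into a drained or down state because of the change, they can usually be returned to service once the configuration is consistent again (replace <nodename> with the affected node):
# Return a drained node to service
kubectl exec -it slurm-controller-0 -n slurm -- scontrol update NodeName=<nodename> State=RESUME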

Quick Reference

View Configurations

# View all Slurm configmaps
kubectl get configmaps -n slurm | grep slurm

# View slurm.conf content
kubectl get configmap slurm -n slurm -o jsonpath='{.data.slurm\.conf}'

# View gres.conf content
kubectl get configmap slurm -n slurm -o jsonpath='{.data.gres\.conf}'

Restart Components

# Restart controller
kubectl rollout restart statefulset slurm-controller -n slurm

# Restart accounting daemon
kubectl rollout restart statefulset slurm-accounting -n slurm

# Restart compute nodes
kubectl rollout restart daemonset slurm-node -n slurm

Monitor Cluster

# Watch pod status
kubectl get pods -n slurm -w

# View logs (follow mode)
kubectl logs -f slurm-controller-0 -n slurm

# Check Slurm cluster status
kubectl exec -it slurm-controller-0 -n slurm -- sinfo

Best Practices

  • Back up configurations before making changes
  • Test in development before applying to production
  • Make incremental changes to isolate issues
  • Document your changes for future reference
  • Monitor logs and jobs after applying changes
  • Use version control to track configuration changes (see the example below)
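A minimal version-control workflow, assuming a git repository already exists for your cluster configuration:
# Export the current configuration and record it in git
kubectl get configmap slurm -n slurm -o yaml > slurm-config.yaml
git add slurm-config.yaml
git commit -m "Tune backfill scheduler parameters"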
Running jobs are not affected by configuration changes. Changes persist across pod restarts, and rolling restarts minimize downtime.

Additional Resources