Modify Slurm configuration files to optimize scheduling, resource allocation, and job management for your GPU cluster.
Prerequisites
- kubectl CLI installed and configured
- Kubeconfig downloaded from your cluster
- Access to your cluster’s Slurm namespace
Configuration Files
Your Slurm cluster configuration is stored in a Kubernetes ConfigMap with four main files:
| File | Purpose |
|---|---|
| slurm.conf | Main cluster configuration (nodes, partitions, scheduling) |
| gres.conf | GPU and generic resource definitions |
| cgroup.conf | Control group resource management |
| plugstack.conf | SPANK plugin configuration |
Edit Configuration
Update ConfigMap
Edit the ConfigMap directly:
kubectl edit configmap slurm -n slurm
This opens the ConfigMap in your default editor. Make your changes and save.
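If you want a safety net first, export a copy of the live ConfigMap before editing (the file name is just a suggestion):
# Back up the current ConfigMap
kubectl get configmap slurm -n slurm -o yaml > slurm-config-backup.yaml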
Alternative method:
# Export to local file
kubectl get configmap slurm -n slurm -o yaml > slurm-config.yaml
# Edit locally
# ... make your changes ...
# Apply changes
kubectl apply -f slurm-config.yaml
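Before applying, you can preview what would change on the server:
# Show a diff against the live object without applying it
kubectl diff -f slurm-config.yaml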
Restart Components
After editing the ConfigMap, restart the appropriate components:
For slurm.conf changes:
# Restart controller
kubectl rollout restart statefulset slurm-controller -n slurm
# Restart compute node pods
kubectl delete pods -n slurm -l app=slurm-compute-production
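Many slurm.conf parameters can also be re-read in place without restarting pods, assuming the config files are volume-mounted so that ConfigMap updates propagate into the running containers; parameters such as plugin or authentication types still need a full restart:
# Ask the running controller to re-read slurm.conf
kubectl exec -it slurm-controller-0 -n slurm -- scontrol reconfigure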
For gres.conf, cgroup.conf, or plugstack.conf changes:
# Restart compute node pods only
kubectl delete pods -n slurm -l app=slurm-compute-production
Verify Changes
# Check rollout status
kubectl rollout status statefulset slurm-controller -n slurm
# Verify configuration in pod
kubectl exec -it slurm-controller-0 -n slurm -- cat /etc/slurm/slurm.conf
# Test Slurm functionality
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config
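To spot-check a single value instead of reading the full dump, filter the output; SchedulerParameters is used here only as an example:
# Confirm one parameter took effect
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config | grep -i schedulerparameters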
Configuration Examples
Define GPU Resources
Edit gres.conf to define GPU resources:
Name=gpu Type=a100 File=/dev/nvidia[0-7]
Name=gpu Type=h100 File=/dev/nvidia[8-15]
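gres.conf only maps GPU types to device files; the same GPUs must also be declared in slurm.conf for Slurm to schedule them. A minimal sketch of the matching entries (the node name and counts are illustrative, not taken from your cluster):
GresTypes=gpu
NodeName=gpu-node-01 Gres=gpu:a100:8,gpu:h100:8
After restarting the compute pods, verify what each node reports:
# List the generic resources advertised by each node
kubectl exec -it slurm-controller-0 -n slurm -- sinfo -o "%N %G"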
Modify Partitions
Edit the partition section in slurm.conf:
PartitionName=gpu Nodes=gpu-nodes State=UP Default=NO MaxTime=24:00:00
PartitionName=cpu Nodes=cpu-nodes State=UP Default=YES
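After restarting the controller, confirm that the new partition definitions took effect:
# Show the effective settings for the gpu partition
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show partition gpu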
Tune Scheduler
Adjust scheduler parameters in slurm.conf:
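# batch_sched_delay: seconds newly submitted batch jobs may wait before scheduling
# bf_interval: seconds between backfill scheduler iterations (Slurm default is 30)
# sched_max_job_start: cap on jobs the main scheduler starts in a single cycle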
SchedulerParameters=batch_sched_delay=10,bf_interval=180,sched_max_job_start=500
Update Resource Allocation
Modify resource allocation settings:
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=4096  # default memory per CPU in MB (4 GB)
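With CR_Core_Memory, memory is a consumable resource, so jobs needing more than the default should request it explicitly. A quick illustration (the wrapped command is a placeholder):
# Submit a test job requesting 8 GB per CPU
kubectl exec -it slurm-controller-0 -n slurm -- sbatch --mem-per-cpu=8192 --wrap "hostname"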
Enable Cgroup Limits
Edit cgroup.conf to enforce resource limits:
CgroupPlugin=cgroup/v1
ConstrainCores=yes
ConstrainRAMSpace=yes
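The cgroup/v1 setting assumes the hosts still use the legacy cgroup hierarchy; on hosts running cgroup v2 (the default on most current distributions), use the v2 plugin instead:
CgroupPlugin=cgroup/v2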
Then update slurm.conf:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
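To confirm enforcement, launch a test step and inspect where it lands, assuming srun is available in the controller pod (as sinfo and scontrol are in the examples above); with task/cgroup active, the path should include a Slurm-managed hierarchy:
# Print the cgroup placement of a one-node test step
kubectl exec -it slurm-controller-0 -n slurm -- srun -N1 cat /proc/self/cgroup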
Troubleshooting
Configuration Not Applied
# Verify ConfigMap was updated
kubectl get configmap slurm -n slurm -o yaml
# Check pod age (should be recent after restart)
kubectl get pods -n slurm
# View controller logs
kubectl logs slurm-controller-0 -n slurm
Syntax Errors
# Check controller logs for errors
kubectl logs slurm-controller-0 -n slurm | grep -i error
# View recent events
kubectl get events -n slurm --sort-by='.lastTimestamp'
Pods Not Restarting
# Check rollout status
kubectl rollout status statefulset slurm-controller -n slurm
# Force delete and recreate pod
kubectl delete pod slurm-controller-0 -n slurm
Jobs Failing After Changes
# Check node status
kubectl exec -it slurm-controller-0 -n slurm -- sinfo
# Check specific node details
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show node <nodename>
# View job errors
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show job <jobid>
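If a node was drained or marked down while the configuration was broken, it may need to be resumed manually after the fix:
# Return a drained node to service
kubectl exec -it slurm-controller-0 -n slurm -- scontrol update NodeName=<nodename> State=RESUME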
Quick Reference
View Configurations
# View all Slurm configmaps
kubectl get configmaps -n slurm | grep slurm
# View slurm.conf content
kubectl get configmap slurm -n slurm -o jsonpath='{.data.slurm\.conf}'
# View gres.conf content
kubectl get configmap slurm -n slurm -o jsonpath='{.data.gres\.conf}'
Restart Components
# Restart controller
kubectl rollout restart statefulset slurm-controller -n slurm
# Restart accounting daemon
kubectl rollout restart statefulset slurm-accounting -n slurm
# Restart compute node pods
kubectl delete pods -n slurm -l app=slurm-compute-production
Monitor Cluster
# Watch pod status
kubectl get pods -n slurm -w
# View logs (follow mode)
kubectl logs -f slurm-controller-0 -n slurm
# Check Slurm cluster status
kubectl exec -it slurm-controller-0 -n slurm -- sinfo
Best Practices
- Back up configurations before making changes
- Test in development before applying to production
- Make incremental changes to isolate issues
- Document your changes for future reference
- Monitor logs and jobs after applying changes
- Use version control to track configuration changes (see the sketch below)
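A lightweight way to combine the backup and version-control habits, assuming a local git repository for cluster configs (file name and commit message are illustrative):
# Snapshot the live ConfigMap into a tracked file and commit it
kubectl get configmap slurm -n slurm -o yaml > slurm-configmap.yaml
git add slurm-configmap.yaml
git commit -m "slurm: describe your change here"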
Slurm compute nodes run as pods (not DaemonSets). When you delete compute node pods, they are recreated automatically with the new configuration; running jobs may be interrupted during the restart.