What Is Slurm?
Slurm (Simple Linux Utility for Resource Management) is the dominant job scheduler for HPC clusters. It decides which jobs run, on which nodes, and when. If you’ve trained a model on a multi-node GPU cluster, Slurm almost certainly managed the resource allocation.
At its core, Slurm solves a bin-packing problem: given N compute nodes with finite CPUs, GPUs, and memory, schedule M jobs to maximize utilization while respecting resource constraints and fairness policies.
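To make the bin-packing framing concrete, here is a minimal first-fit sketch. This is illustrative only (the function name and simplified model are mine): Slurm's real scheduler also weighs GPUs, memory, priority, fairness, and backfill.

```python
# Toy first-fit scheduler: assign jobs to nodes with enough free CPUs.
# Illustrative sketch only; Slurm also considers GPUs, memory,
# priority, fair-share, and backfill when placing jobs.

def first_fit(jobs, node_cpus):
    """jobs: list of (job_id, cpus_needed); node_cpus: free CPUs per node.
    Returns {job_id: node_index} for jobs that fit; the rest stay pending."""
    free = list(node_cpus)
    placement = {}
    for job_id, need in jobs:
        for i, avail in enumerate(free):
            if avail >= need:
                placement[job_id] = i
                free[i] -= need
                break
    return placement

# Two 64-core nodes, three jobs: the 64-core job cannot fit and stays pending.
print(first_fit([("a", 32), ("b", 48), ("c", 64)], [64, 64]))
```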
Architecture: The Three Daemons
Slurm runs three processes that coordinate job scheduling across the cluster.
slurmctld (Controller)
The central brain. It maintains the cluster state — which nodes are up, which jobs are queued, which resources are free. All scheduling decisions happen here. Typically runs on a dedicated management node with a backup for high availability.
slurmd (Compute Node Daemon)
Runs on every compute node. It receives job assignments from slurmctld, launches user processes, monitors resource usage, and reports status back. Think of it as the local execution agent.
slurmdbd (Database Daemon)
Optional but common in production. Records job history, resource usage, and accounting data to a database (usually MySQL/MariaDB). Enables sacct queries and fair-share scheduling.
Why Three Daemons?
Separation of concerns: the controller makes scheduling decisions, the node daemons execute them, and the database daemon records everything. This architecture scales to clusters with thousands of nodes.
Core Commands
sbatch — Submit Batch Jobs
The most common command. Submits a job script to the queue and returns immediately with a job ID. The script runs later when resources become available.
```bash
sbatch train.sh
# Submitted batch job 12345
```
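If you are wrapping sbatch in a script, the job ID can be pulled out of that default output line. A small sketch (the helper name is mine; with `--parsable`, covered later, sbatch prints just the ID and no parsing is needed):

```python
# Sketch: extract the job ID from sbatch's default output line.
# (sbatch --parsable prints the bare ID instead, which is easier to script.)
import re

def parse_sbatch_output(line):
    """Return the job ID from 'Submitted batch job 12345', or None."""
    m = re.match(r"Submitted batch job (\d+)", line.strip())
    return int(m.group(1)) if m else None

print(parse_sbatch_output("Submitted batch job 12345"))  # 12345
```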
srun — Run Interactive/Parallel Tasks
Runs a command directly on allocated resources. Used for interactive work or to launch parallel tasks within an existing allocation. Blocks until the command completes.
```bash
srun --nodes=2 --ntasks=8 python distributed_train.py
```
salloc — Allocate Resources
Requests an interactive allocation without running anything. You get a shell on the allocated resources and can run commands manually. Useful for debugging.
```bash
salloc --nodes=1 --gres=gpu:2 --time=01:00:00
# salloc: Granted job allocation 12346
```
Job Lifecycle
Every Slurm job follows the same state machine: submitted → queued → scheduled → running → completed (or failed). The transition from PENDING to RUNNING depends on resource availability, priority, and partition constraints.
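The lifecycle can be modeled as a small transition table. This is a simplified sketch of my own; Slurm has more states (SUSPENDED, TIMEOUT, CANCELLED, and others) and more edges between them.

```python
# Simplified Slurm job state machine. The real one has more states
# (SUSPENDED, PREEMPTED, ...), but the happy path looks like this.
TRANSITIONS = {
    "PENDING": {"RUNNING", "CANCELLED"},
    "RUNNING": {"COMPLETED", "FAILED", "TIMEOUT", "CANCELLED"},
}

def can_transition(src, dst):
    """True if a job in state `src` may move directly to state `dst`."""
    return dst in TRANSITIONS.get(src, set())

print(can_transition("PENDING", "RUNNING"))    # True
print(can_transition("PENDING", "COMPLETED"))  # False: a job must run first
```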
Monitoring Jobs
Three commands give you visibility into what’s happening on the cluster:
squeue — What’s Running Now
```bash
# Your jobs
squeue -u $USER

# All GPU jobs with custom format
squeue -p gpu -o "%.8i %.9P %.20j %.8u %.2t %M %l %R"

# Estimated start time for pending jobs
squeue --start -u $USER
```
sacct — What Already Happened
```bash
# Job history with efficiency
sacct -j 12345 --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS,ExitCode

# All your jobs from the last 7 days
sacct -u $USER --starttime=$(date -d '7 days ago' +%Y-%m-%d) --format=JobID,JobName,State,Elapsed
```
scontrol — Full Job Details
```bash
# Everything about a specific job
scontrol show job 12345

# Key fields to check: Reason (why PENDING), ExitCode (why FAILED),
# RunTime vs TimeLimit
```
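When you need these numbers in a script rather than on screen, sacct's machine-readable output is easier to work with than the aligned columns. A sketch, assuming `--parsable2` (pipe-delimited, one header row):

```python
# Sketch: parse `sacct --parsable2` output into a list of dicts, e.g.
#   sacct -j 12345 --parsable2 --format=JobID,State,ExitCode
# --parsable2 emits pipe-delimited rows with a single header line.

def parse_sacct(output):
    lines = output.strip().splitlines()
    header = lines[0].split("|")
    return [dict(zip(header, row.split("|"))) for row in lines[1:]]

sample = "JobID|State|ExitCode\n12345|COMPLETED|0:0\n12345.batch|COMPLETED|0:0"
rows = parse_sacct(sample)
print(rows[0]["State"])  # COMPLETED
```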
Writing Job Scripts
A Slurm job script is a shell script with #SBATCH directives that specify resource requirements:
```bash
#!/bin/bash
#SBATCH --job-name=train-resnet
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=08:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.com

set -euo pipefail
trap 'echo "Job $SLURM_JOB_ID failed at line $LINENO" >&2' ERR

# --- Environment ---
module load cuda/12.1
source activate myenv

# --- Run ---
srun python train.py --epochs 100 --checkpoint-dir /scratch/$USER/$SLURM_JOB_ID
```
Key directives:
- `--job-name`: Human-readable name (shows in `squeue`)
- `--partition`: Which queue to submit to (determines available hardware)
- `--nodes`: Number of compute nodes
- `--ntasks-per-node`: Processes per node
- `--gres`: Generic resources like GPUs
- `--time`: Maximum wall time (job killed if exceeded)
- `--output`/`--error`: Where stdout/stderr go (`%j` expands to job ID)
Error Handling Matters
`set -euo pipefail` stops the script on any error. Without it, a failed `module load` continues silently, and your training runs on CPU without you knowing. The `trap` line prints which line failed.
Array Jobs
When you need to run the same script with different parameters, array jobs are far better than submitting 100 separate sbatch calls:
```bash
#!/bin/bash
#SBATCH --job-name=hp-sweep
#SBATCH --array=0-99
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --output=logs/sweep_%A_%a.out
# %A = array master job ID, %a = array task index

LEARNING_RATES=(1e-2 1e-3 1e-4 5e-4 1e-5)
LR_IDX=$((SLURM_ARRAY_TASK_ID % 5))
SEED=$((SLURM_ARRAY_TASK_ID / 5))

python train.py --lr ${LEARNING_RATES[$LR_IDX]} --seed $SEED
```
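The index arithmetic in that script maps 100 task IDs onto a 5 x 20 grid of learning rates and seeds. The same mapping, as a checkable function (the function name is mine):

```python
# Task IDs 0-99 map onto 5 learning rates x 20 seeds, mirroring the
# modulo/integer-division arithmetic in the sweep script above.
LEARNING_RATES = ["1e-2", "1e-3", "1e-4", "5e-4", "1e-5"]

def sweep_params(task_id, n_lr=5):
    lr = LEARNING_RATES[task_id % n_lr]
    seed = task_id // n_lr
    return lr, seed

print(sweep_params(0))   # ('1e-2', 0)
print(sweep_params(7))   # ('1e-4', 1): 7 % 5 = 2, 7 // 5 = 1
print(sweep_params(99))  # ('1e-5', 19)
```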
Array Job Controls
```bash
# Submit array of 100 jobs, max 10 running at once
sbatch --array=0-99%10 sweep.sh

# Cancel specific array tasks
scancel 12345_[50-99]

# Check array job status (-r shows individual tasks)
squeue -j 12345 -r
```
Array vs Loop
Don’t submit in a bash loop (`for i in {1..100}; do sbatch ...; done`). Array jobs are a single submission that Slurm can schedule more efficiently, and you get one master job ID to manage all tasks.
Job Dependencies
Chain jobs together with --dependency so downstream steps only run when upstream steps succeed:
```bash
# Submit pipeline with dependencies
DATA_JOB=$(sbatch --parsable data-prep.sh)
TRAIN_JOB=$(sbatch --parsable --dependency=afterok:$DATA_JOB train.sh)
EVAL_JOB=$(sbatch --parsable --dependency=afterok:$TRAIN_JOB eval.sh)
```
Dependency Types
| Flag | Meaning | Use Case |
|---|---|---|
| `afterok:JOBID` | Run only if dependency succeeded | Standard pipeline |
| `afternotok:JOBID` | Run only if dependency failed | Retry/recovery logic |
| `afterany:JOBID` | Run regardless of outcome | Cleanup scripts |
| `after:JOBID` | Run after dependency starts | Overlapping stages |
| `afterok:JOB1:JOB2` | Run after ALL listed jobs succeed | Fan-in from parallel tasks |
Use --parsable
Always use `sbatch --parsable` when capturing job IDs for dependencies. Without it, sbatch returns a human-readable string like “Submitted batch job 12345” that you’d need to parse.
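For the fan-in case, the dependency flag takes colon-separated job IDs. A small helper sketch for building it (the function name is mine; the flag syntax matches the table above):

```python
# Sketch: build the --dependency flag for a fan-in step that waits on
# one or more jobs, matching the afterok:JOB1:JOB2 form.

def afterok(*job_ids):
    return "--dependency=afterok:" + ":".join(str(j) for j in job_ids)

print(afterok(12345))          # --dependency=afterok:12345
print(afterok(101, 102, 103))  # fan-in: wait for all three to succeed
```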
Partitions and QOS
Partitions group nodes by capability: a gpu partition might have nodes with A100s, while cpu has larger memory but no GPUs. Each partition has its own limits (max wall time, max nodes per job, priority weight).
Quality of Service (QOS) adds another layer: a high QOS might allow longer wall times or higher priority, while preempt allows jobs that can be killed when higher-priority work arrives.
```bash
# Check available partitions
sinfo -o "%P %N %G %c %m %l"

# PARTITION NODELIST    GRES        CPUS  MEMORY  TIMELIMIT
# cpu*      node[01-10] (null)      64    256000  7-00:00:00
# gpu       node[11-14] gpu:a100:4  64    512000  3-00:00:00
```
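To pick a partition programmatically (say, the first one with A100s), that output can be parsed directly. A sketch, assuming whitespace-delimited columns with a single header row; the helper name is mine:

```python
# Sketch: turn sinfo's columnar output (as formatted above) into dicts.
# Assumes whitespace-delimited fields and one header row.

def parse_sinfo(output):
    lines = output.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

sample = (
    "PARTITION NODELIST GRES CPUS MEMORY TIMELIMIT\n"
    "cpu* node[01-10] (null) 64 256000 7-00:00:00\n"
    "gpu node[11-14] gpu:a100:4 64 512000 3-00:00:00"
)
parts = parse_sinfo(sample)
print(parts[1]["GRES"])  # gpu:a100:4
```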
Common Pitfalls
1. Forgetting --time
Without a time limit, Slurm uses the partition default (often very short). Your 24-hour training run gets killed after 1 hour.
2. Requesting more resources than exist
--gres=gpu:8 on a 4-GPU node queues the job forever. Always check sinfo for available resources.
3. Not using %j in output paths
Multiple jobs writing to the same output file overwrite each other. Always include the job ID: --output=logs/%j.out.
4. Ignoring squeue during long waits
A job stuck in PENDING usually means resources are unavailable or the request exceeds partition limits. Check squeue --start to see the estimated start time, or scontrol show job <id> for the reason.
Troubleshooting
Why Is My Job PENDING?
Check the reason with squeue -j JOBID -o "%R":
| Reason | Meaning | Fix |
|---|---|---|
| Resources | Not enough free nodes/GPUs | Wait, or reduce resource request |
| Priority | Lower priority than other jobs | Wait, or check sprio -j JOBID for your score |
| Dependency | Waiting for another job | Check dependent job with scontrol show job |
| QOSMaxJobsPerUserLimit | Too many running jobs | Wait for current jobs to finish |
| ReqNodeNotAvail | Requested nodes are down | Use sinfo -N to find available nodes |
| AssocGrpCpuLimit | Account CPU limit exceeded | Check limits with sacctmgr show assoc user=$USER |
Why Did My Job Fail?
```bash
# Check exit code
sacct -j JOBID --format=JobID,State,ExitCode,DerivedExitCode

# Exit code format: return_code:signal
#   0:0  = success
#   1:0  = script error (check your code)
#   0:9  = killed by signal 9 (SIGKILL — OOM or time limit)
#   0:15 = killed by signal 15 (SIGTERM — preempted or scancel)
```
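That `return_code:signal` convention is easy to decode in a post-mortem script. A sketch applying the interpretations above (the function name is mine):

```python
# Sketch: interpret sacct's ExitCode field, which is "return_code:signal".

def explain_exit(exit_code):
    rc, sig = (int(x) for x in exit_code.split(":"))
    if rc == 0 and sig == 0:
        return "success"
    if sig == 9:
        return "killed by SIGKILL (OOM or time limit)"
    if sig == 15:
        return "killed by SIGTERM (preempted or scancel)"
    if rc != 0:
        return f"script exited with code {rc}"
    return f"killed by signal {sig}"

print(explain_exit("0:0"))  # success
print(explain_exit("0:9"))  # killed by SIGKILL (OOM or time limit)
```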
Graceful Shutdown on Time Limit
```bash
#SBATCH --signal=B:USR1@300  # Send SIGUSR1 300 seconds before time limit
```

```python
# In your Python script:
import signal
import sys

def handler(signum, frame):
    save_checkpoint()
    sys.exit(0)

signal.signal(signal.SIGUSR1, handler)
```
Key Takeaways
- Three daemons — slurmctld (scheduler), slurmd (execution), slurmdbd (accounting). Each has a distinct role.
- Three submission commands — sbatch (batch), srun (inline/parallel), salloc (interactive allocation).
- Jobs follow a state machine — PENDING → RUNNING → COMPLETED. A job stays PENDING until resources are available.
- Partitions and QOS control access — which hardware you can use, for how long, and at what priority.
Related Concepts
- Slurm GPU Allocation: How to request and manage GPUs with `--gres` for distributed training
- Slurm Resource Management: Monitoring jobs, understanding priority, and fair-share scheduling
- Distributed Parallelism: Data and model parallelism patterns that run on Slurm clusters
- Multi-GPU Communication: NCCL and collective operations across Slurm-allocated GPUs
Further Reading
- Slurm Workload Manager Documentation - Official SchedMD documentation covering all Slurm commands and configuration
- Introduction to Slurm - SchedMD's official quick start guide for new users
- NERSC Slurm Guide - NERSC's practical guide to job submission on production HPC systems
