Slurm Fundamentals: Job Scheduling on HPC Clusters

Complete guide to Slurm — architecture, core commands, job lifecycle, job scripts, array jobs, dependencies, monitoring with squeue/sacct, and troubleshooting failed jobs on HPC clusters.

What Is Slurm?

Slurm (Simple Linux Utility for Resource Management) is the dominant job scheduler for HPC clusters. It decides which jobs run, on which nodes, and when. If you’ve trained a model on a multi-node GPU cluster, Slurm almost certainly managed the resource allocation.

At its core, Slurm solves a bin-packing problem: given N compute nodes with finite CPUs, GPUs, and memory, schedule M jobs to maximize utilization while respecting resource constraints and fairness policies.

Architecture: The Three Daemons

Slurm runs three processes that coordinate job scheduling across the cluster.

slurmctld (Controller)

The central brain. It maintains the cluster state — which nodes are up, which jobs are queued, which resources are free. All scheduling decisions happen here. Typically runs on a dedicated management node with a backup for high availability.

slurmd (Compute Node Daemon)

Runs on every compute node. It receives job assignments from slurmctld, launches user processes, monitors resource usage, and reports status back. Think of it as the local execution agent.

slurmdbd (Database Daemon)

Optional but common in production. Records job history, resource usage, and accounting data to a database (usually MySQL/MariaDB). Enables sacct queries and fair-share scheduling.

Why Three Daemons?

Separation of concerns: the controller makes scheduling decisions, the node daemons execute them, and the database daemon records everything. This architecture scales to clusters with thousands of nodes.

Core Commands

sbatch — Submit Batch Jobs

The most common command. Submits a job script to the queue and returns immediately with a job ID. The script runs later when resources become available.

```shell
sbatch train.sh
# Submitted batch job 12345
```

srun — Run Interactive/Parallel Tasks

Runs a command directly on allocated resources. Used for interactive work or to launch parallel tasks within an existing allocation. Blocks until the command completes.

```shell
srun --nodes=2 --ntasks=8 python distributed_train.py
```

salloc — Allocate Resources

Requests an interactive allocation without running anything. You get a shell on the allocated resources and can run commands manually. Useful for debugging.

```shell
salloc --nodes=1 --gres=gpu:2 --time=01:00:00
# salloc: Granted job allocation 12346
```

Job Lifecycle

Every Slurm job follows the same state machine: submitted → queued → scheduled → running → completed (or failed). The transition from PENDING to RUNNING depends on resource availability, priority, and partition constraints.

Monitoring Jobs

Three commands give you visibility into what’s happening on the cluster:

squeue — What’s Running Now

```shell
# Your jobs
squeue -u $USER

# All GPU jobs with custom format
squeue -p gpu -o "%.8i %.9P %.20j %.8u %.2t %M %l %R"

# Estimated start time for pending jobs
squeue --start -u $USER
```

sacct — What Already Happened

```shell
# Job history with efficiency
sacct -j 12345 --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS,ExitCode

# All your jobs from the last 7 days
sacct -u $USER --starttime=$(date -d '7 days ago' +%Y-%m-%d) --format=JobID,JobName,State,Elapsed
```

scontrol — Full Job Details

```shell
# Everything about a specific job
scontrol show job 12345

# Key fields to check:
#   Reason    (why PENDING)
#   ExitCode  (why FAILED)
#   RunTime vs TimeLimit
```

Writing Job Scripts

A Slurm job script is a shell script with #SBATCH directives that specify resource requirements:

```shell
#!/bin/bash
#SBATCH --job-name=train-resnet
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=08:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.com

set -euo pipefail
trap 'echo "Job $SLURM_JOB_ID failed at line $LINENO" >&2' ERR

# --- Environment ---
module load cuda/12.1
source activate myenv

# --- Run ---
srun python train.py --epochs 100 --checkpoint-dir /scratch/$USER/$SLURM_JOB_ID
```

Key directives:

  • --job-name: Human-readable name (shows in squeue)
  • --partition: Which queue to submit to (determines available hardware)
  • --nodes: Number of compute nodes
  • --ntasks-per-node: Processes per node
  • --gres: Generic resources like GPUs
  • --time: Maximum wall time (job killed if exceeded)
  • --output/--error: Where stdout/stderr go (%j expands to job ID)
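Slurm expands the filename patterns itself when the job starts, but the substitution is easy to mimic without a cluster. A small sketch (the job name and ID below are made-up values, and `expand` is a hypothetical helper, not a Slurm command):

```shell
# Mimic Slurm's %x/%j filename expansion with bash parameter substitution.
SLURM_JOB_NAME="train-resnet"   # made-up example values
SLURM_JOB_ID=12345

expand() {
  local p=$1
  p=${p//%x/$SLURM_JOB_NAME}   # %x -> job name
  p=${p//%j/$SLURM_JOB_ID}     # %j -> job ID
  echo "$p"
}

expand "logs/%x_%j.out"   # logs/train-resnet_12345.out
```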

Error Handling Matters

set -euo pipefail stops the script on any error. Without it, a failed module load silently continues, and your training runs on CPU without you knowing. The trap line prints which line failed.
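A cluster-free sketch of the difference, where `false` stands in for a setup step that fails (such as a bad `module load`):

```shell
# Without the guard, the failed step is ignored and the script keeps going.
run_without_guard() {
  bash -c 'false; echo "training started"'
}

# With set -euo pipefail, the first failure aborts the whole script.
run_with_guard() {
  bash -c 'set -euo pipefail; false; echo "training started"'
}

run_without_guard                 # prints "training started"
run_with_guard || echo "aborted"  # prints "aborted"
```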

Array Jobs

When you need to run the same script with different parameters, array jobs are far better than submitting 100 separate sbatch calls:

```shell
#!/bin/bash
#SBATCH --job-name=hp-sweep
#SBATCH --array=0-99
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --output=logs/sweep_%A_%a.out
# %A = array master job ID, %a = array task index

LEARNING_RATES=(1e-2 1e-3 1e-4 5e-4 1e-5)
LR_IDX=$((SLURM_ARRAY_TASK_ID % 5))
SEED=$((SLURM_ARRAY_TASK_ID / 5))

python train.py --lr ${LEARNING_RATES[$LR_IDX]} --seed $SEED
```
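The modulo/division arithmetic maps each task ID to one (learning rate, seed) pair. It can be sanity-checked outside Slurm by looping over sample task IDs (`map_task` is a hypothetical helper using the same five-element list):

```shell
# Check the task-ID -> (lr, seed) mapping without a cluster.
LEARNING_RATES=(1e-2 1e-3 1e-4 5e-4 1e-5)

map_task() {
  local id=$1
  local lr_idx=$((id % 5))   # cycles through the 5 learning rates
  local seed=$((id / 5))     # advances every 5 tasks
  echo "task $id -> lr=${LEARNING_RATES[$lr_idx]} seed=$seed"
}

for id in 0 4 5 99; do map_task "$id"; done
# task 0 -> lr=1e-2 seed=0
# task 4 -> lr=1e-5 seed=0
# task 5 -> lr=1e-2 seed=1
# task 99 -> lr=1e-5 seed=19
```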

Array Job Controls

```shell
# Submit array of 100 jobs, max 10 running at once
sbatch --array=0-99%10 sweep.sh

# Cancel specific array tasks
scancel 12345_[50-99]

# Check array job status (-r shows individual tasks)
squeue -j 12345 -r
```

Array vs Loop

Don’t submit in a bash loop (for i in {1..100}; do sbatch ...; done). Array jobs are a single submission that Slurm can schedule more efficiently, and you get one master job ID to manage all tasks.

Job Dependencies

Chain jobs together with --dependency so downstream steps only run when upstream steps succeed:

```shell
# Submit pipeline with dependencies
DATA_JOB=$(sbatch --parsable data-prep.sh)
TRAIN_JOB=$(sbatch --parsable --dependency=afterok:$DATA_JOB train.sh)
EVAL_JOB=$(sbatch --parsable --dependency=afterok:$TRAIN_JOB eval.sh)
```

Dependency Types

| Flag | Meaning | Use Case |
| --- | --- | --- |
| afterok:JOBID | Run only if dependency succeeded | Standard pipeline |
| afternotok:JOBID | Run only if dependency failed | Retry/recovery logic |
| afterany:JOBID | Run regardless of outcome | Cleanup scripts |
| after:JOBID | Run after dependency starts | Overlapping stages |
| afterok:JOB1:JOB2 | Run after ALL listed jobs succeed | Fan-in from parallel tasks |

Use --parsable

Always use sbatch --parsable when capturing job IDs for dependencies. Without it, sbatch returns a human-readable string like “Submitted batch job 12345” that you’d need to parse.
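The difference can be sketched without a cluster; below, plain variables stand in for the two output styles, and the awk call shows the parsing you would otherwise need:

```shell
# Simulated sbatch output (echo stands in for sbatch, so no cluster is needed).
default_output="Submitted batch job 12345"   # what plain sbatch prints
parsable_output="12345"                      # what sbatch --parsable prints

# Without --parsable you have to strip the prefix yourself:
job_id=$(echo "$default_output" | awk '{print $4}')
echo "$job_id"            # 12345

# With --parsable the output already is the ID:
echo "$parsable_output"   # 12345
```

One caveat: on federated multi-cluster setups, `--parsable` can append the cluster name after a semicolon (`12345;clustername`), so scripts meant to be portable may still want to split on `;`.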

Partitions and QOS

Partitions group nodes by capability: a gpu partition might contain nodes with A100s, while a cpu partition offers higher-memory nodes without GPUs. Each partition has its own limits (max wall time, max nodes per job, priority weight).

Quality of Service (QOS) adds another layer: a high QOS might allow longer wall times or higher priority, while preempt allows jobs that can be killed when higher-priority work arrives.

```shell
# Check available partitions
sinfo -o "%P %N %G %c %m %l"
# PARTITION NODES       GRES       CPUS MEMORY TIMELIMIT
# cpu*      node[01-10] (null)     64   256000 7-00:00:00
# gpu       node[11-14] gpu:a100:4 64   512000 3-00:00:00
```

Common Pitfalls

1. Forgetting --time

Without a time limit, Slurm uses the partition default (often very short). Your 24-hour training run gets killed after 1 hour.

2. Requesting more resources than exist

--gres=gpu:8 on a 4-GPU node queues the job forever. Always check sinfo for available resources.

3. Not using %j in output paths

Multiple jobs writing to the same output file overwrite each other. Always include the job ID: --output=logs/%j.out.

4. Ignoring squeue during long waits

A job stuck in PENDING usually means resources are unavailable or the request exceeds partition limits. Check squeue --start to see the estimated start time, or scontrol show job <id> for the reason.

Troubleshooting

Why Is My Job PENDING?

Check the reason with squeue -j JOBID -o "%R":

| Reason | Meaning | Fix |
| --- | --- | --- |
| Resources | Not enough free nodes/GPUs | Wait, or reduce resource request |
| Priority | Lower priority than other jobs | Wait, or check sprio -j JOBID for your score |
| Dependency | Waiting for another job | Check dependent job with scontrol show job |
| QOSMaxJobsPerUserLimit | Too many running jobs | Wait for current jobs to finish |
| ReqNodeNotAvail | Requested nodes are down | Use sinfo -N to find available nodes |
| AssocGrpCpuLimit | Account CPU limit exceeded | Check limits with sacctmgr show assoc user=$USER |

Why Did My Job Fail?

```shell
# Check exit code
sacct -j JOBID --format=JobID,State,ExitCode,DerivedExitCode

# Exit code format: return_code:signal
#   0:0  = success
#   1:0  = script error (check your code)
#   0:9  = killed by signal 9  (SIGKILL — OOM or time limit)
#   0:15 = killed by signal 15 (SIGTERM — preempted or scancel)
```
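The two halves of that `return_code:signal` field split cleanly with shell parameter expansion, which is handy in wrapper scripts that decide whether to retry. A small sketch with a hard-coded example value:

```shell
# Split sacct's return_code:signal field into its two parts.
exit_field="0:9"                  # example value, as sacct would print it
return_code=${exit_field%%:*}     # text before the colon
signal_num=${exit_field##*:}      # text after the colon

echo "return=$return_code signal=$signal_num"   # return=0 signal=9

if [ "$signal_num" -ne 0 ]; then
  echo "killed by signal $signal_num"           # killed by signal 9
fi
```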

Graceful Shutdown on Time Limit

```shell
#SBATCH --signal=B:USR1@300
# Send SIGUSR1 300 seconds before the time limit
```

In your Python script:

```python
import signal
import sys

def handler(signum, frame):
    save_checkpoint()   # write model/optimizer state before the job is killed
    sys.exit(0)

signal.signal(signal.SIGUSR1, handler)
```
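For jobs whose driver is a shell script rather than Python, the same idea can be sketched with a bash trap. This is a cluster-free sketch: `kill -USR1 $$` stands in for Slurm delivering the signal, and the echo stands in for real checkpoint logic. As I understand it, the `B:` prefix in `--signal` asks Slurm to signal the batch shell itself, which is exactly what a trap like this catches.

```shell
# Trap SIGUSR1 and checkpoint before carrying on (or exiting).
checkpoint_and_exit() {
  echo "checkpoint saved"   # stand-in for real checkpoint logic
}
trap checkpoint_and_exit USR1

kill -USR1 $$     # in a real job, Slurm sends this signal
echo "resuming work"
```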

Key Takeaways

  1. Three daemons — slurmctld (scheduler), slurmd (execution), slurmdbd (accounting). Each has a distinct role.

  2. Three submission commands — sbatch (batch), srun (inline/parallel), salloc (interactive allocation).

  3. Jobs follow a state machine — PENDING → RUNNING → COMPLETED. A job stays PENDING until resources are available.

  4. Partitions and QOS control access — which hardware you can use, for how long, and at what priority.
