Slurm Fundamentals: Job Scheduling on HPC Clusters

Complete guide to Slurm — architecture, core commands, job lifecycle, job scripts, array jobs, dependencies, monitoring with squeue/sacct, and troubleshooting failed jobs on HPC clusters.

What Is Slurm?

Slurm (Simple Linux Utility for Resource Management) is the dominant job scheduler for HPC clusters. It decides which jobs run, on which nodes, and when. If you’ve trained a model on a multi-node GPU cluster, Slurm almost certainly managed the resource allocation.

At its core, Slurm solves a bin-packing problem: given N compute nodes with finite CPUs, GPUs, and memory, schedule M jobs to maximize utilization while respecting resource constraints and fairness policies.

Architecture: The Three Daemons

Slurm runs three processes that coordinate job scheduling across the cluster.

slurmctld (Controller)

The central brain. It maintains the cluster state — which nodes are up, which jobs are queued, which resources are free. All scheduling decisions happen here. Typically runs on a dedicated management node with a backup for high availability.

slurmd (Compute Node Daemon)

Runs on every compute node. It receives job assignments from slurmctld, launches user processes, monitors resource usage, and reports status back. Think of it as the local execution agent.

slurmdbd (Database Daemon)

Optional but common in production. Records job history, resource usage, and accounting data to a database (usually MySQL/MariaDB). Enables sacct queries and fair-share scheduling.

Why Three Daemons?

Separation of concerns: the controller makes scheduling decisions, the node daemons execute them, and the database daemon records everything. This architecture scales to clusters with thousands of nodes.

Core Commands

sbatch — Submit Batch Jobs

The most common command. Submits a job script to the queue and returns immediately with a job ID. The script runs later when resources become available.

```shell
sbatch train.sh
# Submitted batch job 12345
```

srun — Run Interactive/Parallel Tasks

Runs a command directly on allocated resources. Used for interactive work or to launch parallel tasks within an existing allocation. Blocks until the command completes.

```shell
srun --nodes=2 --ntasks=8 python distributed_train.py
```

salloc — Allocate Resources

Requests an interactive allocation without running anything. You get a shell on the allocated resources and can run commands manually. Useful for debugging.

```shell
salloc --nodes=1 --gres=gpu:2 --time=01:00:00
# salloc: Granted job allocation 12346
```

Job Lifecycle

Every Slurm job follows the same state machine: submitted → queued → scheduled → running → completed (or failed). The transition from PENDING to RUNNING depends on resource availability, priority, and partition constraints.

Monitoring Jobs

Three commands give you visibility into what’s happening on the cluster:

squeue — What’s Running Now

```shell
# Your jobs
squeue -u $USER

# All GPU jobs with custom format
squeue -p gpu -o "%.8i %.9P %.20j %.8u %.2t %M %l %R"

# Estimated start time for pending jobs
squeue --start -u $USER
```

sacct — What Already Happened

```shell
# Job history with efficiency
sacct -j 12345 --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS,ExitCode

# All your jobs from the last 7 days
sacct -u $USER --starttime=$(date -d '7 days ago' +%Y-%m-%d) --format=JobID,JobName,State,Elapsed
```

scontrol — Full Job Details

```shell
# Everything about a specific job
scontrol show job 12345

# Key fields to check:
#   Reason    (why PENDING)
#   ExitCode  (why FAILED)
#   RunTime vs TimeLimit
```

Writing Job Scripts

A Slurm job script is a shell script with #SBATCH directives that specify resource requirements:

```shell
#!/bin/bash
#SBATCH --job-name=train-resnet
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:a100:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=08:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.com

set -euo pipefail
trap 'echo "Job $SLURM_JOB_ID failed at line $LINENO" >&2' ERR

# --- Environment ---
module load cuda/12.1
source activate myenv

# --- Run ---
srun python train.py --epochs 100 --checkpoint-dir /scratch/$USER/$SLURM_JOB_ID
```

Key directives:

  • --job-name: Human-readable name (shows in squeue)
  • --partition: Which queue to submit to (determines available hardware)
  • --nodes: Number of compute nodes
  • --ntasks-per-node: Processes per node
  • --gres: Generic resources like GPUs
  • --time: Maximum wall time (job killed if exceeded)
  • --output/--error: Where stdout/stderr go (%j expands to job ID)
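Slurm expands the filename patterns itself when the job starts, but the substitution is easy to mimic without a cluster. A small sketch (the job name and ID below are made-up values, and `expand` is a hypothetical helper, not a Slurm command):

```shell
# Mimic Slurm's %x/%j filename expansion with bash parameter substitution.
SLURM_JOB_NAME="train-resnet"   # made-up example values
SLURM_JOB_ID=12345

expand() {
  local p=$1
  p=${p//%x/$SLURM_JOB_NAME}   # %x -> job name
  p=${p//%j/$SLURM_JOB_ID}     # %j -> job ID
  echo "$p"
}

expand "logs/%x_%j.out"   # logs/train-resnet_12345.out
```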

Error Handling Matters

set -euo pipefail stops the script on any error. Without it, a failed module load silently continues, and your training runs on CPU without you knowing. The trap line prints which line failed.
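A cluster-free sketch of the difference, where `false` stands in for a setup step that fails (such as a bad `module load`):

```shell
# Without the guard, the failed step is ignored and the script keeps going.
run_without_guard() {
  bash -c 'false; echo "training started"'
}

# With set -euo pipefail, the first failure aborts the whole script.
run_with_guard() {
  bash -c 'set -euo pipefail; false; echo "training started"'
}

run_without_guard                 # prints "training started"
run_with_guard || echo "aborted"  # prints "aborted"
```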

Array Jobs

When you need to run the same script with different parameters, array jobs are far better than submitting 100 separate sbatch calls:

```shell
#!/bin/bash
#SBATCH --job-name=hp-sweep
#SBATCH --array=0-99
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --output=logs/sweep_%A_%a.out
# %A = array master job ID, %a = array task index

LEARNING_RATES=(1e-2 1e-3 1e-4 5e-4 1e-5)
LR_IDX=$((SLURM_ARRAY_TASK_ID % 5))
SEED=$((SLURM_ARRAY_TASK_ID / 5))

python train.py --lr ${LEARNING_RATES[$LR_IDX]} --seed $SEED
```
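The modulo/division arithmetic maps each task ID to one (learning rate, seed) pair. It can be sanity-checked outside Slurm by looping over sample task IDs (`map_task` is a hypothetical helper using the same five-element list):

```shell
# Check the task-ID -> (lr, seed) mapping without a cluster.
LEARNING_RATES=(1e-2 1e-3 1e-4 5e-4 1e-5)

map_task() {
  local id=$1
  local lr_idx=$((id % 5))   # cycles through the 5 learning rates
  local seed=$((id / 5))     # advances every 5 tasks
  echo "task $id -> lr=${LEARNING_RATES[$lr_idx]} seed=$seed"
}

for id in 0 4 5 99; do map_task "$id"; done
# task 0 -> lr=1e-2 seed=0
# task 4 -> lr=1e-5 seed=0
# task 5 -> lr=1e-2 seed=1
# task 99 -> lr=1e-5 seed=19
```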

Array Job Controls

```shell
# Submit array of 100 jobs, max 10 running at once
sbatch --array=0-99%10 sweep.sh

# Cancel specific array tasks
scancel 12345_[50-99]

# Check array job status (-r shows individual tasks)
squeue -j 12345 -r
```

Array vs Loop

Don’t submit in a bash loop (for i in {1..100}; do sbatch ...; done). Array jobs are a single submission that Slurm can schedule more efficiently, and you get one master job ID to manage all tasks.

Job Dependencies

Chain jobs together with --dependency so downstream steps only run when upstream steps succeed:

```shell
# Submit pipeline with dependencies
DATA_JOB=$(sbatch --parsable data-prep.sh)
TRAIN_JOB=$(sbatch --parsable --dependency=afterok:$DATA_JOB train.sh)
EVAL_JOB=$(sbatch --parsable --dependency=afterok:$TRAIN_JOB eval.sh)
```

Dependency Types

| Flag | Meaning | Use Case |
| --- | --- | --- |
| afterok:JOBID | Run only if dependency succeeded | Standard pipeline |
| afternotok:JOBID | Run only if dependency failed | Retry/recovery logic |
| afterany:JOBID | Run regardless of outcome | Cleanup scripts |
| after:JOBID | Run after dependency starts | Overlapping stages |
| afterok:JOB1:JOB2 | Run after ALL listed jobs succeed | Fan-in from parallel tasks |

Use --parsable

Always use sbatch --parsable when capturing job IDs for dependencies. Without it, sbatch returns a human-readable string like “Submitted batch job 12345” that you’d need to parse.
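The difference can be sketched without a cluster; below, plain variables stand in for the two output styles, and the awk call shows the parsing you would otherwise need:

```shell
# Simulated sbatch output (echo stands in for sbatch, so no cluster is needed).
default_output="Submitted batch job 12345"   # what plain sbatch prints
parsable_output="12345"                      # what sbatch --parsable prints

# Without --parsable you have to strip the prefix yourself:
job_id=$(echo "$default_output" | awk '{print $4}')
echo "$job_id"            # 12345

# With --parsable the output already is the ID:
echo "$parsable_output"   # 12345
```

One caveat: on federated multi-cluster setups, `--parsable` can append the cluster name after a semicolon (`12345;clustername`), so scripts meant to be portable may still want to split on `;`.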

Partitions and QOS

Partitions group nodes by capability: a gpu partition might contain nodes with A100s, while a cpu partition offers higher-memory nodes without GPUs. Each partition has its own limits (max wall time, max nodes per job, priority weight).

Quality of Service (QOS) adds another layer: a high QOS might allow longer wall times or higher priority, while preempt allows jobs that can be killed when higher-priority work arrives.

```shell
# Check available partitions
sinfo -o "%P %N %G %c %m %l"
# PARTITION NODES       GRES       CPUS MEMORY TIMELIMIT
# cpu*      node[01-10] (null)     64   256000 7-00:00:00
# gpu       node[11-14] gpu:a100:4 64   512000 3-00:00:00
```

Common Pitfalls

1. Forgetting --time

Without a time limit, Slurm uses the partition default (often very short). Your 24-hour training run gets killed after 1 hour.

2. Requesting more resources than exist

--gres=gpu:8 on a 4-GPU node queues the job forever. Always check sinfo for available resources.

3. Not using %j in output paths

Multiple jobs writing to the same output file overwrite each other. Always include the job ID: --output=logs/%j.out.

4. Ignoring squeue during long waits

A job stuck in PENDING usually means resources are unavailable or the request exceeds partition limits. Check squeue --start to see the estimated start time, or scontrol show job <id> for the reason.

Troubleshooting

Why Is My Job PENDING?

Check the reason with squeue -j JOBID -o "%R":

| Reason | Meaning | Fix |
| --- | --- | --- |
| Resources | Not enough free nodes/GPUs | Wait, or reduce resource request |
| Priority | Lower priority than other jobs | Wait, or check sprio -j JOBID for your score |
| Dependency | Waiting for another job | Check dependent job with scontrol show job |
| QOSMaxJobsPerUserLimit | Too many running jobs | Wait for current jobs to finish |
| ReqNodeNotAvail | Requested nodes are down | Use sinfo -N to find available nodes |
| AssocGrpCpuLimit | Account CPU limit exceeded | Check limits with sacctmgr show assoc user=$USER |

Why Did My Job Fail?

```shell
# Check exit code
sacct -j JOBID --format=JobID,State,ExitCode,DerivedExitCode

# Exit code format: return_code:signal
#   0:0  = success
#   1:0  = script error (check your code)
#   0:9  = killed by signal 9  (SIGKILL — OOM or time limit)
#   0:15 = killed by signal 15 (SIGTERM — preempted or scancel)
```
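The two halves of that `return_code:signal` field split cleanly with shell parameter expansion, which is handy in wrapper scripts that decide whether to retry. A small sketch with a hard-coded example value:

```shell
# Split sacct's return_code:signal field into its two parts.
exit_field="0:9"                  # example value, as sacct would print it
return_code=${exit_field%%:*}     # text before the colon
signal_num=${exit_field##*:}      # text after the colon

echo "return=$return_code signal=$signal_num"   # return=0 signal=9

if [ "$signal_num" -ne 0 ]; then
  echo "killed by signal $signal_num"           # killed by signal 9
fi
```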

Graceful Shutdown on Time Limit

```shell
#SBATCH --signal=B:USR1@300
# Send SIGUSR1 300 seconds before the time limit
```

In your Python script:

```python
import signal
import sys

def handler(signum, frame):
    save_checkpoint()   # write model/optimizer state before the job is killed
    sys.exit(0)

signal.signal(signal.SIGUSR1, handler)
```
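For jobs whose driver is a shell script rather than Python, the same idea can be sketched with a bash trap. This is a cluster-free sketch: `kill -USR1 $$` stands in for Slurm delivering the signal, and the echo stands in for real checkpoint logic. As I understand it, the `B:` prefix in `--signal` asks Slurm to signal the batch shell itself, which is exactly what a trap like this catches.

```shell
# Trap SIGUSR1 and checkpoint before carrying on (or exiting).
checkpoint_and_exit() {
  echo "checkpoint saved"   # stand-in for real checkpoint logic
}
trap checkpoint_and_exit USR1

kill -USR1 $$     # in a real job, Slurm sends this signal
echo "resuming work"
```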

Key Takeaways

  1. Three daemons — slurmctld (scheduler), slurmd (execution), slurmdbd (accounting). Each has a distinct role.

  2. Three submission commands — sbatch (batch), srun (inline/parallel), salloc (interactive allocation).

  3. Jobs follow a state machine — PENDING → RUNNING → COMPLETED. A job stays PENDING until resources are available.

  4. Partitions and QOS control access — which hardware you can use, for how long, and at what priority.
