Monitoring Your Cluster
Three commands cover nearly everything you need to see about a Slurm cluster's state: the job queue, the nodes, and the job history.
squeue — View the Job Queue
Shows all pending and running jobs. The most frequently used monitoring command.
```shell
# All jobs
squeue

# Your jobs only
squeue -u $USER

# Detailed format
squeue -o "%.8i %.9P %.20j %.8u %.2t %.10M %.6D %R"
#   JOBID PARTITION         NAME    USER ST    TIME NODES REASON
#   12345       gpu train-resnet   abhik  R 2:15:30     2 None
#   12346       gpu     evaluate   abhik PD    0:00     1 Resources
```
Key state codes: PD (pending), R (running), CG (completing), CD (completed), F (failed), TO (timeout).
The REASON column for pending jobs tells you why: Resources (no free nodes), Priority (lower priority than others), QOSMaxJobsPerUser (hit quota).
sinfo — Node and Partition Status
Shows the cluster’s hardware landscape: which partitions exist, how many nodes are idle, allocated, or down.
```shell
sinfo -o "%P %a %l %D %t %N"
# PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
# cpu*      up    7-00:00:0  10    idle  node[01-10]
# gpu       up    3-00:00:0  2     alloc node[11-12]
# gpu       up    3-00:00:0  2     idle  node[13-14]
```
Node states: idle (free), alloc (fully allocated), mix (partially allocated), drain (being taken offline), down (unavailable).
sacct — Historical Job Data
Queries the job database for completed jobs. Essential for post-mortem analysis: how much memory did the job actually use? How long did it run?
```shell
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode
# JobID    JobName      Elapsed   MaxRSS  State      ExitCode
# 12345    train-resn+  08:15:42  45.2G   COMPLETED  0:0
# 12345.0  srun         08:15:40  42.1G   COMPLETED  0:0
```
The .0 Suffix
sacct shows both the job allocation (12345) and individual steps (12345.0, 12345.1). The step-level data has the real resource usage — the allocation-level data is a summary.
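Step-level MaxRSS values come back as human-readable strings like 42.1G, which are awkward to compare against a job's memory request in scripts. A small helper to normalize them to bytes can be sketched as follows; it assumes sacct's usual K/M/G/T binary suffixes and treats a bare number as bytes.

```python
# Convert sacct's human-readable MaxRSS strings (e.g. "42.1G", "1536K")
# into bytes so they can be compared against the job's memory request.
# Assumes binary (1024-based) suffixes; a plain number is taken as bytes.

_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def maxrss_to_bytes(value: str) -> int:
    value = value.strip()
    if not value:
        return 0
    suffix = value[-1].upper()
    if suffix in _SUFFIXES:
        return int(float(value[:-1]) * _SUFFIXES[suffix])
    return int(float(value))

print(maxrss_to_bytes("42.1G"))  # step-level peak RSS in bytes
```

Feed this the step-level rows (12345.0, 12345.1), not the allocation row, since those carry the real usage numbers.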
How Slurm Decides What Runs Next
When multiple jobs are PENDING and resources free up, Slurm must choose which job to schedule first. The answer: the multi-factor priority formula.
The Priority Formula
```
Job_priority =
    PriorityWeightAge       * age_factor
  + PriorityWeightFairshare * fairshare_factor
  + PriorityWeightJobSize   * jobsize_factor
  + PriorityWeightQOS       * qos_factor
  + PriorityWeightPartition * partition_factor
```
Each factor is normalized to [0, 1], then multiplied by its weight (configured in slurm.conf). The resulting scores are summed into a single priority number. Higher priority = scheduled sooner.
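As a concrete illustration, the weighted sum can be sketched in a few lines of Python. The weight values below are made-up examples, not Slurm defaults; real weights are set by administrators in slurm.conf.

```python
# Sketch of Slurm's multifactor priority sum. Each factor is a
# normalized value in [0, 1] computed by the scheduler; the weights
# here are illustrative placeholders, not real slurm.conf defaults.

WEIGHTS = {
    "age": 1000,
    "fairshare": 10000,
    "jobsize": 100,
    "qos": 2000,
    "partition": 500,
}

def job_priority(factors: dict) -> int:
    # Missing factors count as 0 (no contribution).
    return int(sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS))

# A job that has waited a while but belongs to a heavy user:
print(job_priority({"age": 0.8, "fairshare": 0.2, "jobsize": 0.1,
                    "qos": 0.5, "partition": 1.0}))  # -> 4310
```

Note how the fair-share weight dwarfs the others in this example: even a well-aged job from a heavy user can be outranked by a fresh job from a light user.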
Priority Factors Explained
- Age: How long the job has been waiting. Increases linearly until PriorityMaxAge (typically 7 days), then saturates. Prevents indefinite starvation.
- Fair-Share: Based on your historical resource consumption versus your allocation target. If you've used less than your share, your score is high; if you've been consuming heavily, it drops. This is the main fairness mechanism.
- Job Size: Larger jobs get a slight priority boost because they're harder to schedule. This encourages the scheduler to run them when large blocks of resources are available.
- QOS Priority: An administrator-assigned priority per Quality of Service level. A high QOS might add 1000 priority points.
- Partition Priority: Some partitions are more valuable than others. A gpu partition might have a higher base priority than cpu.

To see how these factors combine for a specific pending job, run sprio -j <jobid>, which prints the per-factor breakdown.
Backfill Scheduling
Strict priority ordering creates a problem: if the highest-priority job needs 64 nodes and only 4 are free, all 4 nodes sit idle until 60 more become available. Backfill solves this.
The backfill scheduler looks at lower-priority jobs and asks: “can this small job run and finish before the large job’s resources become available?” If yes, it starts the small job now, improving utilization without delaying the large job.
This is why specifying accurate --time limits matters. If your job requests --time=7-00:00:00 but actually finishes in 2 hours, the scheduler plans around a 7-day reservation: your job cannot slip into short gaps, and other jobs cannot be backfilled around it.
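The core backfill test reduces to a single comparison: a lower-priority job may start now only if its requested time limit ends before the blocked top job's reserved start. The sketch below is a deliberately simplified model (one reservation, requested limits taken at face value), not Slurm's actual implementation.

```python
# Simplified backfill check: a lower-priority job may jump the queue
# only if, based on its *requested* time limit, it would finish before
# the high-priority job's resource reservation begins. This models why
# padded --time requests destroy backfill opportunities.

def can_backfill(now_h: float, requested_limit_h: float,
                 reservation_start_h: float) -> bool:
    return now_h + requested_limit_h <= reservation_start_h

# The reservation for a large 64-node job starts 4 hours from now.
print(can_backfill(0.0, 2.0, 4.0))    # honest 2-hour request: True
print(can_backfill(0.0, 168.0, 4.0))  # padded 7-day request: False
```

The scheduler only ever sees the requested limit, never the actual runtime, which is why the 7-day request is rejected even though the job would really finish in 2 hours.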
Accurate Time Limits Matter
Overestimating --time hurts cluster utilization. The backfill scheduler
reserves resources for the full requested duration. A 2-hour job with a 7-day
limit blocks backfill for 7 days' worth of scheduling decisions.
Fair-Share: The Fairness Engine
Fair-share scheduling prevents any single user or group from monopolizing the cluster. Each user (or account) has a target share of resources. As you consume resources, your fair-share score drops, and your future jobs get lower priority.
The decay is configurable: recent usage matters more than old usage. A common setup uses a half-life of 14 days — your heavy usage two weeks ago has half the impact of yesterday’s usage.
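The half-life decay can be written directly: usage recorded t days ago is weighted by 0.5 ** (t / half_life). A minimal sketch, assuming the 14-day half-life from the example above (the actual value is set by PriorityDecayHalfLife in slurm.conf):

```python
# Exponential half-life decay of historical usage, as applied by
# fair-share accounting. With a 14-day half-life, usage from 14 days
# ago counts half as much as usage from right now.

HALF_LIFE_DAYS = 14.0  # example value; set via PriorityDecayHalfLife

def decayed_usage(raw_usage: float, age_days: float) -> float:
    return raw_usage * 0.5 ** (age_days / HALF_LIFE_DAYS)

print(decayed_usage(1000.0, 0.0))   # today: full weight -> 1000.0
print(decayed_usage(1000.0, 14.0))  # two weeks ago -> 500.0
print(decayed_usage(1000.0, 28.0))  # four weeks ago -> 250.0
```

The practical consequence: a burst of heavy usage hurts your priority immediately but fades on its own, so backing off for a couple of weeks restores most of your fair-share score.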
```shell
# Check your fair-share score
sshare -u $USER
# Account   User   RawShares  NormShares  RawUsage  EffectvUsage  FairShare
# mygroup   abhik  100        0.10        45231     0.15          0.667
```
A fair-share value of 1.0 means you’ve used less than your share (high priority). A value near 0.0 means you’ve consumed well beyond your allocation (low priority).
Common Pitfalls
1. Not checking squeue --start
If your job has been PENDING for hours, squeue --start shows the estimated start time. If it says N/A, the job likely has an unsatisfiable request.
2. Ignoring sacct for debugging
When a job fails, check sacct -j <id> --format=State,ExitCode,MaxRSS,Elapsed. OOM kills show as OUT_OF_MEMORY, timeouts as TIMEOUT.
3. Requesting maximum time limits
Always request realistic time limits. --time=7-00:00:00 for a 3-hour job hurts everyone by blocking backfill scheduling.
4. Not monitoring fair-share
If your jobs are consistently low priority, check sshare. You may be exceeding your allocation. Spread submissions over time or coordinate with your team.
Key Takeaways
- Three monitoring commands — squeue (queue), sinfo (cluster), sacct (history). Use all three.
- Priority is multi-factor — age, fair-share, job size, QOS, and partition all contribute. Fair-share usually dominates.
- Backfill improves utilization — accurate time limits help small jobs fill gaps while large jobs wait for resources.
- Fair-share enforces fairness — recent heavy usage lowers your priority. Spread consumption to maintain fair-share score.
Related Concepts
- Slurm Fundamentals: Core commands, job lifecycle, and partitions
- Slurm GPU Allocation: GPU-specific resource management
- Distributed Parallelism: Training strategies that depend on Slurm resource management
Further Reading
- Slurm Priority/Multifactor Plugin - Official documentation on how Slurm calculates job priority
- Slurm Fair-share Scheduling - SchedMD's explanation of the fair-tree algorithm
- squeue, sinfo, sacct man pages - Complete command reference for Slurm monitoring tools
