
Slurm Resource Management and Job Priority

How Slurm decides which jobs run first — priority factors, fair-share scheduling, backfill, and monitoring commands (squeue, sinfo, sacct).


Monitoring Your Cluster

Three commands give you complete visibility into a Slurm cluster’s state.

squeue — View the Job Queue

Shows all pending and running jobs. The most frequently used monitoring command.

# All jobs
squeue

# Your jobs only
squeue -u $USER

# Detailed format
squeue -o "%.8i %.9P %.20j %.8u %.2t %.10M %.6D %R"
#    JOBID PARTITION          NAME     USER ST       TIME  NODES REASON
#    12345       gpu  train-resnet    abhik  R    2:15:30      2 None
#    12346       gpu      evaluate    abhik PD       0:00      1 Resources

Key state codes: PD (pending), R (running), CG (completing), CD (completed), F (failed), TO (timeout).

The REASON column for pending jobs tells you why: Resources (no free nodes), Priority (lower priority than others), QOSMaxJobsPerUser (hit quota).
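As an illustration, pending-job reasons can be tallied from formatted `squeue` output — a minimal Python sketch, assuming the hypothetical sample output and job IDs below:

```python
from collections import Counter

# Hypothetical output of: squeue -h -t PENDING -o "%i %r"
sample = """12346 Resources
12347 Priority
12348 Priority
12349 QOSMaxJobsPerUser"""

def tally_pending_reasons(text):
    """Count how many pending jobs share each REASON code."""
    reasons = (line.split(maxsplit=1)[1]
               for line in text.splitlines() if line.strip())
    return Counter(reasons)

print(tally_pending_reasons(sample))
```

If most of your pending jobs say Priority, the fix is fair-share related; if they say Resources, the cluster is simply full.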

sinfo — Node and Partition Status

Shows the cluster’s hardware landscape: which partitions exist, how many nodes are idle, allocated, or down.

sinfo -o "%P %a %l %D %t %N"
# PARTITION AVAIL  TIMELIMIT NODES STATE NODELIST
# cpu*      up     7-00:00:0    10 idle  node[01-10]
# gpu       up     3-00:00:0     2 alloc node[11-12]
# gpu       up     3-00:00:0     2 idle  node[13-14]

Node states: idle (free), alloc (fully allocated), mix (partially allocated), drain (being taken offline), down (unavailable).

sacct — Historical Job Data

Queries the job database for completed jobs. Essential for post-mortem analysis: how much memory did the job actually use? How long did it run?

sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode
# JobID        JobName      Elapsed   MaxRSS  State      ExitCode
# 12345        train-resn+  08:15:42  45.2G   COMPLETED  0:0
# 12345.0      srun         08:15:40  42.1G   COMPLETED  0:0

The .0 Suffix

sacct shows both the job allocation (12345) and individual steps (12345.0, 12345.1). The step-level data has the real resource usage — the allocation-level data is a summary.
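For scripting, `sacct`'s pipe-delimited output (`--parsable2`) makes the step-level numbers easy to extract — in that format the allocation line's MaxRSS field is empty, while steps carry the real values. A minimal sketch, using hypothetical output for job 12345:

```python
# Hypothetical output of:
#   sacct -j 12345 --parsable2 --noheader --format=JobID,MaxRSS
sample = """12345|
12345.0|42.1G
12345.1|1.5G"""

def step_max_rss(text):
    """Return {step_id: MaxRSS} for job steps only (IDs containing '.')."""
    out = {}
    for line in text.splitlines():
        job_id, max_rss = line.split("|")
        if "." in job_id and max_rss:
            out[job_id] = max_rss
    return out

print(step_max_rss(sample))
```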

How Slurm Decides What Runs Next

When multiple jobs are PENDING and resources free up, Slurm must choose which job to schedule first. The answer: the multi-factor priority formula.

The Priority Formula

Job_priority =
      PriorityWeightAge       * age_factor
    + PriorityWeightFairshare * fairshare_factor
    + PriorityWeightJobSize   * jobsize_factor
    + PriorityWeightQOS       * qos_factor
    + PriorityWeightPartition * partition_factor

Each factor is normalized to [0, 1], then multiplied by its weight (configured in slurm.conf). The resulting scores are summed into a single priority number. Higher priority = scheduled sooner.
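To make the arithmetic concrete, here is a sketch of the weighted sum with made-up weights and factor values — real weights come from your site's slurm.conf, and these numbers are purely illustrative:

```python
# Illustrative slurm.conf-style weights (site-specific in practice).
weights = {
    "age": 1000,
    "fairshare": 10000,
    "qos": 2000,
    "jobsize": 500,
    "partition": 1000,
}

# Hypothetical normalized factors in [0, 1] for one pending job.
factors = {
    "age": 0.30,        # has waited a while, not yet saturated
    "fairshare": 0.667,
    "qos": 1.0,
    "jobsize": 0.05,
    "partition": 0.5,
}

def job_priority(weights, factors):
    """Weighted sum of normalized factors, as in the formula above."""
    return sum(weights[k] * factors[k] for k in weights)

print(round(job_priority(weights, factors)))  # → 9495
```

Note how the large fair-share weight lets that one factor dominate the total, which matches how most sites configure it.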

Priority Factors Explained

  • Age: How long the job has been waiting. Increases linearly until PriorityMaxAge (typically 7 days), then saturates. Prevents indefinite starvation.

  • Fair-Share: Based on your historical resource consumption vs your allocation target. If you’ve used less than your share, your score is high. If you’ve been consuming heavily, it drops. This is the main fairness mechanism.

  • Job Size: Larger jobs get a slight priority boost because they’re harder to schedule. This nudges the scheduler to place them when large blocks of resources free up.

  • QOS Priority: An administrator-assigned priority per Quality of Service level. A high QOS might add 1000 priority points.

  • Partition Priority: Some partitions are more valuable than others. A gpu partition might have higher base priority than cpu.
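The age factor's linear-then-saturate behavior described above is simple to sketch — the 7-day PriorityMaxAge here is the typical value mentioned earlier, not a universal default:

```python
def age_factor(wait_seconds, priority_max_age=7 * 24 * 3600):
    """Grow linearly with wait time, saturating at 1.0 after PriorityMaxAge."""
    return min(wait_seconds / priority_max_age, 1.0)

print(age_factor(2 * 24 * 3600))   # waited 2 of 7 days
print(age_factor(30 * 24 * 3600))  # long past the cap → 1.0
```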

Backfill Scheduling

Strict priority ordering creates a problem: if the highest-priority job needs 64 nodes and only 4 are free, all 4 nodes sit idle until 60 more become available. Backfill solves this.

The backfill scheduler looks at lower-priority jobs and asks: “can this small job run and finish before the large job’s resources become available?” If yes, it starts the small job now, improving utilization without delaying the large job.
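The core backfill test can be stripped down to one comparison — this sketch ignores node counts and assumes the large job's reserved start time is already known:

```python
def can_backfill(small_job_limit_s, reservation_start_s, now_s=0):
    """A small job may start now only if its --time limit guarantees it
    finishes before the blocked large job's reserved start time."""
    return now_s + small_job_limit_s <= reservation_start_s

# The large job's nodes become free in 4 hours.
print(can_backfill(2 * 3600, 4 * 3600))       # 2-hour limit fits → True
print(can_backfill(7 * 24 * 3600, 4 * 3600))  # 7-day limit → False
```

Note that the decision uses the requested time limit, not the actual runtime — which is exactly why inflated limits block backfill.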

This is why specifying accurate --time limits matters. If your job says --time=7-00:00:00 but actually runs for 2 hours, the backfill scheduler can’t use those 7 days of “reserved” time for other jobs.

Accurate Time Limits Matter

Overestimating --time hurts cluster utilization. The backfill scheduler reserves resources for the full requested duration, so a 2-hour job with a 7-day limit blocks seven days’ worth of backfill scheduling decisions.

Fair-Share: The Fairness Engine

Fair-share scheduling prevents any single user or group from monopolizing the cluster. Each user (or account) has a target share of resources. As you consume resources, your fair-share score drops, and your future jobs get lower priority.

The decay is configurable: recent usage matters more than old usage. A common setup uses a half-life of 14 days — your heavy usage two weeks ago has half the impact of yesterday’s usage.
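The half-life decay can be sketched directly — the 14-day half-life below is the example value from above, not a Slurm default:

```python
def decayed_usage(raw_usage, age_days, half_life_days=14):
    """Weight past resource usage by an exponential half-life decay."""
    return raw_usage * 0.5 ** (age_days / half_life_days)

print(decayed_usage(1000, 0))   # → 1000.0  today's usage counts fully
print(decayed_usage(1000, 14))  # → 500.0   two weeks old: half weight
print(decayed_usage(1000, 28))  # → 250.0   four weeks old: quarter weight
```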

# Check your fair-share score
sshare -u $USER
# Account   User   RawShares  NormShares  RawUsage  EffectvUsage  FairShare
# mygroup   abhik        100        0.10     45231          0.15      0.667

A fair-share value of 1.0 means you’ve used less than your share (high priority). A value near 0.0 means you’ve consumed well beyond your allocation (low priority).

Common Pitfalls

1. Not checking squeue --start

If your job has been PENDING for hours, squeue --start shows the estimated start time. If it says N/A, the job likely has an unsatisfiable request.

2. Ignoring sacct for debugging

When a job fails, check sacct -j <id> --format=State,ExitCode,MaxRSS,Elapsed. OOM kills show as OUT_OF_MEMORY, timeouts as TIMEOUT.

3. Requesting maximum time limits

Always request realistic time limits. --time=7-00:00:00 for a 3-hour job hurts everyone by blocking backfill scheduling.

4. Not monitoring fair-share

If your jobs are consistently low priority, check sshare. You may be exceeding your allocation. Spread submissions over time or coordinate with your team.

Key Takeaways

  1. Three monitoring commands — squeue (queue), sinfo (cluster), sacct (history). Use all three.

  2. Priority is multi-factor — age, fair-share, job size, QOS, and partition all contribute. Fair-share usually dominates.

  3. Backfill improves utilization — accurate time limits help small jobs fill gaps while large jobs wait for resources.

  4. Fair-share enforces fairness — recent heavy usage lowers your priority. Spread consumption to maintain fair-share score.
