Monitoring Your Cluster
Three commands cover nearly everything you need to see about a Slurm cluster's state: the job queue, the nodes, and the job history.
squeue — View the Job Queue
Shows all pending and running jobs. The most frequently used monitoring command.
```shell
# All jobs
squeue

# Your jobs only
squeue -u $USER

# Detailed format
squeue -o "%.8i %.9P %.20j %.8u %.2t %.10M %.6D %R"
#   JOBID PARTITION         NAME    USER ST    TIME NODES REASON
#   12345       gpu train-resnet   abhik  R 2:15:30     2 None
#   12346       gpu     evaluate   abhik PD    0:00     1 Resources
```
Key state codes: PD (pending), R (running), CG (completing), CD (completed), F (failed), TO (timeout).
The REASON column for pending jobs tells you why: Resources (no free nodes), Priority (lower priority than others), QOSMaxJobsPerUser (hit quota).
sinfo — Node and Partition Status
Shows the cluster’s hardware landscape: which partitions exist, how many nodes are idle, allocated, or down.
```shell
sinfo -o "%P %a %l %D %t %N"
# PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
# cpu*      up    7-00:00:0  10    idle  node[01-10]
# gpu       up    3-00:00:0  2     alloc node[11-12]
# gpu       up    3-00:00:0  2     idle  node[13-14]
```
Node states: idle (free), alloc (fully allocated), mix (partially allocated), drain (being taken offline), down (unavailable).
sacct — Historical Job Data
Queries the job database for completed jobs. Essential for post-mortem analysis: how much memory did the job actually use? How long did it run?
```shell
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode
# JobID    JobName      Elapsed   MaxRSS  State      ExitCode
# 12345    train-resn+  08:15:42  45.2G   COMPLETED  0:0
# 12345.0  srun         08:15:40  42.1G   COMPLETED  0:0
```
The .0 Suffix
sacct shows both the job allocation (12345) and individual steps (12345.0, 12345.1). The step-level data has the real resource usage — the allocation-level data is a summary.
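Step-level MaxRSS values come back as human-readable strings like 42.1G, which are awkward to compare against a job's memory request in scripts. A small helper to normalize them to bytes can be sketched as follows; it assumes sacct's usual K/M/G/T binary suffixes and treats a bare number as bytes.

```python
# Convert sacct's human-readable MaxRSS strings (e.g. "42.1G", "1536K")
# into bytes so they can be compared against the job's memory request.
# Assumes binary (1024-based) suffixes; a plain number is taken as bytes.

_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def maxrss_to_bytes(value: str) -> int:
    value = value.strip()
    if not value:
        return 0
    suffix = value[-1].upper()
    if suffix in _SUFFIXES:
        return int(float(value[:-1]) * _SUFFIXES[suffix])
    return int(float(value))

print(maxrss_to_bytes("42.1G"))  # step-level peak RSS in bytes
```

Feed this the step-level rows (12345.0, 12345.1), not the allocation row, since those carry the real usage numbers.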
How Slurm Decides What Runs Next
When multiple jobs are PENDING and resources free up, Slurm must choose which job to schedule first. The answer: the multi-factor priority formula.
The Priority Formula
```
Job_priority =
    PriorityWeightAge       * age_factor
  + PriorityWeightFairshare * fairshare_factor
  + PriorityWeightJobSize   * jobsize_factor
  + PriorityWeightQOS       * qos_factor
  + PriorityWeightPartition * partition_factor
```
Each factor is normalized to [0, 1], then multiplied by its weight (configured in slurm.conf). The resulting scores are summed into a single priority number. Higher priority = scheduled sooner.
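As a concrete illustration, the weighted sum can be sketched in a few lines of Python. The weight values below are made-up examples, not Slurm defaults; real weights are set by administrators in slurm.conf.

```python
# Sketch of Slurm's multifactor priority sum. Each factor is a
# normalized value in [0, 1] computed by the scheduler; the weights
# here are illustrative placeholders, not real slurm.conf defaults.

WEIGHTS = {
    "age": 1000,
    "fairshare": 10000,
    "jobsize": 100,
    "qos": 2000,
    "partition": 500,
}

def job_priority(factors: dict) -> int:
    # Missing factors count as 0 (no contribution).
    return int(sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS))

# A job that has waited a while but belongs to a heavy user:
print(job_priority({"age": 0.8, "fairshare": 0.2, "jobsize": 0.1,
                    "qos": 0.5, "partition": 1.0}))  # -> 4310
```

Note how the fair-share weight dwarfs the others in this example: even a well-aged job from a heavy user can be outranked by a fresh job from a light user.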
Priority Factors Explained
- Age: How long the job has been waiting. Increases linearly until PriorityMaxAge (typically 7 days), then saturates. Prevents indefinite starvation.
- Fair-Share: Based on your historical resource consumption versus your allocation target. If you've used less than your share, your score is high; if you've been consuming heavily, it drops. This is the main fairness mechanism.
- Job Size: Larger jobs get a slight priority boost because they're harder to schedule. This encourages the scheduler to run them when large blocks of resources are available.
- QOS Priority: An administrator-assigned priority per Quality of Service level. A high QOS might add 1000 priority points.
- Partition Priority: Some partitions are more valuable than others. A gpu partition might have a higher base priority than cpu.

To see how these factors combine for a specific pending job, run sprio -j <jobid>, which prints the per-factor breakdown.
Backfill Scheduling
Strict priority ordering creates a problem: if the highest-priority job needs 64 nodes and only 4 are free, all 4 nodes sit idle until 60 more become available. Backfill solves this.
The backfill scheduler looks at lower-priority jobs and asks: “can this small job run and finish before the large job’s resources become available?” If yes, it starts the small job now, improving utilization without delaying the large job.
This is why specifying accurate --time limits matters. If your job requests --time=7-00:00:00 but actually finishes in 2 hours, the scheduler plans around a 7-day reservation: your job cannot slip into short gaps, and other jobs cannot be backfilled around it.
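The core backfill test reduces to a single comparison: a lower-priority job may start now only if its requested time limit ends before the blocked top job's reserved start. The sketch below is a deliberately simplified model (one reservation, requested limits taken at face value), not Slurm's actual implementation.

```python
# Simplified backfill check: a lower-priority job may jump the queue
# only if, based on its *requested* time limit, it would finish before
# the high-priority job's resource reservation begins. This models why
# padded --time requests destroy backfill opportunities.

def can_backfill(now_h: float, requested_limit_h: float,
                 reservation_start_h: float) -> bool:
    return now_h + requested_limit_h <= reservation_start_h

# The reservation for a large 64-node job starts 4 hours from now.
print(can_backfill(0.0, 2.0, 4.0))    # honest 2-hour request: True
print(can_backfill(0.0, 168.0, 4.0))  # padded 7-day request: False
```

The scheduler only ever sees the requested limit, never the actual runtime, which is why the 7-day request is rejected even though the job would really finish in 2 hours.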
Accurate Time Limits Matter
Overestimating --time hurts cluster utilization. The backfill scheduler
reserves resources for the full requested duration. A 2-hour job with a 7-day
limit blocks backfill for 7 days' worth of scheduling decisions.
Fair-Share: The Fairness Engine
Fair-share scheduling prevents any single user or group from monopolizing the cluster. Each user (or account) has a target share of resources. As you consume resources, your fair-share score drops, and your future jobs get lower priority.
The decay is configurable: recent usage matters more than old usage. A common setup uses a half-life of 14 days — your heavy usage two weeks ago has half the impact of yesterday’s usage.
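The half-life decay can be written directly: usage recorded t days ago is weighted by 0.5 ** (t / half_life). A minimal sketch, assuming the 14-day half-life from the example above (the actual value is set by PriorityDecayHalfLife in slurm.conf):

```python
# Exponential half-life decay of historical usage, as applied by
# fair-share accounting. With a 14-day half-life, usage from 14 days
# ago counts half as much as usage from right now.

HALF_LIFE_DAYS = 14.0  # example value; set via PriorityDecayHalfLife

def decayed_usage(raw_usage: float, age_days: float) -> float:
    return raw_usage * 0.5 ** (age_days / HALF_LIFE_DAYS)

print(decayed_usage(1000.0, 0.0))   # today: full weight -> 1000.0
print(decayed_usage(1000.0, 14.0))  # two weeks ago -> 500.0
print(decayed_usage(1000.0, 28.0))  # four weeks ago -> 250.0
```

The practical consequence: a burst of heavy usage hurts your priority immediately but fades on its own, so backing off for a couple of weeks restores most of your fair-share score.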
```shell
# Check your fair-share score
sshare -u $USER
# Account   User   RawShares  NormShares  RawUsage  EffectvUsage  FairShare
# mygroup   abhik  100        0.10        45231     0.15          0.667
```
A fair-share value of 1.0 means you’ve used less than your share (high priority). A value near 0.0 means you’ve consumed well beyond your allocation (low priority).
Common Pitfalls
1. Not checking squeue --start
If your job has been PENDING for hours, squeue --start shows the estimated start time. If it says N/A, the job likely has an unsatisfiable request.
2. Ignoring sacct for debugging
When a job fails, check sacct -j <id> --format=State,ExitCode,MaxRSS,Elapsed. OOM kills show as OUT_OF_MEMORY, timeouts as TIMEOUT.
3. Requesting maximum time limits
Always request realistic time limits. --time=7-00:00:00 for a 3-hour job hurts everyone by blocking backfill scheduling.
4. Not monitoring fair-share
If your jobs are consistently low priority, check sshare. You may be exceeding your allocation. Spread submissions over time or coordinate with your team.
Key Takeaways
- Three monitoring commands — squeue (queue), sinfo (cluster), sacct (history). Use all three.
- Priority is multi-factor — age, fair-share, job size, QOS, and partition all contribute. Fair-share usually dominates.
- Backfill improves utilization — accurate time limits help small jobs fill gaps while large jobs wait for resources.
- Fair-share enforces fairness — recent heavy usage lowers your priority. Spread consumption to maintain fair-share score.
Related Concepts
- Slurm Fundamentals: Core commands, job lifecycle, and partitions
- Slurm GPU Allocation: GPU-specific resource management
- Distributed Parallelism: Training strategies that depend on Slurm resource management
Further Reading
- Slurm Priority/Multifactor Plugin - Official documentation on how Slurm calculates job priority
- Slurm Fair-share Scheduling - SchedMD's explanation of the fair-tree algorithm
- squeue, sinfo, sacct man pages - Complete command reference for Slurm monitoring tools
