Slurm Accounting and Resource Tracking

How Slurm tracks resource consumption through account hierarchies, TRES billing, and resource limits — sacctmgr, sreport, and the association model explained.

Why Accounting Matters

On a shared HPC cluster, resources are finite. Without accounting, a single user running large training jobs could starve everyone else. Slurm’s accounting subsystem solves this by tracking who uses what, enforcing per-user and per-group limits, and feeding usage data into the fair-share scheduler.

Accounting also enables capacity planning. By analyzing historical usage with sreport, administrators can justify hardware purchases, identify underutilized partitions, and set allocation targets that reflect actual demand.

The Association Model

Slurm organizes users into a tree: cluster → account → user. Every user must belong to an account, and accounts can be nested. This hierarchy is managed by sacctmgr, the accounting management command.

# Create an account under the root
sacctmgr add account ml-group Description="Machine Learning Team"

# Create a child account
sacctmgr add account nlp-team parent=ml-group

# Add a user to an account
sacctmgr add user alice Account=ml-group

# View the full association tree
sacctmgr show associations format=Cluster,Account,User,Share,GrpTRES

Every row in this tree is called an association. An association ties a user to an account on a specific cluster and defines what resources they can consume. A user can belong to multiple accounts (e.g., a professor in both cs-dept and interdisciplinary-lab), but one association is marked as the default.

Parent-Child Inheritance

Resource limits cascade downward. If ml-group has a GrpTRES limit of 20 GPUs, all users under ml-group combined cannot exceed 20 GPUs — even if each user’s individual limit is set to 16.
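As a sketch of that cascade (reusing the ml-group and nlp-team accounts from above; these are admin commands that require a running slurmdbd):

```shell
# Parent cap: everything under ml-group shares 20 GPUs total
sacctmgr modify account ml-group set GrpTRES=gres/gpu=20

# A child limit can nominally be set near the parent's cap, but the
# parent cap still wins: nlp-team jobs can never push the combined
# ml-group total past 20 GPUs
sacctmgr modify account nlp-team set GrpTRES=gres/gpu=16
```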

TRES: Trackable Resources

TRES (Trackable Resources) is Slurm’s unified system for metering different resource types. Instead of tracking CPUs, GPUs, and memory separately, everything is expressed in TRES units with configurable billing weights.

Billing Weights

In slurm.conf, administrators define how expensive each resource type is:

TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=16.0"

This says: 1 GPU costs as much as 16 CPUs. A job requesting 4 GPUs and 32 CPUs is billed as 4 × 16 + 32 × 1 = 96 billing units. This normalization lets fair-share treat CPU-heavy and GPU-heavy jobs equitably.
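The arithmetic is easy to verify by hand; a minimal sketch using the weights above (memory term omitted for simplicity):

```shell
# Billing units for a job requesting 4 GPUs and 32 CPUs under
# TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=16.0"
awk 'BEGIN {
    gpus = 4; cpus = 32
    print gpus * 16.0 + cpus * 1.0    # 4 GPUs at weight 16 + 32 CPUs at weight 1
}'
# prints 96
```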

Why TRES Matters for Fair-Share

Without TRES weighting, usage is effectively counted in CPU-time alone, so a user running 100 small CPU jobs would accumulate the same “usage” as someone running 100 GPU jobs — even though the GPU jobs consume far more expensive hardware. TRES billing corrects this imbalance.

# Check TRES billing for a completed job
sacct -j 12345 --format=JobID,TRESUsageInTot,TRESUsageOutTot

Resource Limits

Limits are the enforcement layer. They prevent any single user or group from consuming more than their allocation. Limits are set per-association via sacctmgr.

Group Limits (GrpTRES)

Cap the total resources consumed by an account and all its users combined:

# ml-group can use at most 20 GPUs simultaneously
sacctmgr modify account ml-group set GrpTRES=gres/gpu=20,cpu=160

When a job would push the group total over GrpTRES, the job stays PENDING with reason AssocGrpCpuLimit or AssocGrpGRES.

Per-Job Limits (MaxTRES)

Cap what a single job can request:

# No single job from alice can request more than 8 GPUs
sacctmgr modify user alice set MaxTRESPerJob=gres/gpu=8,cpu=64

Wall Time Limits

# ml-group users cannot run jobs longer than 3 days
sacctmgr modify account ml-group set MaxWall=3-00:00:00

# Total accumulated wall time for the group
sacctmgr modify account ml-group set GrpWall=30-00:00:00

Limits Are Hard Caps

Unlike fair-share (which adjusts priority softly), GrpTRES and MaxTRES are hard limits. A job that would exceed them is blocked immediately — no amount of waiting will start it. Check sacctmgr show associations if your jobs are stuck with an AssocGrp reason.
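A quick diagnostic sketch for that situation (cluster commands; `%r` is squeue's pending-reason field):

```shell
# Which of my jobs are blocked, and why?
squeue -u $USER -o "%.10i %.9P %.8T %r"

# Compare against the group limits on my associations
sacctmgr show associations where user=$USER format=Account,User,GrpTRES
```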

Usage Reporting

sreport — Cluster and Account Utilization

The primary tool for understanding who used what over time:

# Cluster utilization for the last month
sreport cluster utilization Start=2026-02-15 End=2026-03-15

# Top 10 users by GPU hours
sreport user TopUsage Start=2026-02-15 End=2026-03-15 \
    TopCount=10 -t Hours --tres=gres/gpu

# Account-level usage breakdown
sreport cluster AccountUtilizationByUser Start=2026-02-15 \
    End=2026-03-15 Accounts=ml-group

sacct — Job-Level Detail

While sreport gives aggregate views, sacct drills into individual jobs:

# See TRES billing for recent jobs
sacct -u $USER --format=JobID,JobName,TRESUsageInTot,Elapsed,State \
    --starttime=2026-03-01

sshare — Fair-Share Impact

Usage feeds directly into fair-share scheduling. Heavy consumers get lower priority:

sshare -u $USER -l

# Account   User   RawShares  NormShares  RawUsage  NormUsage  EffectvUsage  FairShare
# ml-group  alice        100        0.10     45231       0.15          0.12      0.667

A FairShare value near 1.0 means you’ve used less than your share. Near 0.0 means you’ve consumed well beyond your allocation and your jobs will be deprioritized.
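How quickly the factor decays depends on which algorithm the cluster runs. Under Slurm's classic (pre-FairTree) algorithm the factor is approximately F = 2^(-EffectvUsage / NormShares); FairTree, the modern default, ranks associations differently. A minimal sketch of the classic decay, assuming that formula:

```shell
# Classic fair-share decay: F = 2^(-EffectvUsage / NormShares)
# (assumption: classic multifactor algorithm, not FairTree)
awk 'BEGIN {
    shares = 0.10                          # NormShares for this association
    for (i = 0; i <= 3; i++) {
        u = i * 0.1                        # EffectvUsage
        printf "usage=%.1f  fairshare=%.3f\n", u, 2 ^ (-u / shares)
    }
}'
```

With a 0.10 share, using exactly your share drops the factor to 0.5, and each additional share's worth of usage halves it again.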

Common Pitfalls

1. Not Checking Association Limits

Your job is PENDING with reason AssocGrpGRES but you have free GPUs on the cluster. The problem: your account’s GrpTRES limit is exhausted. Check with sacctmgr show associations where user=$USER format=Account,GrpTRES.

2. Ignoring TRES Billing Weights

You submit 10 GPU jobs and wonder why your fair-share dropped much faster than a colleague who submitted 50 CPU jobs. Each GPU might be billed at 16x a CPU. Check sacctmgr show config | grep TRESBillingWeights.

3. Default Account Confusion

If you belong to multiple accounts, sbatch uses your default account unless you specify --account=. Running jobs under the wrong account skews fair-share for both groups.
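A quick sketch of checking and overriding the default (the job script name `train.sh` is just a placeholder):

```shell
# Which account is my default?
sacctmgr show user $USER format=User,DefaultAccount

# Submit explicitly under a non-default account
sbatch --account=nlp-team train.sh
```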

4. Not Using sreport for Capacity Planning

Many teams request more allocation without data. sreport shows actual utilization — if your group only uses 40% of its GPU allocation, requesting more won’t help.

Key Takeaways

  1. Associations form a tree — cluster → account → user. Limits cascade from parent to child. Use sacctmgr to manage.

  2. TRES normalizes billing — GPUs, CPUs, and memory are weighted so fair-share treats all resource types equitably.

  3. GrpTRES are hard limits — unlike fair-share (soft priority), group limits block jobs immediately when exceeded.

  4. sreport for capacity planning — use historical utilization data to justify allocations and identify waste.
