Slurm Accounting and Resource Tracking

How Slurm tracks resource consumption through account hierarchies, TRES billing, and resource limits — sacctmgr, sreport, and the association model explained.

Why Accounting Matters

On a shared HPC cluster, resources are finite. Without accounting, a single user running large training jobs could starve everyone else. Slurm’s accounting subsystem solves this by tracking who uses what, enforcing per-user and per-group limits, and feeding usage data into the fair-share scheduler.

Accounting also enables capacity planning. By analyzing historical usage with sreport, administrators can justify hardware purchases, identify underutilized partitions, and set allocation targets that reflect actual demand.

The Association Model

Slurm organizes users into a tree: cluster → account → user. Every user must belong to an account, and accounts can be nested. This hierarchy is managed by sacctmgr, the accounting management command.

# Create an account under the root
sacctmgr add account ml-group Description="Machine Learning Team"

# Create a child account
sacctmgr add account nlp-team parent=ml-group

# Add a user to an account
sacctmgr add user alice Account=ml-group

# View the full association tree
sacctmgr show associations format=Cluster,Account,User,Share,GrpTRES

Every row in this tree is called an association. An association ties a user to an account on a specific cluster and defines what resources they can consume. A user can belong to multiple accounts (e.g., a professor in both cs-dept and interdisciplinary-lab), but one association is marked as the default.

Parent-Child Inheritance

Resource limits cascade downward. If ml-group has a GrpTRES limit of 20 GPUs, all users under ml-group combined cannot exceed 20 GPUs — even if each user’s individual limit is set to 16.
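As a sketch of that cascade (reusing the ml-group and nlp-team accounts from above; these are admin commands that require a running slurmdbd):

```shell
# Parent cap: everything under ml-group shares 20 GPUs total
sacctmgr modify account ml-group set GrpTRES=gres/gpu=20

# A child limit can nominally be set near the parent's cap, but the
# parent cap still wins: nlp-team jobs can never push the combined
# ml-group total past 20 GPUs
sacctmgr modify account nlp-team set GrpTRES=gres/gpu=16
```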

TRES: Trackable Resources

TRES (Trackable Resources) is Slurm’s unified system for metering different resource types. Instead of tracking CPUs, GPUs, and memory separately, everything is expressed in TRES units with configurable billing weights.

Billing Weights

In slurm.conf, administrators define how expensive each resource type is:

TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=16.0"

This says: 1 GPU costs as much as 16 CPUs. A job requesting 4 GPUs and 32 CPUs is billed as 4 × 16 + 32 × 1 = 96 billing units. This normalization lets fair-share treat CPU-heavy and GPU-heavy jobs equitably.
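The arithmetic is easy to verify by hand; a minimal sketch using the weights above (memory term omitted for simplicity):

```shell
# Billing units for a job requesting 4 GPUs and 32 CPUs under
# TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=16.0"
awk 'BEGIN {
    gpus = 4; cpus = 32
    print gpus * 16.0 + cpus * 1.0    # 4 GPUs at weight 16 + 32 CPUs at weight 1
}'
# prints 96
```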

Why TRES Matters for Fair-Share

Without TRES weighting, usage is effectively counted in CPU-time alone, so a user running 100 small CPU jobs would accumulate the same “usage” as someone running 100 GPU jobs — even though the GPU jobs consume far more expensive hardware. TRES billing corrects this imbalance.

# Check TRES billing for a completed job
sacct -j 12345 --format=JobID,TRESUsageInTot,TRESUsageOutTot

Resource Limits

Limits are the enforcement layer. They prevent any single user or group from consuming more than their allocation. Limits are set per-association via sacctmgr.

Group Limits (GrpTRES)

Cap the total resources consumed by an account and all its users combined:

# ml-group can use at most 20 GPUs simultaneously
sacctmgr modify account ml-group set GrpTRES=gres/gpu=20,cpu=160

When a job would push the group total over GrpTRES, the job stays PENDING with reason AssocGrpCpuLimit or AssocGrpGRES.

Per-Job Limits (MaxTRES)

Cap what a single job can request:

# No single job from alice can request more than 8 GPUs
sacctmgr modify user alice set MaxTRESPerJob=gres/gpu=8,cpu=64

Wall Time Limits

# ml-group users cannot run jobs longer than 3 days
sacctmgr modify account ml-group set MaxWall=3-00:00:00

# Total accumulated wall time for the group
sacctmgr modify account ml-group set GrpWall=30-00:00:00

Limits Are Hard Caps

Unlike fair-share (which adjusts priority softly), GrpTRES and MaxTRES are hard limits. A job that would exceed them is blocked immediately — no amount of waiting will start it. Check sacctmgr show associations if your jobs are stuck with an AssocGrp reason.
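A quick diagnostic sketch for that situation (cluster commands; `%r` is squeue's pending-reason field):

```shell
# Which of my jobs are blocked, and why?
squeue -u $USER -o "%.10i %.9P %.8T %r"

# Compare against the group limits on my associations
sacctmgr show associations where user=$USER format=Account,User,GrpTRES
```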

Usage Reporting

sreport — Cluster and Account Utilization

The primary tool for understanding who used what over time:

# Cluster utilization for the last month
sreport cluster utilization Start=2026-02-15 End=2026-03-15

# Top 10 users by GPU hours
sreport user TopUsage Start=2026-02-15 End=2026-03-15 \
    TopCount=10 -t Hours --tres=gres/gpu

# Account-level usage breakdown
sreport cluster AccountUtilizationByUser Start=2026-02-15 \
    End=2026-03-15 Accounts=ml-group

sacct — Job-Level Detail

While sreport gives aggregate views, sacct drills into individual jobs:

# See TRES billing for recent jobs
sacct -u $USER --format=JobID,JobName,TRESUsageInTot,Elapsed,State \
    --starttime=2026-03-01

sshare — Fair-Share Impact

Usage feeds directly into fair-share scheduling. Heavy consumers get lower priority:

sshare -u $USER -l

# Account   User   RawShares  NormShares  RawUsage  NormUsage  EffectvUsage  FairShare
# ml-group  alice        100        0.10     45231       0.15          0.12      0.667

A FairShare value near 1.0 means you’ve used less than your share. Near 0.0 means you’ve consumed well beyond your allocation and your jobs will be deprioritized.
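How quickly the factor decays depends on which algorithm the cluster runs. Under Slurm's classic (pre-FairTree) algorithm the factor is approximately F = 2^(-EffectvUsage / NormShares); FairTree, the modern default, ranks associations differently. A minimal sketch of the classic decay, assuming that formula:

```shell
# Classic fair-share decay: F = 2^(-EffectvUsage / NormShares)
# (assumption: classic multifactor algorithm, not FairTree)
awk 'BEGIN {
    shares = 0.10                          # NormShares for this association
    for (i = 0; i <= 3; i++) {
        u = i * 0.1                        # EffectvUsage
        printf "usage=%.1f  fairshare=%.3f\n", u, 2 ^ (-u / shares)
    }
}'
```

With a 0.10 share, using exactly your share drops the factor to 0.5, and each additional share's worth of usage halves it again.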

Common Pitfalls

1. Not Checking Association Limits

Your job is PENDING with reason AssocGrpGRES but you have free GPUs on the cluster. The problem: your account’s GrpTRES limit is exhausted. Check with sacctmgr show associations where user=$USER format=Account,GrpTRES.

2. Ignoring TRES Billing Weights

You submit 10 GPU jobs and wonder why your fair-share dropped much faster than a colleague who submitted 50 CPU jobs. Each GPU might be billed at 16x a CPU. Check sacctmgr show config | grep TRESBillingWeights.

3. Default Account Confusion

If you belong to multiple accounts, sbatch uses your default account unless you specify --account=. Running jobs under the wrong account skews fair-share for both groups.
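A quick sketch of checking and overriding the default (the job script name `train.sh` is just a placeholder):

```shell
# Which account is my default?
sacctmgr show user $USER format=User,DefaultAccount

# Submit explicitly under a non-default account
sbatch --account=nlp-team train.sh
```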

4. Not Using sreport for Capacity Planning

Many teams request more allocation without data. sreport shows actual utilization — if your group only uses 40% of its GPU allocation, requesting more won’t help.

Key Takeaways

  1. Associations form a tree — cluster → account → user. Limits cascade from parent to child. Use sacctmgr to manage.

  2. TRES normalizes billing — GPUs, CPUs, and memory are weighted so fair-share treats all resource types equitably.

  3. GrpTRES are hard limits — unlike fair-share (soft priority), group limits block jobs immediately when exceeded.

  4. sreport for capacity planning — use historical utilization data to justify allocations and identify waste.
