Why Accounting Matters
On a shared HPC cluster, resources are finite. Without accounting, a single user running large training jobs could starve everyone else. Slurm’s accounting subsystem solves this by tracking who uses what, enforcing per-user and per-group limits, and feeding usage data into the fair-share scheduler.
Accounting also enables capacity planning. By analyzing historical usage with sreport, administrators can justify hardware purchases, identify underutilized partitions, and set allocation targets that reflect actual demand.
The Association Model
Slurm organizes users into a tree: cluster → account → user. Every user must belong to an account, and accounts can be nested. This hierarchy is managed by sacctmgr, the accounting management command.
# Create an account under the root
sacctmgr add account ml-group Description="Machine Learning Team"

# Create a child account
sacctmgr add account nlp-team parent=ml-group

# Add a user to an account
sacctmgr add user alice Account=ml-group

# View the full association tree
sacctmgr show associations format=Cluster,Account,User,Share,GrpTRES
Every row in this tree is called an association. An association ties a user to an account on a specific cluster and defines what resources they can consume. A user can belong to multiple accounts (e.g., a professor in both cs-dept and interdisciplinary-lab), but one association is marked as the default.
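The default association can be inspected and changed with sacctmgr as well. A minimal sketch, assuming the account names from above (requires a live Slurm cluster with accounting enabled):

```shell
# Add alice to a second account; her existing default is kept
sacctmgr add user alice Account=interdisciplinary-lab

# Show which account sbatch will charge by default
sacctmgr show user alice format=User,DefaultAccount

# Change the default association
sacctmgr modify user alice set DefaultAccount=interdisciplinary-lab
```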
Parent-Child Inheritance
Resource limits cascade downward. If ml-group has a GrpTRES limit of 20 GPUs, all users under ml-group combined cannot exceed 20 GPUs — even if each user’s individual limit is set to 16.
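The cascade can be made concrete with two sacctmgr calls; this is a sketch using the limits from the paragraph above (requires a live Slurm cluster):

```shell
# Cap the parent account: all child users combined share 20 GPUs
sacctmgr modify account ml-group set GrpTRES=gres/gpu=20

# A per-user limit of 16 is still bounded by the parent's cap of 20:
# two such users cannot hold 16 GPUs each at the same time
sacctmgr modify user alice set GrpTRES=gres/gpu=16
```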
TRES: Trackable Resources
TRES (Trackable Resources) is Slurm’s unified system for metering different resource types. Instead of tracking CPUs, GPUs, and memory separately, everything is expressed in TRES units with configurable billing weights.
Billing Weights
In slurm.conf, administrators define how expensive each resource type is:
TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=16.0"
This says: 1 GPU costs as much as 16 CPUs. A job requesting 4 GPUs and 32 CPUs is billed as 4 × 16 + 32 × 1 = 96 billing units. This normalization lets fair-share treat CPU-heavy and GPU-heavy jobs equitably.
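The arithmetic can be checked directly in the shell. This is only a sketch of the weighting from the line above (memory omitted for simplicity); Slurm computes the billing value internally from the job request:

```shell
# Weights from TRESBillingWeights: CPU=1.0, GRES/gpu=16.0
gpus=4
cpus=32

# Weighted sum: billing units for a 4-GPU, 32-CPU job
billing=$(( gpus * 16 + cpus * 1 ))
echo "$billing"   # prints 96
```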
Why TRES Matters for Fair-Share
Without TRES weighting, a user running 100 small CPU jobs would accumulate the same “usage” as someone running 100 GPU jobs — despite consuming far fewer actual resources. TRES billing corrects this imbalance.
# Check TRES billing for a completed job
sacct -j 12345 --format=JobID,TRESUsageInTot,TRESUsageOutTot
Resource Limits
Limits are the enforcement layer. They prevent any single user or group from consuming more than their allocation. Limits are set per-association via sacctmgr.
Group Limits (GrpTRES)
Cap the total resources consumed by an account and all its users combined:
# ml-group can use at most 20 GPUs simultaneously
sacctmgr modify account ml-group set GrpTRES=gres/gpu=20,cpu=160
When a job would push the group total over GrpTRES, the job stays PENDING with reason AssocGrpCpuLimit or AssocGrpGRES.
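To see why a job is being held, the pending reason can be printed alongside the job state. A sketch — the format string and job ID are illustrative:

```shell
# Show job ID, name, state, and pending reason for your queued jobs
squeue -u $USER --format="%.10i %.12j %.8T %.20r"

# Or inspect a single job in detail
scontrol show job 12345 | grep -i reason
```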
Per-Job Limits (MaxTRES)
Cap what a single job can request:
# No single job from alice can request more than 8 GPUs
sacctmgr modify user alice set MaxTRESPerJob=gres/gpu=8,cpu=64
Wall Time Limits
# ml-group users cannot run jobs longer than 3 days
sacctmgr modify account ml-group set MaxWall=3-00:00:00

# Total accumulated wall time for the group
sacctmgr modify account ml-group set GrpWall=30-00:00:00
Limits Are Hard Caps
Unlike fair-share (which adjusts priority softly), GrpTRES and MaxTRES are
hard limits. A job that would exceed them is blocked immediately — no
amount of waiting will start it. Check sacctmgr show associations if your
jobs are stuck with an AssocGrp reason.
Usage Reporting
sreport — Cluster and Account Utilization
The primary tool for understanding who used what over time:
# Cluster utilization for the last month
sreport cluster utilization Start=2026-02-15 End=2026-03-15

# Top 10 users by GPU hours
sreport user TopUsage Start=2026-02-15 End=2026-03-15 \
    TopCount=10 -t Hours --tres=gres/gpu

# Account-level usage breakdown
sreport cluster AccountUtilizationByUser Start=2026-02-15 \
    End=2026-03-15 Accounts=ml-group
sacct — Job-Level Detail
While sreport gives aggregate views, sacct drills into individual jobs:
# See TRES billing for recent jobs
sacct -u $USER --format=JobID,JobName,TRESUsageInTot,Elapsed,State \
    --starttime=2026-03-01
sshare — Fair-Share Impact
Usage feeds directly into fair-share scheduling. Heavy consumers get lower priority:
sshare -u $USER -l
# Account   User   RawShares  NormShares  RawUsage  NormUsage  EffectvUsage  FairShare
# ml-group  alice  100        0.10       45231     0.15       0.12          0.667
A FairShare value near 1.0 means you’ve used less than your share. Near 0.0 means you’ve consumed well beyond your allocation and your jobs will be deprioritized.
Common Pitfalls
1. Not Checking Association Limits
Your job is PENDING with reason AssocGrpGRES but you have free GPUs on the cluster. The problem: your account’s GrpTRES limit is exhausted. Check with sacctmgr show associations where user=$USER format=Account,GrpTRES.
2. Ignoring TRES Billing Weights
You submit 10 GPU jobs and wonder why your fair-share dropped much faster than a colleague who submitted 50 CPU jobs. Each GPU might be billed at 16x a CPU. Check the partition configuration with scontrol show partition | grep TRESBillingWeights.
3. Default Account Confusion
If you belong to multiple accounts, sbatch uses your default account unless you specify --account=. Running jobs under the wrong account skews fair-share for both groups.
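A quick check and an explicit override look like this — a sketch, with the account and script names as illustrative placeholders:

```shell
# Which account will sbatch charge by default?
sacctmgr show user $USER format=User,DefaultAccount

# Submit explicitly under a non-default account
sbatch --account=nlp-team train.sh
```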
4. Not Using sreport for Capacity Planning
Many teams request more allocation without data. sreport shows actual utilization — if your group only uses 40% of its GPU allocation, requesting more won’t help.
Key Takeaways
- Associations form a tree — cluster → account → user. Limits cascade from parent to child; use sacctmgr to manage them.
- TRES normalizes billing — GPUs, CPUs, and memory are weighted so fair-share treats all resource types equitably.
- GrpTRES limits are hard caps — unlike fair-share (a soft priority adjustment), group limits block jobs immediately when exceeded.
- Use sreport for capacity planning — historical utilization data justifies allocation requests and identifies waste.
Related Concepts
- Slurm Fundamentals: Core commands, job lifecycle, and the three daemons (including slurmdbd)
- Slurm Resource Management: Priority formula, fair-share scoring, and backfill scheduling
- Slurm GPU Allocation: GPU-specific resource requests and CUDA_VISIBLE_DEVICES mapping
Further Reading
- Slurm Accounting and Resource Limits - Official SchedMD documentation on accounting configuration
- sacctmgr Man Page - Complete reference for the Slurm account manager command
- Slurm TRES - Trackable Resources documentation covering billing weights and limits
- sreport Man Page - Usage reporting tool reference
