Linux cgroups: Resource Limits for Processes

The Noisy Neighbor Problem

Linux cgroups exist because of the noisy-neighbor problem. Imagine an apartment building where one tenant throws a loud party every night: without rules (limits), that tenant ruins everyone else's experience. On a Linux system, the noisy neighbor might be a process that:

Consumes 100% CPU, starving other processes
Allocates all available memory, triggering the OOM killer
Saturates disk I/O, making the system unresponsive

Control Groups (cgroups) solve this by letting you set resource limits on groups of processes. While namespaces provide isolation (hiding resources), cgroups provide allocation (limiting resources).

Analogy: Budget Allocation

Think of cgroups like departmental budgets in a company:

CPU quota = Time budget (hours employees can work)
Memory limit = Office space (square footage allocated)
I/O bandwidth = Shared equipment usage time
PIDs limit = Headcount cap

Departments (process groups) must work within their budgets regardless of how much total resource exists.

cgroups Architecture

cgroups organize processes into a hierarchy where each node can have resource limits. Understanding this structure is key to effective container resource management.

Cgroup Hierarchy (v2, unified)

Figure: an example unified hierarchy, with the controllers enabled at each node.

/sys/fs/cgroup (cpu, memory, io, +1 more)
  system.slice (cpu, memory, io)
    docker.service (cpu, memory), 1 process
    sshd.service (cpu, memory), 1 process
  user.slice (cpu, memory, io, +1 more)
    user-1000.slice (cpu, memory, pids)
  docker (cpu, memory, io, +1 more)
    abc123... (2 processes)
    def456... (1 process)

cgroups v1

• Separate hierarchy per controller
• Process can be in different groups per controller
• More flexible but complex
• Legacy, but still widely used

cgroups v2

• Single unified hierarchy
• Process belongs to exactly one cgroup
• Controllers enabled per-cgroup
• Default on most modern distributions (chosen by the distro and systemd, not by the kernel version)

Key Concepts

Concept	Description
Hierarchy	Tree structure of cgroups (directories in `/sys/fs/cgroup`)
Controller	A resource type that can be limited (CPU, memory, I/O, PIDs)
Cgroup	A node in the hierarchy, a directory containing limit files
Task	A process or thread assigned to a cgroup

The cgroup Filesystem

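cgroups are controlled through a pseudo-filesystem, typically mounted at /sys/fs/cgroup. Each cgroup is a directory of interface files. Limit files such as memory.max exist in every cgroup except the root, so the listing below is for a child cgroup:

$ ls /sys/fs/cgroup/system.slice/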
cgroup.controllers      cpu.pressure         memory.current
cgroup.max.depth        cpu.stat             memory.max
cgroup.max.descendants  io.max               memory.min
cgroup.procs            io.pressure          pids.current
cgroup.subtree_control  io.stat              pids.max

To limit a process, you write values to these files:

# Create a cgroup
mkdir /sys/fs/cgroup/myapp

# Set memory limit to 512MB
echo 536870912 > /sys/fs/cgroup/myapp/memory.max

# Set CPU limit to 50% of one core
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max

# Add process to the cgroup
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs
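
In scripts, these writes tend to fail quietly when an interface file is missing, which usually means the controller is not enabled in the parent's cgroup.subtree_control. The following is a minimal sketch of a guarded write; `set_limit` is a hypothetical helper name, not part of any cgroup tooling.

```shell
# set_limit is a hypothetical helper, not part of any cgroup tooling.
# Usage: set_limit <cgroup-dir> <interface-file> <value>
set_limit() {
    dir=$1 file=$2 value=$3
    if [ ! -d "$dir" ]; then
        echo "set_limit: no such cgroup: $dir" >&2
        return 1
    fi
    # The interface file only exists when its controller is enabled
    # in the parent's cgroup.subtree_control.
    if [ ! -e "$dir/$file" ]; then
        echo "set_limit: $file missing (controller not enabled?)" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$dir/$file"
}

# Example (as root): set_limit /sys/fs/cgroup/myapp memory.max 536870912
```

Failing early with a message is a design choice for this sketch; a plain redirect would report only a generic "No such file or directory" error.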

cgroups v1 vs v2

Linux has two versions of cgroups, with v2 being the modern default:

Feature	cgroups v1	cgroups v2
Hierarchy	Separate per controller	Single unified
Process membership	Can differ per controller	One cgroup only
Delegation	Complex, error-prone	Clean subtree delegation
Status (2024+)	Legacy	Default
Docker support	Full	Full (Docker 20.10+)

Why Unified Hierarchy Matters

In v1, a process could be in /cpu/app for CPU limits but /memory/web for memory limits. This created confusion and made resource accounting inconsistent.

v2's unified hierarchy means a process is in exactly one cgroup, and that cgroup can have multiple controllers enabled. This is simpler to manage and reason about.

# v2: Enable controllers for the children of a cgroup
# (this fails while the cgroup itself still contains processes)
echo "+cpu +memory +io" > /sys/fs/cgroup/myapp/cgroup.subtree_control

# Now child cgroups can use these controllers
mkdir /sys/fs/cgroup/myapp/worker1
echo 268435456 > /sys/fs/cgroup/myapp/worker1/memory.max
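
The difference in process membership is visible in /proc/<pid>/cgroup, which lists one line per hierarchy. A v2 entry has the form `0::/path`, while v1 entries look like `N:controller:/path`. The sketch below classifies that content; `cgroup_mode` is a hypothetical helper name.

```shell
# cgroup_mode is a hypothetical helper; it reads /proc/<pid>/cgroup
# content on stdin and prints v1, v2, hybrid, or unknown.
cgroup_mode() {
    awk -F: '
        $1 == "0" && $2 == "" { v2 = 1; next }   # v2 entry: 0::/path
        $2 != ""              { v1 = 1 }          # v1 entry: N:controller:/path
        END {
            if (v1 && v2) print "hybrid"
            else if (v2)  print "v2"
            else if (v1)  print "v1"
            else          print "unknown"
        }'
}

# Example: cgroup_mode < /proc/self/cgroup
```

A result of "hybrid" means v1 hierarchies are mounted alongside the unified one, which some older systemd setups do.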

Resource Controllers

CPU Controller

The CPU controller limits how much CPU time a cgroup can use. The key mechanism is CFS bandwidth throttling.

CPU Bandwidth Throttling (CFS Quota)

The quota/period pair controls CPU bandwidth: within each period the cgroup can use CPU freely until it exhausts its quota, and it is then throttled until the next period begins. In cgroups v1 the two values are cpu.cfs_quota_us and cpu.cfs_period_us; in v2 they are the two fields of cpu.max.

Formula:

CPU% = quota_us / period_us × 100

A quota of 50000µs with a 100000µs period = 50% of one CPU core.

How Docker Uses This

docker run --cpus=0.5 sets quota=50000, period=100000 (50% of one CPU)

docker run --cpu-period=100000 --cpu-quota=200000 allows using 2 CPU cores worth of time

CPU Settings

File	Description	Example
`cpu.max`	Quota and period in µs	`50000 100000` = 50%
`cpu.weight`	Proportional share (1-10000)	`100` = default
`cpu.pressure`	PSI metrics (stall time)	Read-only

Quota Math

CPU cores usable = quota_us / period_us

# Examples:
50000/100000 = 0.5 cores (50% of one core)
200000/100000 = 2.0 cores (can use 2 cores fully)
max/100000 = unlimited (the default)
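
The same arithmetic can be applied to the raw content of cpu.max. This is a small sketch; `cpu_max_to_cores` is a hypothetical helper name.

```shell
# cpu_max_to_cores is a hypothetical helper: it converts the content of
# cpu.max ("<quota> <period>") into a number of usable cores.
cpu_max_to_cores() {
    printf '%s\n' "$1" | awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f\n", $1 / $2
    }'
}

cpu_max_to_cores "50000 100000"    # 0.50
cpu_max_to_cores "200000 100000"   # 2.00
cpu_max_to_cores "max 100000"      # unlimited

# Example against a real cgroup (path is an assumption):
# cpu_max_to_cores "$(cat /sys/fs/cgroup/myapp/cpu.max)"
```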

Memory Controller

The memory controller limits RAM usage and handles memory pressure.


Memory Settings

File	Description
`memory.max`	Hard limit (OOM kill if exceeded)
`memory.high`	Soft limit (throttle allocations)
`memory.low`	Memory protection (won't reclaim unless necessary)
`memory.min`	Guaranteed minimum (never reclaim)
`memory.current`	Current usage
`memory.swap.max`	Swap limit
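
To see how close a cgroup is to its hard limit, memory.current can be compared with memory.max. The sketch below does that arithmetic; `mem_usage_pct` is a hypothetical helper name, and the cgroup path in the comment is an assumption.

```shell
# mem_usage_pct is a hypothetical helper: given the values read from
# memory.current and memory.max, it prints usage as a percentage.
mem_usage_pct() {
    current=$1 max=$2
    if [ "$max" = "max" ]; then
        echo "no limit"
    else
        awk -v c="$current" -v m="$max" 'BEGIN { printf "%.1f\n", c * 100 / m }'
    fi
}

mem_usage_pct 268435456 536870912   # 50.0

# Example against a real cgroup (path is an assumption):
# cg=/sys/fs/cgroup/myapp
# mem_usage_pct "$(cat "$cg/memory.current")" "$(cat "$cg/memory.max")"
```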

The OOM Killer

When a cgroup exceeds memory.max and can't reclaim pages:

Kernel triggers the OOM killer
OOM killer selects a process in the cgroup to kill
Process is sent SIGKILL
Memory is freed

This is exactly what happens when a Docker container runs out of memory!

# Check if a cgroup has had OOM events
cat /sys/fs/cgroup/docker/abc123/memory.events
# oom 5
# oom_kill 5
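
For monitoring, the oom_kill counter can be pulled out of memory.events and compared between samples. The following is a minimal sketch; `oom_kills` is a hypothetical helper name and the sample input is made up for illustration.

```shell
# oom_kills is a hypothetical helper: it reads memory.events content on
# stdin and prints the oom_kill counter (0 if the key is absent).
oom_kills() {
    awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }'
}

# Sample input, made up for illustration
printf 'low 0\nhigh 12\nmax 40\noom 5\noom_kill 5\n' | oom_kills   # 5

# Example: oom_kills < /sys/fs/cgroup/docker/abc123/memory.events
```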

I/O Controller

Limits disk bandwidth and IOPS:

File	Description
`io.max`	Bandwidth/IOPS limits per device
`io.weight`	Proportional weight (1-10000)
`io.pressure`	PSI stall metrics

# Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > io.max

# Limit to 1000 read IOPS, 500 write IOPS
echo "8:0 riops=1000 wiops=500" > io.max
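
The byte values in io.max are easy to get wrong by hand. The sketch below builds the bandwidth entry from limits given in MiB/s; `io_max_line` is a hypothetical helper name, and the device number must be looked up for the actual disk.

```shell
# io_max_line is a hypothetical helper: it builds an io.max entry from a
# MAJ:MIN device number and read/write limits given in MiB/s.
io_max_line() {
    dev=$1 read_mib=$2 write_mib=$3
    echo "$dev rbps=$((read_mib * 1048576)) wbps=$((write_mib * 1048576))"
}

io_max_line 8:0 10 5   # 8:0 rbps=10485760 wbps=5242880

# Look up the MAJ:MIN of a disk with: lsblk -o NAME,MAJ:MIN
# Example (as root; cgroup path is an assumption):
# io_max_line 8:0 10 5 > /sys/fs/cgroup/myapp/io.max
```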

PIDs Controller

Prevents fork bombs by limiting the number of processes:

# Limit to 100 processes
echo 100 > /sys/fs/cgroup/myapp/pids.max

# Check current count
cat /sys/fs/cgroup/myapp/pids.current
# 47

How Docker Uses cgroups

When you run docker run with resource flags:

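# Resulting cgroup v2 settings:
#   --cpus=0.5                        -> cpu.max = "50000 100000"
#   --memory=512m                     -> memory.max = 536870912
#   --memory-swap=512m                -> memory.swap.max = 0 (no swap)
#   --pids-limit=100                  -> pids.max = 100
#   --device-read-bps /dev/sda:10mb   -> io.max (rbps for /dev/sda)
docker run \
  --cpus=0.5 \
  --memory=512m \
  --memory-swap=512m \
  --pids-limit=100 \
  --device-read-bps /dev/sda:10mb \
  nginx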

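With the cgroupfs cgroup driver, Docker creates a cgroup at /sys/fs/cgroup/docker/<container-id>/ and configures all these limits automatically. With the systemd driver, which is the usual default on cgroups v2 hosts, the container instead lives in a scope such as /sys/fs/cgroup/system.slice/docker-<container-id>.scope/.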
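
The memory flags involve a small amount of arithmetic: --memory-swap is the combined memory plus swap total, so the swap limit is the difference between the two flags. The sketch below illustrates this; both helpers are hypothetical and are not Docker code. They handle only single-letter suffixes (k, m, g) and not special values such as -1.

```shell
# Both helpers are hypothetical illustrations of the arithmetic,
# not Docker code.

# to_bytes converts a size such as 512m or 1g into bytes.
to_bytes() {
    num=${1%[bkmgBKMG]}
    case $1 in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

# swap_max derives memory.swap.max from --memory ($1) and --memory-swap ($2):
# --memory-swap is the memory+swap total, so swap = total - memory.
swap_max() {
    echo $(( $(to_bytes "$2") - $(to_bytes "$1") ))
}

to_bytes 512m        # 536870912
swap_max 512m 512m   # 0
swap_max 512m 1g     # 536870912
```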

Inspect Docker Container cgroups

# Show the configured cgroup parent (empty output means the daemon default)
docker inspect --format '{{.HostConfig.CgroupParent}}' mycontainer

# View all limits for a container
CONTAINER_ID=$(docker inspect --format '{{.Id}}' mycontainer)
cat /sys/fs/cgroup/docker/$CONTAINER_ID/memory.max
cat /sys/fs/cgroup/docker/$CONTAINER_ID/cpu.max

# Real-time resource usage
docker stats mycontainer

# Detailed cgroup info
cat /proc/$(docker inspect --format '{{.State.Pid}}' mycontainer)/cgroup
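
Because the cgroup directory depends on the cgroup driver, it is often more reliable to derive it from /proc/<pid>/cgroup than to hard-code it. The following is a minimal sketch that assumes cgroups v2 mounted at /sys/fs/cgroup; `cgroup_dir` is a hypothetical helper name and the sample scope name is made up.

```shell
# cgroup_dir is a hypothetical helper: it reads /proc/<pid>/cgroup content
# on stdin and prints the matching cgroup v2 directory, assuming the
# hierarchy is mounted at /sys/fs/cgroup.
cgroup_dir() {
    sed -n 's|^0::|/sys/fs/cgroup|p'
}

# Sample input, made up for illustration
printf '0::/system.slice/docker-abc123.scope\n' | cgroup_dir
# /sys/fs/cgroup/system.slice/docker-abc123.scope

# Example on a Docker host:
# pid=$(docker inspect --format '{{.State.Pid}}' mycontainer)
# cat "$(cgroup_dir < "/proc/$pid/cgroup")/memory.max"
```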

systemd and cgroups

systemd uses cgroups v2 extensively for service management. Every unit gets its own cgroup:

# View systemd cgroup structure
systemd-cgls

# Cgroup for a service
systemctl show docker.service --property=ControlGroup
# ControlGroup=/system.slice/docker.service

# Resource usage
systemctl status docker.service
# Memory: 150.4M
# CPU: 2.341s

Setting Limits via systemd

# /etc/systemd/system/myapp.service
[Service]
ExecStart=/usr/bin/myapp
MemoryMax=512M
CPUQuota=50%
TasksMax=100
IOWeight=50
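
For a unit that already exists, the same settings can go into a drop-in file instead of editing the unit itself. The sketch below only generates the drop-in text; `render_dropin` is a hypothetical helper name, and the unit name and file name in the comments are assumptions.

```shell
# render_dropin is a hypothetical helper: it prints a [Service] drop-in
# with the given memory and CPU limits.
render_dropin() {
    printf '[Service]\nMemoryMax=%s\nCPUQuota=%s\n' "$1" "$2"
}

render_dropin 512M 50%

# Example (as root; unit name and file name are assumptions):
# mkdir -p /etc/systemd/system/myapp.service.d
# render_dropin 512M 50% > /etc/systemd/system/myapp.service.d/limits.conf
# systemctl daemon-reload && systemctl restart myapp.service
```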

Pressure Stall Information (PSI)

Linux 4.20+ provides PSI metrics showing when processes are stalled waiting for resources:

cat /sys/fs/cgroup/docker/abc123/cpu.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=123456
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

cat /sys/fs/cgroup/docker/abc123/memory.pressure
# some avg10=5.23 avg60=3.15 avg300=1.82 total=987654321
# full avg10=2.11 avg60=1.03 avg300=0.54 total=123456789

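Metric	Meaning
`some`	Percentage of time at least one task was stalled on the resource
`full`	Percentage of time all non-idle tasks were stalled at once
`avg10/60/300`	Running averages over 10s, 60s, and 5min windows
`total`	Cumulative stall time in microseconds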

PSI is invaluable for detecting resource contention before it becomes critical.
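
A simple alerting check can be built on the avg10 field. The sketch below parses a pressure file and compares the value with a threshold; `psi_avg10` and `psi_over` are hypothetical helper names, and the sample reuses the memory.pressure output shown above.

```shell
# psi_avg10 is a hypothetical helper: it reads a *.pressure file on stdin
# and prints the avg10 value of the requested line (some or full).
psi_avg10() {
    awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }'
}

# psi_over returns success when a value ($1) exceeds a threshold ($2).
psi_over() {
    awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v + 0 > limit + 0) }'
}

sample='some avg10=5.23 avg60=3.15 avg300=1.82 total=987654321
full avg10=2.11 avg60=1.03 avg300=0.54 total=123456789'

printf '%s\n' "$sample" | psi_avg10 some   # 5.23

# Example (cgroup path and the 5% threshold are assumptions):
# v=$(psi_avg10 some < /sys/fs/cgroup/docker/abc123/memory.pressure)
# psi_over "$v" 5 && echo "memory pressure is high"
```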

Practical cgroup Management

Common cgroup Commands

# Create a cgroup
mkdir /sys/fs/cgroup/mygroup

# Enable controllers for child cgroups
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control

# Create a child cgroup
mkdir /sys/fs/cgroup/mygroup/worker

# Set limits (quota and period in µs: 100000/100000 = one full core)
echo "100000 100000" > /sys/fs/cgroup/mygroup/worker/cpu.max
echo 268435456 > /sys/fs/cgroup/mygroup/worker/memory.max
echo 50 > /sys/fs/cgroup/mygroup/worker/pids.max

# Add current shell to the cgroup
echo $$ > /sys/fs/cgroup/mygroup/worker/cgroup.procs

# View current cgroup
cat /proc/self/cgroup
# 0::/mygroup/worker

# View processes in a cgroup
cat /sys/fs/cgroup/mygroup/worker/cgroup.procs

# View cgroup events (OOM kills, etc.)
cat /sys/fs/cgroup/mygroup/worker/memory.events

# Remove a cgroup (must be empty)
rmdir /sys/fs/cgroup/mygroup/worker
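
Since a cgroup can only be removed once its children are gone, a whole subtree has to be removed deepest first. The following is a minimal sketch of that; `remove_cgroup_tree` is a hypothetical helper name.

```shell
# remove_cgroup_tree is a hypothetical helper: it removes a cgroup and all
# of its descendants, deepest first. Each rmdir fails if the cgroup still
# holds processes.
remove_cgroup_tree() {
    find "$1" -depth -type d -exec rmdir {} \;
}

# Example (as root): remove_cgroup_tree /sys/fs/cgroup/mygroup
```

The `-depth` option makes find visit children before their parent, which matches the order the kernel requires.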

Delegation (Unprivileged cgroup Management)

cgroups v2 allows delegating subtrees to non-root users:

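# As root: create a cgroup and delegate it to UID 1000. Hand over only
# the directory and its cgroup.* files, not the limits on user-1000 itself.
mkdir /sys/fs/cgroup/user-1000
chown 1000:1000 /sys/fs/cgroup/user-1000 \
  /sys/fs/cgroup/user-1000/cgroup.procs \
  /sys/fs/cgroup/user-1000/cgroup.subtree_control \
  /sys/fs/cgroup/user-1000/cgroup.threads

# As root: move the user's shell in first. A non-root process can only
# migrate processes when it can write cgroup.procs in the common ancestor
# of the source and destination cgroups. SHELL_PID is a placeholder.
echo "$SHELL_PID" > /sys/fs/cgroup/user-1000/cgroup.procs

# As UID 1000, from that shell: create and manage child cgroups
mkdir /sys/fs/cgroup/user-1000/myapp
echo $$ > /sys/fs/cgroup/user-1000/myapp/cgroup.procs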

This enables rootless containers (like Podman) to manage their own resource limits.
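
Before relying on a delegated subtree, an unprivileged tool can check that the hand-over is in place. The sketch below tests write access to the directory and two of the files that delegation covers; `is_delegated` is a hypothetical helper name.

```shell
# is_delegated is a hypothetical helper: it checks that the current user
# can write the directory and the files that cgroup v2 delegation hands
# over (cgroup.procs and cgroup.subtree_control).
is_delegated() {
    [ -d "$1" ] && [ -w "$1" ] &&
    [ -w "$1/cgroup.procs" ] &&
    [ -w "$1/cgroup.subtree_control" ]
}

# Example: is_delegated /sys/fs/cgroup/user-1000 && echo "delegated"
```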

Common Pitfalls

Watch Out For

Kernel memory accounting

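Page cache, tmpfs, and kernel structures count toward memory limits. Clean page cache can be reclaimed under pressure, but a 512MB container can still OOM with a 400MB heap when the rest is held by memory that cannot be reclaimed quickly, such as tmpfs files, dirty pages awaiting writeback, or kernel allocations.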

CPU throttling latency

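A container throttled to 10% CPU with the default 100ms period can run for 10ms and then stall for up to 90ms until the next period starts. For latency-sensitive apps, raise the quota or use a shorter period, which shortens each stall at the cost of burst capacity.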

I/O limits and buffered I/O

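Buffered writes go to page cache first and are throttled only later, at writeback. cgroups v1 could not throttle buffered writes at all; v2 charges writeback to the originating cgroup when both the io and memory controllers are enabled, but a burst of buffered writes can still exceed the limit temporarily.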

cgroups v1/v2 mixing

Don't mix v1 and v2 for the same controller. Modern systems should use v2 exclusively.
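
To check whether the CPU throttling pitfall applies to a workload, the counters in cpu.stat show how often the cgroup was actually throttled. The sketch below is illustrative only: `throttle_pct` and `max_stall_ms` are hypothetical helper names, the sample counters are made up, and the stall estimate assumes a single-threaded, CPU-bound task.

```shell
# throttle_pct is a hypothetical helper: it reads cpu.stat content on stdin
# and prints the percentage of periods in which the cgroup was throttled.
throttle_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", throttled * 100 / periods
            else print "0.0"
        }'
}

# max_stall_ms prints (period - quota) in ms for a cpu.max value: the
# longest stall per period for a single-threaded, CPU-bound task.
max_stall_ms() {
    printf '%s\n' "$1" | awk '$1 != "max" { print ($2 - $1) / 1000 }'
}

# Sample counters, made up for illustration
printf 'nr_periods 200\nnr_throttled 50\nthrottled_usec 12345\n' | throttle_pct   # 25.0
max_stall_ms "10000 100000"   # 90
```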

Essential Takeaways

cgroups limit resources, namespaces isolate visibility - both needed for containers

v2 unified hierarchy is the modern default - one cgroup per process

CPU quota/period controls bandwidth: 50000/100000 = 50% of one core

memory.max triggers OOM killer when exceeded - the "out of memory" container crash

Everything is files in /sys/fs/cgroup - read/write to control limits

PSI metrics reveal resource pressure before failures occur

systemd manages cgroups for services automatically via unit files

Docker flags like --cpus, --memory translate directly to cgroup settings

When to use cgroups (and when `nice` or `ulimit` is enough)

cgroups are the right answer when you need enforced, hierarchical, aggregable limits on a group of processes — anything else is a softer best-effort. They're overkill when you only need to deprioritise one process or cap a single resource for a single user.

Reach for cgroups when:

You're running containers — Docker, Podman, systemd-nspawn, and Kubernetes all use cgroups under the hood. If you're tuning container CPU or memory limits, you're tuning cgroups.
You need hard memory caps that kill the offender instead of degrading the whole machine — memory.max triggers the OOM killer scoped to the cgroup, not host-wide.
You're running multi-tenant workloads on shared hardware — Spark executors, CI runners, model-serving replicas — and need each tenant's CPU stolen back from a noisy neighbour.
You want per-service accounting — systemd's CPUAccounting= and MemoryAccounting= flags create cgroups so systemd-cgtop can show what each unit costs.
You need CPU pinning plus bandwidth limits together — cpuset.cpus plus cpu.max lets you say "this workload runs on cores 8-15 and never uses more than 4 cores' worth of time."

Stay with nice / ulimit / taskset when:

You just want a single process to yield CPU under contention — nice -n 19 ./batch_job is a one-line answer that doesn't need a cgroup hierarchy.
You're enforcing per-user file-descriptor or process-count caps — /etc/security/limits.conf handles that without root-owned cgroup tooling.
You're on a kernel that predates cgroups v2 and don't want to deal with the v1 controller-per-mountpoint mess — most distro defaults moved to v2 around 2022, but legacy hosts exist.
The workload is a single short-lived process — cgroup setup costs more in operator time than the run itself takes.
You need CPU affinity but not bandwidth limits — taskset -c 8-15 ./workload is simpler and doesn't require a cgroup.

The honest default on modern Linux: if you're running services through systemd, you're already getting cgroups v2 for free (one cgroup per service, grouped under slices, and one scope per session). Tune the unit file with CPUQuota=, MemoryMax=, and IOWeight= instead of writing to files under /sys/fs/cgroup by hand.
