Linux cgroups: Resource Limits for Processes

Master cgroups to limit CPU, memory, and I/O for process groups. Understand cgroups v1 vs v2, the hierarchical structure, and how containers use them.

The Noisy Neighbor Problem

Imagine an apartment building where one tenant decides to throw a party every night with music at full volume. Without any rules (limits), they ruin everyone else's experience. On a Linux system, the "noisy neighbor" might be a process that:

  • Consumes 100% CPU, starving other processes
  • Allocates all available memory, triggering the OOM killer
  • Saturates disk I/O, making the system unresponsive

Control Groups (cgroups) solve this by letting you set resource limits on groups of processes. While namespaces provide isolation (hiding resources), cgroups provide allocation (limiting resources).

Analogy: Budget Allocation

Think of cgroups like departmental budgets in a company:

  • CPU quota = Time budget (hours employees can work)
  • Memory limit = Office space (square footage allocated)
  • I/O bandwidth = Shared equipment usage time
  • PIDs limit = Headcount cap

Departments (process groups) must work within their budgets regardless of how much total resource exists.

cgroups Architecture

cgroups organize processes into a hierarchy where each node can have resource limits. Understanding this structure is key to effective container resource management.

Cgroup Hierarchy Example

cgroups v1 uses a separate hierarchy per controller, while v2 uses a single unified hierarchy. A typical v2 layout looks like this (each cgroup lists the controllers enabled for it):

/sys/fs/cgroup (unified root: cpu, memory, io, ...)
├── system.slice            (cpu, memory, io)
│   ├── docker.service      (cpu, memory)    1 proc
│   └── sshd.service        (cpu, memory)    1 proc
├── user.slice              (cpu, memory, io, ...)
│   └── user-1000.slice     (cpu, memory, pids)
└── docker                  (cpu, memory, io, ...)
    ├── abc123...           2 procs
    └── def456...           1 proc

cgroups v1

  • Separate hierarchy per controller
  • Process can be in different groups per controller
  • More flexible but complex
  • Legacy, but still widely used

cgroups v2

  • Single unified hierarchy
  • Process belongs to exactly one cgroup
  • Controllers enabled per-cgroup
  • Default on modern systems (kernel 5.x+)

Key Concepts

Concept    | Description
Hierarchy  | Tree structure of cgroups (directories in /sys/fs/cgroup)
Controller | A resource type that can be limited (CPU, memory, I/O, PIDs)
Cgroup     | A node in the hierarchy, a directory containing limit files
Task       | A process or thread assigned to a cgroup

The cgroup Filesystem

cgroups are controlled through a pseudo-filesystem, typically mounted at /sys/fs/cgroup:

$ ls /sys/fs/cgroup/
cgroup.controllers      cpu.pressure   memory.current
cgroup.max.depth        cpu.stat       memory.max
cgroup.max.descendants  io.max         memory.min
cgroup.procs            io.pressure    pids.current
cgroup.subtree_control  io.stat        pids.max

To limit a process, you write values to these files:

# Create a cgroup
mkdir /sys/fs/cgroup/myapp

# Set memory limit to 512MB
echo 536870912 > /sys/fs/cgroup/myapp/memory.max

# Set CPU limit to 50% of one core
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max

# Add process to the cgroup
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs
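
To confirm the limits took effect, you can read the interface files back. A minimal sketch, reusing the myapp cgroup from the example above; the numbers shown are illustrative:

# Read back the limits
cat /sys/fs/cgroup/myapp/memory.max    # 536870912
cat /sys/fs/cgroup/myapp/cpu.max       # 50000 100000

# Current memory usage of everything in the cgroup
cat /sys/fs/cgroup/myapp/memory.current

# Throttling statistics: nr_throttled counts periods in which
# the cgroup ran out of CPU quota
cat /sys/fs/cgroup/myapp/cpu.stat
# usage_usec 1234567
# nr_periods 120
# nr_throttled 14
# throttled_usec 380000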

cgroups v1 vs v2

Linux has two versions of cgroups, with v2 being the modern default:

Feature            | cgroups v1                | cgroups v2
Hierarchy          | Separate per controller   | Single unified
Process membership | Can differ per controller | One cgroup only
Delegation         | Complex, error-prone      | Clean subtree delegation
Default (2024+)    | Legacy                    | Default
Docker support     | Full                      | Full (recent versions)

Why Unified Hierarchy Matters

In v1, a process could be in /cpu/app for CPU limits but /memory/web for memory limits. This created confusion and made resource accounting inconsistent.

v2's unified hierarchy means a process is in exactly one cgroup, and that cgroup can have multiple controllers enabled. This is simpler to manage and reason about.

# v2: Enable controllers for a cgroup
echo "+cpu +memory +io" > /sys/fs/cgroup/myapp/cgroup.subtree_control

# Now child cgroups can use these controllers
mkdir /sys/fs/cgroup/myapp/worker1
echo 268435456 > /sys/fs/cgroup/myapp/worker1/memory.max
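
You can also check which controllers a cgroup has available and which it has enabled for its children. A quick sketch with the same myapp path; the exact output varies by kernel and configuration:

# Controllers available in this cgroup (granted by its parent)
cat /sys/fs/cgroup/myapp/cgroup.controllers
# cpu io memory pids

# Controllers currently enabled for child cgroups
cat /sys/fs/cgroup/myapp/cgroup.subtree_control
# cpu io memory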

Resource Controllers

CPU Controller

The CPU controller limits how much CPU time a cgroup can use. The key mechanism is CFS bandwidth throttling.

CPU Bandwidth Throttling (CFS Quota)

The cpu.max quota/period pair controls CPU bandwidth: a process can use the CPU freely until it exhausts its quota, then it is throttled until the next period begins.

Formula:

CPU% = quota_us / period_us × 100

A quota of 50000µs with 100000µs period = 50% of one CPU core.

How Docker Uses This

docker run --cpus=0.5 sets quota=50000, period=100000 (50% of one CPU)

docker run --cpu-period=100000 --cpu-quota=200000 allows using 2 CPU cores worth of time
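
You can verify this translation yourself. A hedged sketch: the container name is arbitrary, and the cgroup path depends on Docker's cgroup driver (cgroupfs puts containers under /sys/fs/cgroup/docker/, the systemd driver uses a docker-<id>.scope under system.slice):

docker run -d --name capped --cpus=1.5 nginx

# Look up the container's cpu.max in whichever location applies
CID=$(docker inspect --format '{{.Id}}' capped)
cat /sys/fs/cgroup/docker/$CID/cpu.max 2>/dev/null \
  || cat /sys/fs/cgroup/system.slice/docker-$CID.scope/cpu.max
# 150000 100000   -> 1.5 cores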

CPU Settings

File         | Description                  | Example
cpu.max      | Quota and period in µs       | 50000 100000 = 50%
cpu.weight   | Proportional share (1-10000) | 100 = default
cpu.pressure | PSI metrics (stall time)     | Read-only

Quota Math

CPU cores usable = quota_us / period_us

# Examples:
50000/100000  = 0.5 cores (50% of one core)
200000/100000 = 2.0 cores (can use 2 cores fully)
max/100000    = unlimited (the default)

Memory Controller

The memory controller limits RAM usage and handles memory pressure:

For example, run a memory-hungry workload with a memory limit of roughly half its working set: memory climbs until it hits the limit, then the OOM killer terminates the process - exactly what happens in real containers.

Memory Settings

File            | Description
memory.max      | Hard limit (OOM kill if exceeded)
memory.high     | Soft limit (throttle allocations)
memory.low      | Memory protection (won't reclaim unless necessary)
memory.min      | Guaranteed minimum (never reclaim)
memory.current  | Current usage
memory.swap.max | Swap limit
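
memory.high and memory.max are often used together: set memory.high somewhat below memory.max so reclaim and throttling kick in before the hard limit is hit. A minimal sketch; the cgroup name and byte values are illustrative:

# Hard limit: 512MB - exceeding this triggers the OOM killer
echo 536870912 > /sys/fs/cgroup/myapp/memory.max

# Soft limit: 448MB - above this, allocations are slowed and reclaim runs
echo 469762048 > /sys/fs/cgroup/myapp/memory.high

# memory.events counts how often each threshold was breached
cat /sys/fs/cgroup/myapp/memory.events
# low 0
# high 12
# max 0
# oom 0
# oom_kill 0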

The OOM Killer

When a cgroup exceeds memory.max and can't reclaim pages:

  1. Kernel triggers the OOM killer
  2. OOM killer selects a process in the cgroup to kill
  3. Process is sent SIGKILL
  4. Memory is freed

This is exactly what happens when a Docker container runs out of memory!

# Check if a cgroup has had OOM events
cat /sys/fs/cgroup/docker/abc123/memory.events
# oom 5
# oom_kill 5
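
From the Docker side, an OOM-killed container is visible in its state. A quick check (mycontainer is a placeholder name):

# Did the container die because of its memory limit?
docker inspect --format '{{.State.OOMKilled}}' mycontainer
# true

# Exit code 137 = 128 + 9 (SIGKILL), typical for an OOM kill
docker inspect --format '{{.State.ExitCode}}' mycontainer
# 137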

I/O Controller

Limits disk bandwidth and IOPS:

File        | Description
io.max      | Bandwidth/IOPS limits per device
io.weight   | Proportional weight (1-10000)
io.pressure | PSI stall metrics
# Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > io.max

# Limit to 1000 read IOPS, 500 write IOPS
echo "8:0 riops=1000 wiops=500" > io.max
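
io.max is keyed by major:minor device numbers, which you can look up with lsblk. A short sketch; device names and numbers vary per system:

# Find the major:minor numbers for block devices
lsblk -o NAME,MAJ:MIN,TYPE
# NAME   MAJ:MIN TYPE
# sda      8:0   disk
# ├─sda1   8:1   part
# └─sda2   8:2   part

# Remove a limit again by writing "max"
echo "8:0 rbps=max wbps=max" > io.max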

PIDs Controller

Prevents fork bombs by limiting the number of processes:

# Limit to 100 processes
echo 100 > /sys/fs/cgroup/myapp/pids.max

# Check current count
cat /sys/fs/cgroup/myapp/pids.current
# 47
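
Once the limit is reached, fork() and clone() in that cgroup fail with EAGAIN. From an interactive shell this typically looks like the sketch below (the exact error text varies by shell):

# With pids.max = 100 and 100 tasks already running, new processes fail:
ls
# bash: fork: retry: Resource temporarily unavailable

# pids.events counts how often the limit was hit
cat /sys/fs/cgroup/myapp/pids.events
# max 37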

How Docker Uses cgroups

When you run docker run with resource flags:

# --cpus=0.5                       -> cpu.max = "50000 100000"
# --memory=512m                    -> memory.max = 536870912
# --memory-swap=512m               -> memory.swap.max = 0 (no swap)
# --pids-limit=100                 -> pids.max = 100
# --device-read-bps /dev/sda:10mb  -> io.max
docker run \
  --cpus=0.5 \
  --memory=512m \
  --memory-swap=512m \
  --pids-limit=100 \
  --device-read-bps /dev/sda:10mb \
  nginx

Docker creates a cgroup at /sys/fs/cgroup/docker/<container-id>/ and configures all these limits automatically.

📋 Inspect Docker Container cgroups

# Find container's cgroup path
docker inspect --format '{{.HostConfig.CgroupParent}}' mycontainer

# View all limits for a container
CONTAINER_ID=$(docker inspect --format '{{.Id}}' mycontainer)
cat /sys/fs/cgroup/docker/$CONTAINER_ID/memory.max
cat /sys/fs/cgroup/docker/$CONTAINER_ID/cpu.max

# Real-time resource usage
docker stats mycontainer

# Detailed cgroup info
cat /proc/$(docker inspect --format '{{.State.Pid}}' mycontainer)/cgroup

systemd and cgroups

systemd uses cgroups v2 extensively for service management. Every unit gets its own cgroup:

# View systemd cgroup structure
systemd-cgls

# Cgroup for a service
systemctl show docker.service --property=ControlGroup
# ControlGroup=/system.slice/docker.service

# Resource usage
systemctl status docker.service
# Memory: 150.4M
# CPU: 2.341s

Setting Limits via systemd

# /etc/systemd/system/myapp.service
[Service]
ExecStart=/usr/bin/myapp
MemoryMax=512M
CPUQuota=50%
TasksMax=100
IOWeight=50
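
Limits can also be applied without editing unit files, either transiently with systemd-run or on the fly with systemctl set-property. A couple of examples as a sketch (stress-ng is just a stand-in workload, and the values are arbitrary):

# Run a one-off command in its own scope with limits
systemd-run --scope -p MemoryMax=256M -p CPUQuota=25% -- stress-ng --vm 1

# Change a running service's limit (add --runtime to avoid persisting it)
systemctl set-property docker.service MemoryMax=1G

# Per-cgroup resource usage, top-style
systemd-cgtop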

Pressure Stall Information (PSI)

Linux 4.20+ provides PSI metrics showing when processes are stalled waiting for resources:

cat /sys/fs/cgroup/docker/abc123/cpu.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=123456
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

cat /sys/fs/cgroup/docker/abc123/memory.pressure
# some avg10=5.23 avg60=3.15 avg300=1.82 total=987654321
# full avg10=2.11 avg60=1.03 avg300=0.54 total=123456789

Metric       | Meaning
some         | Percentage of time some tasks are stalled
full         | Percentage of time all tasks are stalled
avg10/60/300 | Averages over 10s, 60s, 5min

PSI is invaluable for detecting resource contention before it becomes critical.
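
One low-tech way to watch for building pressure is simply to poll these files; a minimal sketch, reusing the example cgroup path from above:

# Refresh a cgroup's memory pressure every second
watch -n1 cat /sys/fs/cgroup/docker/abc123/memory.pressure

# System-wide PSI lives under /proc/pressure/
cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io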

Practical cgroup Management

📋 Common cgroup Commands

# Create a cgroup
mkdir /sys/fs/cgroup/mygroup

# Enable controllers for child cgroups
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control

# Create a child cgroup
mkdir /sys/fs/cgroup/mygroup/worker

# Set limits
echo 100000 > /sys/fs/cgroup/mygroup/worker/cpu.max
echo 268435456 > /sys/fs/cgroup/mygroup/worker/memory.max
echo 50 > /sys/fs/cgroup/mygroup/worker/pids.max

# Add current shell to the cgroup
echo $$ > /sys/fs/cgroup/mygroup/worker/cgroup.procs

# View current cgroup
cat /proc/self/cgroup
# 0::/mygroup/worker

# View processes in a cgroup
cat /sys/fs/cgroup/mygroup/worker/cgroup.procs

# View cgroup events (OOM kills, etc.)
cat /sys/fs/cgroup/mygroup/worker/memory.events

# Remove a cgroup (must be empty)
rmdir /sys/fs/cgroup/mygroup/worker

Delegation (Unprivileged cgroup Management)

cgroups v2 allows delegating subtrees to non-root users:

# Create a cgroup for user 1000
mkdir /sys/fs/cgroup/user-1000
chown -R 1000:1000 /sys/fs/cgroup/user-1000

# User can now create and manage child cgroups
su - user1000
mkdir /sys/fs/cgroup/user-1000/myapp
echo $$ > /sys/fs/cgroup/user-1000/myapp/cgroup.procs

This enables rootless containers (like Podman) to manage their own resource limits.
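
On systemd systems this delegation is usually already in place: each login session gets a delegated subtree under user@<uid>.service. A rough way to check it, assuming uid 1000; the delegated controller set depends on the systemd version:

# As the unprivileged user: which cgroup am I in?
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/user@1000.service/app.slice/...

# Which controllers has systemd delegated to the user?
cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers
# memory pids            (cpu and io are added on newer systemd versions)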

Common Pitfalls

Watch Out For

1. Kernel memory accounting

Page cache and kernel structures (slab, socket buffers) count toward the memory limit. A 512MB container might OOM even with only a 400MB heap because buffer cache and kernel memory push usage over the limit.

2. CPU throttling latency

A container throttled to 10% CPU can see latency spikes approaching 100ms at period boundaries with the default 100ms period. For latency-sensitive apps, use a smaller period or a higher quota so throttle stalls are shorter.

3. I/O limits and buffered I/O

Bandwidth limits are enforced most directly on direct I/O. Buffered writes land in the page cache first and may exceed the limits temporarily; cgroups v2 can attribute writeback to the originating cgroup, but the behavior depends on filesystem support.

4. cgroups v1/v2 mixing

Don't mix v1 and v2 for the same controller. Modern systems should use v2 exclusively.
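
To check which version a system is actually running, look at the filesystem type mounted at /sys/fs/cgroup; a quick check:

# cgroup2fs  -> pure cgroups v2 (unified)
# tmpfs      -> cgroups v1 (per-controller cgroup mounts underneath)
stat -fc %T /sys/fs/cgroup/

# List cgroup mounts explicitly
mount | grep cgroup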

Essential Takeaways

1. cgroups limit resources, namespaces isolate visibility - both needed for containers
2. v2 unified hierarchy is the modern default - one cgroup per process
3. CPU quota/period controls bandwidth: 50000/100000 = 50% of one core
4. memory.max triggers the OOM killer when exceeded - the "out of memory" container crash
5. Everything is files in /sys/fs/cgroup - read/write to control limits
6. PSI metrics reveal resource pressure before failures occur
7. systemd manages cgroups for services automatically via unit files
8. Docker flags like --cpus, --memory translate directly to cgroup settings

If you found this explanation helpful, consider sharing it with others.
