The Noisy Neighbor Problem
Imagine an apartment building where one tenant decides to throw a party every night with music at full volume. Without any rules (limits), they ruin everyone else's experience. On a Linux system, the "noisy neighbor" might be a process that:
- Consumes 100% CPU, starving other processes
- Allocates all available memory, triggering the OOM killer
- Saturates disk I/O, making the system unresponsive
Control Groups (cgroups) solve this by letting you set resource limits on groups of processes. While namespaces provide isolation (hiding resources), cgroups provide allocation (limiting resources).
Analogy: Budget Allocation
Think of cgroups like departmental budgets in a company:
- CPU quota = Time budget (hours employees can work)
- Memory limit = Office space (square footage allocated)
- I/O bandwidth = Shared equipment usage time
- PIDs limit = Headcount cap
Departments (process groups) must work within their budgets regardless of how much total resource exists.
cgroups Architecture
cgroups organize processes into a hierarchy where each node can have resource limits. Understanding this structure is key to effective container resource management.
The two versions organize this hierarchy differently: v1 keeps a separate hierarchy per controller, while v2 uses a single unified tree.
cgroups v1
- Separate hierarchy per controller
- Process can be in different groups per controller
- More flexible but complex
- Legacy, but still widely used
cgroups v2
- Single unified hierarchy
- Process belongs to exactly one cgroup
- Controllers enabled per-cgroup
- Default on modern distributions
Key Concepts
| Concept | Description |
|---|---|
| Hierarchy | Tree structure of cgroups (directories in /sys/fs/cgroup) |
| Controller | A resource type that can be limited (CPU, memory, I/O, PIDs) |
| Cgroup | A node in the hierarchy, a directory containing limit files |
| Task | A process or thread assigned to a cgroup |
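A quick way to see these pieces on a real system — a rough sketch, assuming the default v2 mount at /sys/fs/cgroup; the exact output will differ from machine to machine:

```
# Which cgroup filesystem is mounted (cgroup2 = the v2 unified hierarchy)
mount -t cgroup2

# Controllers available at the root of the hierarchy
cat /sys/fs/cgroup/cgroup.controllers
# cpuset cpu io memory hugetlb pids

# Which cgroup the current shell belongs to
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-3.scope
```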
The cgroup Filesystem
cgroups are controlled through a pseudo-filesystem, typically mounted at /sys/fs/cgroup:
```
$ ls /sys/fs/cgroup/
cgroup.controllers      cpu.pressure   memory.current
cgroup.max.depth        cpu.stat       memory.max
cgroup.max.descendants  io.max         memory.min
cgroup.procs            io.pressure    pids.current
cgroup.subtree_control  io.stat        pids.max
```
To limit a process, you write values to these files:
```
# Create a cgroup
mkdir /sys/fs/cgroup/myapp

# Set memory limit to 512MB
echo 536870912 > /sys/fs/cgroup/myapp/memory.max

# Set CPU limit to 50% of one core
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max

# Add process to the cgroup
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs
```
cgroups v1 vs v2
Linux has two versions of cgroups, with v2 being the modern default:
| Feature | cgroups v1 | cgroups v2 |
|---|---|---|
| Hierarchy | Separate per controller | Single unified |
| Process membership | Can differ per controller | One cgroup only |
| Delegation | Complex, error-prone | Clean subtree delegation |
| Status (2024+) | Legacy | Default |
| Docker support | Full | Full (recent versions) |
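To check which version a particular machine is running, one common technique is to look at the filesystem type mounted at /sys/fs/cgroup:

```
# cgroup2fs = v2 unified hierarchy; tmpfs = v1 (per-controller hierarchies mounted beneath it)
stat -fc %T /sys/fs/cgroup/

# Or list all cgroup-related mounts
grep cgroup /proc/mounts
```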
Why Unified Hierarchy Matters
In v1, a process could be in /cpu/app for CPU limits but /memory/web for memory limits. This created confusion and made resource accounting inconsistent.
v2's unified hierarchy means a process is in exactly one cgroup, and that cgroup can have multiple controllers enabled. This is simpler to manage and reason about.
```
# v2: Enable controllers for a cgroup
echo "+cpu +memory +io" > /sys/fs/cgroup/myapp/cgroup.subtree_control

# Now child cgroups can use these controllers
mkdir /sys/fs/cgroup/myapp/worker1
echo 268435456 > /sys/fs/cgroup/myapp/worker1/memory.max
```
Resource Controllers
CPU Controller
The CPU controller limits how much CPU time a cgroup can use. The key mechanism is CFS bandwidth throttling.
CPU Bandwidth Throttling (CFS Quota)
A process can use the CPU freely until it exhausts its quota for the current period; it is then throttled until the next period begins.
Formula:
CPU% = quota_us / period_us × 100
A quota of 50000µs with 100000µs period = 50% of one CPU core.
How Docker Uses This
- docker run --cpus=0.5 sets quota=50000, period=100000 (50% of one CPU)
- docker run --cpu-period=100000 --cpu-quota=200000 allows using 2 CPU cores' worth of time
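You can watch throttling happen by reproducing the --cpus=0.5 case by hand. A minimal sketch, assuming a v2 hierarchy at /sys/fs/cgroup, root privileges, and that the cpu controller is enabled for child cgroups; the cgroup name cputest and the counter values are illustrative:

```
# If cpu.max does not appear in the new cgroup, enable the controller first:
#   echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/cputest
echo "50000 100000" > /sys/fs/cgroup/cputest/cpu.max   # 50% of one core

yes > /dev/null &                                # any CPU-bound process will do
echo $! > /sys/fs/cgroup/cputest/cgroup.procs    # move it into the cgroup

sleep 5
cat /sys/fs/cgroup/cputest/cpu.stat
# nr_periods 50
# nr_throttled 49        <- hit the quota in nearly every period
# throttled_usec 2450000

kill $!
```

While it runs, top should show the process hovering around 50% of one CPU.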
CPU Settings
| File | Description | Example |
|---|---|---|
| cpu.max | Quota and period in µs | 50000 100000 = 50% |
| cpu.weight | Proportional share (1-10000) | 100 = default |
| cpu.pressure | PSI metrics (stall time) | Read-only |
Quota Math
```
CPU cores usable = quota_us / period_us

# Examples:
50000/100000   = 0.5 cores (50% of one core)
200000/100000  = 2.0 cores (can use 2 cores fully)
max/100000     = unlimited (the default)
```
Memory Controller
The memory controller limits RAM usage and handles memory pressure.
Memory Settings
| File | Description |
|---|---|
memory.max | Hard limit (OOM kill if exceeded) |
memory.high | Soft limit (throttle allocations) |
memory.low | Memory protection (won't reclaim unless necessary) |
memory.min | Guaranteed minimum (never reclaim) |
memory.current | Current usage |
memory.swap.max | Swap limit |
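A sketch of how memory.high and memory.max are typically combined — the memtest name and the 400M/512M values are arbitrary, and the memory controller must be enabled for child cgroups:

```
mkdir /sys/fs/cgroup/memtest

echo 400M > /sys/fs/cgroup/memtest/memory.high   # soft limit: heavy reclaim and throttling above this
echo 512M > /sys/fs/cgroup/memtest/memory.max    # hard limit: OOM kill if usage can't be kept below

echo $$ > /sys/fs/cgroup/memtest/cgroup.procs    # move the current shell (and its children) in

# Watch usage and limit events while a workload runs
cat /sys/fs/cgroup/memtest/memory.current
cat /sys/fs/cgroup/memtest/memory.events         # counters: low, high, max, oom, oom_kill
```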
The OOM Killer
When a cgroup exceeds memory.max and can't reclaim pages:
- Kernel triggers the OOM killer
- OOM killer selects a process in the cgroup to kill
- Process is sent SIGKILL
- Memory is freed
This is exactly what happens when a Docker container runs out of memory!
```
# Check if a cgroup has had OOM events
cat /sys/fs/cgroup/docker/abc123/memory.events
# oom 5
# oom_kill 5
```
I/O Controller
Limits disk bandwidth and IOPS:
| File | Description |
|---|---|
io.max | Bandwidth/IOPS limits per device |
io.weight | Proportional weight (1-10000) |
io.pressure | PSI stall metrics |
```
# Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > io.max

# Limit to 1000 read IOPS, 500 write IOPS
echo "8:0 riops=1000 wiops=500" > io.max
```
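The 8:0 prefix is the block device's major:minor number. One way to look it up (output here is illustrative):

```
# The MAJ:MIN column is the identifier io.max expects
lsblk -o NAME,MAJ:MIN,TYPE
# NAME   MAJ:MIN TYPE
# sda    8:0     disk
# └─sda1 8:1     part

# Or read it straight from the device node (major, minor appear before the date)
ls -l /dev/sda
# brw-rw---- 1 root disk 8, 0 Jan  1 00:00 /dev/sda
```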
PIDs Controller
Prevents fork bombs by limiting the number of processes:
```
# Limit to 100 processes
echo 100 > /sys/fs/cgroup/myapp/pids.max

# Check current count
cat /sys/fs/cgroup/myapp/pids.current
# 47
```
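Once pids.current reaches pids.max, further fork()/clone() calls in the cgroup fail with EAGAIN; nothing gets killed. From a shell inside the limited cgroup it looks roughly like this (the exact error text depends on the shell):

```
sleep 1 &
# bash: fork: retry: Resource temporarily unavailable

# pids.events counts how many forks were refused because of the limit
cat /sys/fs/cgroup/myapp/pids.events
# max 12
```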
How Docker Uses cgroups
When you run docker run with resource flags:
```
docker run \
  --cpus=0.5 \                         # cpu.max = "50000 100000"
  --memory=512m \                      # memory.max = 536870912
  --memory-swap=512m \                 # memory.swap.max = 0 (no swap)
  --pids-limit=100 \                   # pids.max = 100
  --device-read-bps /dev/sda:10mb \    # io.max
  nginx
```
Docker creates a cgroup at /sys/fs/cgroup/docker/<container-id>/ and configures all these limits automatically.
📋 Inspect Docker Container cgroups
```
# Find container's cgroup path
docker inspect --format '{{.HostConfig.CgroupParent}}' mycontainer

# View all limits for a container
CONTAINER_ID=$(docker inspect --format '{{.Id}}' mycontainer)
cat /sys/fs/cgroup/docker/$CONTAINER_ID/memory.max
cat /sys/fs/cgroup/docker/$CONTAINER_ID/cpu.max

# Real-time resource usage
docker stats mycontainer

# Detailed cgroup info
cat /proc/$(docker inspect --format '{{.State.Pid}}' mycontainer)/cgroup
```
systemd and cgroups
systemd uses cgroups v2 extensively for service management. Every unit gets its own cgroup:
```
# View systemd cgroup structure
systemd-cgls

# Cgroup for a service
systemctl show docker.service --property=ControlGroup
# ControlGroup=/system.slice/docker.service

# Resource usage
systemctl status docker.service
# Memory: 150.4M
# CPU: 2.341s
```
Setting Limits via systemd
```
# /etc/systemd/system/myapp.service
[Service]
ExecStart=/usr/bin/myapp
MemoryMax=512M
CPUQuota=50%
TasksMax=100
IOWeight=50
```
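The same properties work with systemd-run, which creates a transient unit (and therefore a cgroup) on the fly — handy for experimenting without writing a unit file. The command names below are placeholders for whatever you want to constrain:

```
# Run a command in its own transient scope with resource limits applied
systemd-run --scope -p MemoryMax=512M -p CPUQuota=50% -p TasksMax=100 -- mycommand --args

# Or launch it as a background transient service
systemd-run --unit=myapp-test -p MemoryMax=512M /usr/bin/myapp
```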
Pressure Stall Information (PSI)
Linux 4.20+ provides PSI metrics showing when processes are stalled waiting for resources:
```
cat /sys/fs/cgroup/docker/abc123/cpu.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=123456
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

cat /sys/fs/cgroup/docker/abc123/memory.pressure
# some avg10=5.23 avg60=3.15 avg300=1.82 total=987654321
# full avg10=2.11 avg60=1.03 avg300=0.54 total=123456789
```
| Metric | Meaning |
|---|---|
| some | Percentage of time some tasks are stalled |
| full | Percentage of time all tasks are stalled |
| avg10/60/300 | Averages over 10s, 60s, 5min |
PSI is invaluable for detecting resource contention before it becomes critical.
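One simple way to act on PSI is a polling loop that warns when the short-term average crosses a threshold — a rough sketch, with an arbitrary cgroup path and a made-up 10% threshold:

```
#!/bin/bash
# Warn when this cgroup spends >10% of the last 10s with some tasks stalled on memory
CGROUP=/sys/fs/cgroup/docker/abc123
THRESHOLD=10.0

while sleep 10; do
    # Pull the "some avg10" value out of memory.pressure
    avg10=$(awk '/^some/ { sub("avg10=", "", $2); print $2 }' "$CGROUP/memory.pressure")
    if awk -v a="$avg10" -v t="$THRESHOLD" 'BEGIN { exit !(a+0 > t+0) }'; then
        echo "memory pressure high: some avg10=${avg10}%" >&2
    fi
done
```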
Practical cgroup Management
📋 Common cgroup Commands
```
# Create a cgroup
mkdir /sys/fs/cgroup/mygroup

# Enable controllers for child cgroups
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control

# Create a child cgroup
mkdir /sys/fs/cgroup/mygroup/worker

# Set limits
echo 100000 > /sys/fs/cgroup/mygroup/worker/cpu.max
echo 268435456 > /sys/fs/cgroup/mygroup/worker/memory.max
echo 50 > /sys/fs/cgroup/mygroup/worker/pids.max

# Add current shell to the cgroup
echo $$ > /sys/fs/cgroup/mygroup/worker/cgroup.procs

# View current cgroup
cat /proc/self/cgroup
# 0::/mygroup/worker

# View processes in a cgroup
cat /sys/fs/cgroup/mygroup/worker/cgroup.procs

# View cgroup events (OOM kills, etc.)
cat /sys/fs/cgroup/mygroup/worker/memory.events

# Remove a cgroup (must be empty)
rmdir /sys/fs/cgroup/mygroup/worker
```
Delegation (Unprivileged cgroup Management)
cgroups v2 allows delegating subtrees to non-root users:
```
# Create a cgroup for user 1000
mkdir /sys/fs/cgroup/user-1000
chown -R 1000:1000 /sys/fs/cgroup/user-1000

# User can now create and manage child cgroups
su - user1000
mkdir /sys/fs/cgroup/user-1000/myapp
echo $$ > /sys/fs/cgroup/user-1000/myapp/cgroup.procs
```
This enables rootless containers (like Podman) to manage their own resource limits.
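On systemd-based systems you usually don't chown cgroup directories yourself: systemd already delegates a subtree to each logged-in user's user@.service, and rootless runtimes build on top of it. To see what has been delegated to your user (the path assumes the standard systemd layout):

```
# Controllers delegated to the current user's systemd instance
# (the set varies with systemd version and configuration)
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers

# Rootless Podman reports which cgroup version and manager it is using
podman info 2>/dev/null | grep -i cgroup
```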
Common Pitfalls
Watch Out For
- Page cache and kernel structures count toward memory limits. A container with a 512MB limit can hit OOM even though its heap is only 400MB, because buffered file I/O is also charged to the cgroup (see the memory.stat breakdown below).
- A container throttled to 10% CPU with the default 100ms period can stall for up to ~90ms at each period boundary once its quota runs out. For latency-sensitive apps, use a shorter period or a larger quota.
- I/O limits are most reliable for direct I/O. Buffered writes land in the page cache first and may temporarily exceed the limits before writeback throttling catches up.
- Don't mix v1 and v2 for the same controller. Modern systems should use v2 exclusively.
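To see how much of a cgroup's memory charge is page cache versus anonymous memory (the first pitfall above), memory.stat breaks the total down — the container ID and numbers here are illustrative:

```
# anon = heap/stack pages, file = page cache; both count toward memory.max
grep -E '^(anon|file|slab) ' /sys/fs/cgroup/docker/abc123/memory.stat
# anon 268435456
# file 134217728
# slab 8388608
```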
Essential Takeaways
- Namespaces isolate what processes can see; cgroups limit what they can use.
- cgroups v2's unified hierarchy is the modern default: each process lives in exactly one cgroup, with controllers enabled per subtree.
- The main controllers are cpu (quota/period and weight), memory (max/high/low/min), io (bandwidth and IOPS), and pids (process count).
- Docker and systemd configure these limits for you; the files under /sys/fs/cgroup show exactly what was set.
- PSI metrics (cpu/memory/io.pressure) reveal resource contention before it becomes critical.
Related Concepts
- Linux Namespaces: Isolation for process views (complementary to cgroups)
- Containers Under the Hood: How namespaces + cgroups create containers
- Memory Management: How Linux manages memory and page cache
- Process Management: Understanding processes and scheduling
