The Noisy Neighbor Problem
Linux cgroups exist because of the noisy-neighbor problem. Imagine an apartment building where one tenant throws a party every night at full volume: without rules (limits), they ruin everyone else's experience. On a Linux system, the noisy neighbor might be a process that:
- Consumes 100% CPU, starving other processes
- Allocates all available memory, triggering the OOM killer
- Saturates disk I/O, making the system unresponsive
Control Groups (cgroups) solve this by letting you set resource limits on groups of processes. While namespaces provide isolation (hiding resources), cgroups provide allocation (limiting resources).
Analogy: Budget Allocation
- CPU quota = Time budget (hours employees can work)
- Memory limit = Office space (square footage allocated)
- I/O bandwidth = Shared equipment usage time
- PIDs limit = Headcount cap
Departments (process groups) must work within their budgets regardless of how much total resource exists.
cgroups Architecture
cgroups organize processes into a hierarchy where each node can have resource limits. Understanding this structure is key to effective container resource management.
cgroups v1
- Separate hierarchy per controller
- A process can be in different groups per controller
- More flexible but complex
- Legacy, but still widely used
cgroups v2
- Single unified hierarchy
- A process belongs to exactly one cgroup
- Controllers enabled per-cgroup
- Default on most modern distributions (kernel 5.x+)
Key Concepts
| Concept | Description |
|---|---|
| Hierarchy | Tree structure of cgroups (directories in /sys/fs/cgroup) |
| Controller | A resource type that can be limited (CPU, memory, I/O, PIDs) |
| Cgroup | A node in the hierarchy, a directory containing limit files |
| Task | A process or thread assigned to a cgroup |
The cgroup Filesystem
cgroups are controlled through a pseudo-filesystem, typically mounted at /sys/fs/cgroup:
    $ ls /sys/fs/cgroup/
    cgroup.controllers      cpu.pressure  memory.current
    cgroup.max.depth        cpu.stat      memory.max
    cgroup.max.descendants  io.max        memory.min
    cgroup.procs            io.pressure   pids.current
    cgroup.subtree_control  io.stat       pids.max
To limit a process, create a child cgroup and write values to its interface files (as root):
    # Create a cgroup
    mkdir /sys/fs/cgroup/myapp
    # Set memory limit to 512MB
    echo 536870912 > /sys/fs/cgroup/myapp/memory.max
    # Set CPU limit to 50% of one core
    echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
    # Add process to the cgroup
    echo $PID > /sys/fs/cgroup/myapp/cgroup.procs
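The memory limit above is written in raw bytes, which is easy to mistype. As a minimal sketch, a small helper can convert a human-readable size into the byte count that memory.max expects. The name `to_bytes` is hypothetical (it is not part of any cgroup tooling), and it assumes binary units, matching the 512MB = 536870912 figure used above.

```shell
# Hypothetical helper: convert sizes like 512M or 1G into bytes (binary units).
# A value without a suffix is passed through unchanged.
to_bytes() {
    case "$1" in
        *[kK]) echo $(( ${1%[kK]} * 1024 )) ;;
        *[mM]) echo $(( ${1%[mM]} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${1%[gG]} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

to_bytes 512M
```

On a live system the result could be written straight to the limit file, for example `to_bytes 512M > /sys/fs/cgroup/myapp/memory.max`.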
cgroups v1 vs v2
Linux has two versions of cgroups, with v2 being the modern default:
| Feature | cgroups v1 | cgroups v2 |
|---|---|---|
| Hierarchy | Separate per controller | Single unified |
| Process membership | Can differ per controller | One cgroup only |
| Delegation | Complex, error-prone | Clean subtree delegation |
| Default (2024+) | Legacy | Default |
| Docker support | Full | Full (Docker Engine 20.10+) |
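To find out which version a host is running, a common check is the filesystem type mounted at /sys/fs/cgroup: `stat -fc %T /sys/fs/cgroup` prints `cgroup2fs` on a unified v2 host and `tmpfs` on a v1 (or hybrid) host. The sketch below wraps that mapping in a hypothetical helper, `cgroup_version_from_fstype`, so the logic can be exercised without touching the live filesystem.

```shell
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup to a cgroup version.
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo v2 ;;       # unified hierarchy
        tmpfs)     echo v1 ;;       # v1 or hybrid layout
        *)         echo unknown ;;
    esac
}

# On a live host:
#   cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup)"
cgroup_version_from_fstype cgroup2fs
```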
Why Unified Hierarchy Matters
In v1, a process could be in /cpu/app for CPU limits but /memory/web for memory limits. This created confusion and made resource accounting inconsistent.
v2's unified hierarchy means a process is in exactly one cgroup, and that cgroup can have multiple controllers enabled. This is simpler to manage and reason about.
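    # v2: Enable controllers for a cgroup's children
    # (myapp itself must hold no processes: v2 does not allow processes in a
    # cgroup that distributes controllers to its children)
    echo "+cpu +memory +io" > /sys/fs/cgroup/myapp/cgroup.subtree_control
    # Now child cgroups can use these controllers
    mkdir /sys/fs/cgroup/myapp/worker1
    echo 268435456 > /sys/fs/cgroup/myapp/worker1/memory.max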
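A write to cgroup.subtree_control only succeeds for controllers that the same cgroup lists in its cgroup.controllers file. The sketch below, using a hypothetical helper named `has_controller`, checks a space-separated controller list (read from standard input or from a file argument) before you attempt the write.

```shell
# Hypothetical helper: succeed if the named controller appears in the
# space-separated list read from stdin or from the given file.
has_controller() {
    controller=$1
    shift
    for c in $(cat "$@"); do
        [ "$c" = "$controller" ] && return 0
    done
    return 1
}

# Example with a sample cgroup.controllers line:
if printf 'cpuset cpu io memory pids\n' | has_controller memory; then
    echo "memory controller available"
fi
```

On a live host the file form would be `has_controller memory /sys/fs/cgroup/myapp/cgroup.controllers`.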
Resource Controllers
CPU Controller
The CPU controller limits how much CPU time a cgroup can use. The key mechanism is CFS bandwidth throttling.
CPU Bandwidth Throttling (CFS Quota)
The quota/period pair controls CPU bandwidth: processes in the cgroup can use CPU freely until they exhaust the quota for the current period, then they are throttled until the next period begins.
Formula:
CPU% = quota_us / period_us × 100
A quota of 50000µs with 100000µs period = 50% of one CPU core.
How Docker Uses This
docker run --cpus=0.5 sets quota=50000, period=100000 (50% of one CPU)
docker run --cpu-period=100000 --cpu-quota=200000 allows using 2 CPU cores worth of time
CPU Settings
| File | Description | Example |
|---|---|---|
| cpu.max | Quota and period in µs | 50000 100000 = 50% |
| cpu.weight | Proportional share (1-10000) | 100 = default |
| cpu.pressure | PSI metrics (stall time) | Read-only |
Quota Math
    CPU cores usable = quota_us / period_us
    # Examples:
    50000/100000  = 0.5 cores (50% of one core)
    200000/100000 = 2.0 cores (can use 2 cores fully)
    max/100000    = unlimited (the default)
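The quota math can be inverted to produce a cpu.max line from a desired core count. This is a minimal sketch: the helper name `cpu_max_for_cores` is hypothetical, and the 100000µs default period is taken from the examples above.

```shell
# Hypothetical helper: print a "quota period" line for cpu.max.
# Usage: cpu_max_for_cores CORES [PERIOD_US]
cpu_max_for_cores() {
    # awk handles the fractional multiplication; +0.5 rounds to the nearest µs
    awk -v c="$1" -v p="${2:-100000}" 'BEGIN { printf "%d %d\n", c * p + 0.5, p }'
}

cpu_max_for_cores 0.5    # prints: 50000 100000
cpu_max_for_cores 2      # prints: 200000 100000
```

The output can be redirected into a cgroup's cpu.max file, for example `cpu_max_for_cores 0.5 > /sys/fs/cgroup/myapp/cpu.max`.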
Memory Controller
The memory controller limits RAM usage and handles memory pressure:
Memory Settings
| File | Description |
|---|---|
| memory.max | Hard limit (OOM kill if exceeded) |
| memory.high | Soft limit (throttle allocations) |
| memory.low | Memory protection (won't reclaim unless necessary) |
| memory.min | Guaranteed minimum (never reclaim) |
| memory.current | Current usage |
| memory.swap.max | Swap limit |
The OOM Killer
When a cgroup exceeds memory.max and can't reclaim pages:
- Kernel triggers the OOM killer
- OOM killer selects a process in the cgroup to kill
- Process is sent SIGKILL
- Memory is freed
This is exactly what happens when a Docker container runs out of memory!
    # Check if a cgroup has had OOM events
    cat /sys/fs/cgroup/docker/abc123/memory.events
    # oom 5
    # oom_kill 5
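For monitoring, the `oom_kill` counter in memory.events is usually the number of interest. A minimal sketch of extracting it, using a hypothetical helper named `oom_kill_count` that reads memory.events content from standard input or from a file argument:

```shell
# Hypothetical helper: print the oom_kill counter from memory.events content.
# Prints 0 if the key is absent.
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2; found = 1 } END { if (!found) print 0 }' "$@"
}

# Example with sample content in the memory.events format:
printf 'low 0\nhigh 0\nmax 12\noom 5\noom_kill 5\n' | oom_kill_count
```

Against a real cgroup, pass the file path instead: `oom_kill_count /sys/fs/cgroup/docker/abc123/memory.events`.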
I/O Controller
Limits disk bandwidth and IOPS:
| File | Description |
|---|---|
io.max | Bandwidth/IOPS limits per device |
io.weight | Proportional weight (1-10000) |
io.pressure | PSI stall metrics |
    # Limit to 10MB/s read, 5MB/s write on device 8:0
    echo "8:0 rbps=10485760 wbps=5242880" > io.max
    # Limit to 1000 read IOPS, 500 write IOPS
    echo "8:0 riops=1000 wiops=500" > io.max
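The byte values in io.max lines can be generated rather than computed by hand. A minimal sketch, assuming a hypothetical helper `io_max_line` that takes a major:minor device number plus read and write limits in MiB/s:

```shell
# Hypothetical helper: build an io.max bandwidth line.
# Usage: io_max_line MAJ:MIN READ_MIB_PER_S WRITE_MIB_PER_S
io_max_line() {
    dev=$1
    read_mib=$2
    write_mib=$3
    echo "$dev rbps=$(( read_mib * 1024 * 1024 )) wbps=$(( write_mib * 1024 * 1024 ))"
}

io_max_line 8:0 10 5    # prints: 8:0 rbps=10485760 wbps=5242880
```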
PIDs Controller
Prevents fork bombs by limiting the number of tasks (processes and threads) a cgroup can contain:
    # Limit to 100 processes
    echo 100 > /sys/fs/cgroup/myapp/pids.max
    # Check current count
    cat /sys/fs/cgroup/myapp/pids.current
    # 47
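Comparing pids.current against pids.max shows how close a cgroup is to its cap. A minimal sketch, using a hypothetical helper `pids_headroom`; note that pids.max holds the literal string `max` when no limit is set, so that case is handled separately.

```shell
# Hypothetical helper: how many more tasks fit under the limit.
# Usage: pids_headroom CURRENT LIMIT   (LIMIT may be the literal "max")
pids_headroom() {
    current=$1
    limit=$2
    if [ "$limit" = "max" ]; then
        echo unlimited
    else
        echo $(( limit - current ))
    fi
}

pids_headroom 47 100    # prints: 53
```

On a live host the arguments would come from the files themselves, for example `pids_headroom "$(cat /sys/fs/cgroup/myapp/pids.current)" "$(cat /sys/fs/cgroup/myapp/pids.max)"`.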
How Docker Uses cgroups
When you run docker run with resource flags:
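    # Resulting cgroup v2 settings:
    #   --cpus=0.5          -> cpu.max = "50000 100000"
    #   --memory=512m       -> memory.max = 536870912
    #   --memory-swap=512m  -> memory.swap.max = 0 (no swap)
    #   --pids-limit=100    -> pids.max = 100
    #   --device-read-bps   -> io.max
    # (comments cannot follow a trailing backslash, so they are listed here)
    docker run \
      --cpus=0.5 \
      --memory=512m \
      --memory-swap=512m \
      --pids-limit=100 \
      --device-read-bps /dev/sda:10mb \
      nginx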
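Docker creates a cgroup for the container and configures all these limits automatically. With the cgroupfs driver the cgroup lives at /sys/fs/cgroup/docker/<container-id>/; with the systemd driver (the default on cgroups v2 hosts) it is /sys/fs/cgroup/system.slice/docker-<container-id>.scope/.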
Inspect Docker Container cgroups
    # Parent cgroup, if one was set with --cgroup-parent (often empty)
    docker inspect --format '{{.HostConfig.CgroupParent}}' mycontainer
    # The container's actual cgroup path, read from its init process
    cat /proc/$(docker inspect --format '{{.State.Pid}}' mycontainer)/cgroup
    # View limits for a container (path shown is for the cgroupfs driver)
    CONTAINER_ID=$(docker inspect --format '{{.Id}}' mycontainer)
    cat /sys/fs/cgroup/docker/$CONTAINER_ID/memory.max
    cat /sys/fs/cgroup/docker/$CONTAINER_ID/cpu.max
    # Real-time resource usage
    docker stats mycontainer
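On a cgroups v2 host, /proc/<pid>/cgroup contains a single line of the form `0::/path`, and the path is relative to the /sys/fs/cgroup mount. A minimal sketch of turning that line into a directory you can inspect, using a hypothetical helper `cgroup_v2_dir` and assuming the default mount point:

```shell
# Hypothetical helper: convert the v2 entry of /proc/<pid>/cgroup (read from
# stdin or a file) into the matching directory under /sys/fs/cgroup.
cgroup_v2_dir() {
    sed -n 's|^0::|/sys/fs/cgroup|p' "$@"
}

# Example with a sample line:
printf '0::/mygroup/worker\n' | cgroup_v2_dir
# prints: /sys/fs/cgroup/mygroup/worker
```

For a running container this could be combined with the PID lookup, for example `cgroup_v2_dir /proc/$PID/cgroup`, where `$PID` holds the container's init PID.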
systemd and cgroups
systemd uses cgroups v2 extensively for service management. Every service, scope, and slice unit gets its own cgroup:
    # View systemd cgroup structure
    systemd-cgls
    # Cgroup for a service
    systemctl show docker.service --property=ControlGroup
    # ControlGroup=/system.slice/docker.service
    # Resource usage
    systemctl status docker.service
    # Memory: 150.4M
    # CPU: 2.341s
Setting Limits via systemd
    # /etc/systemd/system/myapp.service
    [Service]
    ExecStart=/usr/bin/myapp
    MemoryMax=512M
    CPUQuota=50%
    TasksMax=100
    IOWeight=50
Pressure Stall Information (PSI)
Linux 4.20+ provides PSI metrics showing when processes are stalled waiting for resources:
    cat /sys/fs/cgroup/docker/abc123/cpu.pressure
    # some avg10=0.00 avg60=0.00 avg300=0.00 total=123456
    # full avg10=0.00 avg60=0.00 avg300=0.00 total=0
    cat /sys/fs/cgroup/docker/abc123/memory.pressure
    # some avg10=5.23 avg60=3.15 avg300=1.82 total=987654321
    # full avg10=2.11 avg60=1.03 avg300=0.54 total=123456789
| Metric | Meaning |
|---|---|
| some | Percentage of time at least one task is stalled |
| full | Percentage of time all non-idle tasks are stalled at once |
| avg10/60/300 | Running averages over 10s, 60s, and 300s (5 min) |
| total | Cumulative stall time in µs |
PSI is invaluable for detecting resource contention before it becomes critical.
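An alerting script typically needs just one number from a pressure file. A minimal sketch of pulling the 10-second average out of PSI output, using a hypothetical helper `psi_avg10` that takes the line kind (`some` or `full`) and reads the pressure data from standard input or from a file argument:

```shell
# Hypothetical helper: print the avg10 value of the "some" or "full" line.
# Usage: psi_avg10 some|full [FILE]
psi_avg10() {
    kind=$1
    shift
    awk -v kind="$kind" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$@"
}

# Example with sample lines in the memory.pressure format:
printf 'some avg10=5.23 avg60=3.15 avg300=1.82 total=987654321\nfull avg10=2.11 avg60=1.03 avg300=0.54 total=123456789\n' |
    psi_avg10 some
# prints: 5.23
```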
Practical cgroup Management
Common cgroup Commands
    # Create a cgroup
    mkdir /sys/fs/cgroup/mygroup
    # Enable controllers for child cgroups
    echo "+cpu +memory +io +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control
    # Create a child cgroup
    mkdir /sys/fs/cgroup/mygroup/worker
    # Set limits (cpu.max takes "quota period"; this allows one full core)
    echo "100000 100000" > /sys/fs/cgroup/mygroup/worker/cpu.max
    echo 268435456 > /sys/fs/cgroup/mygroup/worker/memory.max
    echo 50 > /sys/fs/cgroup/mygroup/worker/pids.max
    # Add current shell to the cgroup
    echo $$ > /sys/fs/cgroup/mygroup/worker/cgroup.procs
    # View current cgroup
    cat /proc/self/cgroup
    # 0::/mygroup/worker
    # View processes in a cgroup
    cat /sys/fs/cgroup/mygroup/worker/cgroup.procs
    # View cgroup events (OOM kills, etc.)
    cat /sys/fs/cgroup/mygroup/worker/memory.events
    # Remove a cgroup (must be empty)
    rmdir /sys/fs/cgroup/mygroup/worker
Delegation (Unprivileged cgroup Management)
cgroups v2 allows delegating subtrees to non-root users:
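    # Create a cgroup for user 1000 and delegate it: hand over the directory
    # plus cgroup.procs, cgroup.threads, and cgroup.subtree_control only,
    # so the user cannot rewrite the limits set on the delegated cgroup itself
    mkdir /sys/fs/cgroup/user-1000
    chown 1000:1000 /sys/fs/cgroup/user-1000 \
      /sys/fs/cgroup/user-1000/cgroup.procs \
      /sys/fs/cgroup/user-1000/cgroup.threads \
      /sys/fs/cgroup/user-1000/cgroup.subtree_control
    # User can now create and manage child cgroups
    su - user1000
    mkdir /sys/fs/cgroup/user-1000/myapp
    # Moving a process succeeds only if it already runs inside the delegated subtree
    echo $$ > /sys/fs/cgroup/user-1000/myapp/cgroup.procs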
This enables rootless containers (like Podman) to manage their own resource limits.
Common Pitfalls
Watch Out For
- Kernel memory accounting
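Page cache and kernel structures (slab, kernel stacks, socket buffers) are charged to the cgroup alongside application memory. Page cache is usually reclaimable, but tmpfs files, dirty pages awaiting writeback, and kernel allocations may not be, so a 512MB container can be OOM-killed while its heap is only 400MB.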
- CPU throttling latency
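A container limited to 10% CPU with the default 100ms period can run for 10ms and then stall for up to 90ms until the next period starts. For latency-sensitive apps, raise the quota or use a shorter period so that each stall is shorter.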
- I/O limits and buffered I/O
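On cgroups v1, I/O limits applied only to direct and synchronous I/O. On v2, buffered writes are attributed to the originating cgroup at writeback time, provided both the memory and io controllers are enabled and the filesystem supports cgroup writeback. Buffered writes still go to page cache first, so throughput may exceed the limit temporarily.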
- cgroups v1/v2 mixing
Don't mix v1 and v2 for the same controller. Modern systems should use v2 exclusively.
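To put a number on the CPU throttling pitfall: a single-threaded task that burns its whole quota at the start of a period waits for the remainder of that period. The sketch below estimates that worst-case stall under this simplified model; the helper name `max_stall_us` is hypothetical, and real workloads with several threads can exhaust the quota sooner.

```shell
# Hypothetical helper: worst-case stall per period for one busy thread.
# Usage: max_stall_us QUOTA_US PERIOD_US   (QUOTA_US may be the literal "max")
max_stall_us() {
    quota=$1
    period=$2
    # No quota, or a quota of at least one full core: a single thread never throttles
    if [ "$quota" = "max" ] || [ "$quota" -ge "$period" ]; then
        echo 0
    else
        echo $(( period - quota ))
    fi
}

max_stall_us 10000 100000    # prints: 90000
max_stall_us 5000 50000      # prints: 45000
```

Both calls describe a 10% limit; halving the period halves the longest stall.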
Essential Takeaways
cgroups limit resources, namespaces isolate visibility - both needed for containers
v2 unified hierarchy is the modern default - one cgroup per process
CPU quota/period controls bandwidth: 50000/100000 = 50% of one core
memory.max triggers OOM killer when exceeded - the "out of memory" container crash
Everything is files in /sys/fs/cgroup - read/write to control limits
PSI metrics reveal resource pressure before failures occur
systemd manages cgroups for services automatically via unit files
Docker flags like --cpus, --memory translate directly to cgroup settings
When to use cgroups (and when nice or ulimit is enough)
cgroups are the right answer when you need enforced, hierarchical, aggregable limits on a group of processes — anything else is a softer best-effort. They're overkill when you only need to deprioritise one process or cap a single resource for a single user.
Reach for cgroups when:
- You're running containers: Docker, Podman, systemd-nspawn, and Kubernetes all use cgroups under the hood. If you're tuning container CPU or memory limits, you're tuning cgroups.
- You need hard memory caps that kill the offender instead of degrading the whole machine: memory.max triggers the OOM killer scoped to the cgroup, not host-wide.
- You're running multi-tenant workloads on shared hardware (Spark executors, CI runners, model-serving replicas) and need each tenant's CPU stolen back from a noisy neighbour.
- You want per-service accounting: systemd's CPUAccounting= and MemoryAccounting= flags create cgroups so systemd-cgtop can show what each unit costs.
- You need CPU pinning plus bandwidth limits together: cpuset.cpus plus cpu.max lets you say "this workload runs on cores 8-15 and never uses more than 4 cores' worth of time."
Stay with nice / ulimit / taskset when:
- You just want a single process to yield CPU under contention: nice -n 19 ./batch_job is a one-line answer that doesn't need a cgroup hierarchy.
- You're enforcing per-user file-descriptor or process-count caps: /etc/security/limits.conf handles that without root-owned cgroup tooling.
- You're on a kernel that predates cgroups v2 and don't want to deal with the v1 controller-per-mountpoint mess. Most distro defaults moved to v2 around 2022, but legacy hosts exist.
- The workload is a single short-lived process: cgroup setup costs more in operator time than the run itself takes.
- You need CPU affinity but not bandwidth limits: taskset -c 8-15 ./workload is simpler and doesn't require a cgroup.
The honest default on modern Linux: if you're running services through systemd, you're already getting cgroups v2 for free (one cgroup per service, one scope per session). Tune the unit file with CPUQuota=, MemoryMax=, and IOWeight= instead of writing to files under /sys/fs/cgroup by hand.
