The Noisy Neighbor Problem
Imagine an apartment building where one tenant decides to throw a party every night with music at full volume. Without any rules (limits), they ruin everyone else's experience. On a Linux system, the "noisy neighbor" might be a process that:
- Consumes 100% CPU, starving other processes
- Allocates all available memory, triggering the OOM killer
- Saturates disk I/O, making the system unresponsive
Control Groups (cgroups) solve this by letting you set resource limits on groups of processes. While namespaces provide isolation (hiding resources), cgroups provide allocation (limiting resources).
Analogy: Budget Allocation
Think of cgroups like departmental budgets in a company:
- CPU quota = Time budget (hours employees can work)
- Memory limit = Office space (square footage allocated)
- I/O bandwidth = Shared equipment usage time
- PIDs limit = Headcount cap
Departments (process groups) must work within their budgets regardless of how much total resource exists.
cgroups Architecture
cgroups organize processes into a hierarchy where each node can have resource limits. Understanding this structure is key to effective container resource management.
The two versions organize this hierarchy differently: v1 keeps a separate hierarchy per controller, while v2 uses a single unified tree.
cgroups v1
- Separate hierarchy per controller
- Process can be in different groups per controller
- More flexible but complex
- Legacy, but still widely used
cgroups v2
- Single unified hierarchy
- Process belongs to exactly one cgroup
- Controllers enabled per-cgroup
- Default on modern distributions
Key Concepts
| Concept | Description |
|---|---|
| Hierarchy | Tree structure of cgroups (directories in /sys/fs/cgroup) |
| Controller | A resource type that can be limited (CPU, memory, I/O, PIDs) |
| Cgroup | A node in the hierarchy, a directory containing limit files |
| Task | A process or thread assigned to a cgroup |
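A quick way to see these pieces on a real system — a rough sketch, assuming the default v2 mount at /sys/fs/cgroup; the exact output will differ from machine to machine:

```
# Which cgroup filesystem is mounted (cgroup2 = the v2 unified hierarchy)
mount -t cgroup2

# Controllers available at the root of the hierarchy
cat /sys/fs/cgroup/cgroup.controllers
# cpuset cpu io memory hugetlb pids

# Which cgroup the current shell belongs to
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-3.scope
```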
The cgroup Filesystem
cgroups are controlled through a pseudo-filesystem, typically mounted at /sys/fs/cgroup:
```
$ ls /sys/fs/cgroup/
cgroup.controllers      cpu.pressure   memory.current
cgroup.max.depth        cpu.stat       memory.max
cgroup.max.descendants  io.max         memory.min
cgroup.procs            io.pressure    pids.current
cgroup.subtree_control  io.stat        pids.max
```
To limit a process, you write values to these files:
```
# Create a cgroup
mkdir /sys/fs/cgroup/myapp

# Set memory limit to 512MB
echo 536870912 > /sys/fs/cgroup/myapp/memory.max

# Set CPU limit to 50% of one core
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max

# Add process to the cgroup
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs
```
cgroups v1 vs v2
Linux has two versions of cgroups, with v2 being the modern default:
| Feature | cgroups v1 | cgroups v2 |
|---|---|---|
| Hierarchy | Separate per controller | Single unified |
| Process membership | Can differ per controller | One cgroup only |
| Delegation | Complex, error-prone | Clean subtree delegation |
| Status (2024+) | Legacy | Default |
| Docker support | Full | Full (recent versions) |
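To check which version a particular machine is running, one common technique is to look at the filesystem type mounted at /sys/fs/cgroup:

```
# cgroup2fs = v2 unified hierarchy; tmpfs = v1 (per-controller hierarchies mounted beneath it)
stat -fc %T /sys/fs/cgroup/

# Or list all cgroup-related mounts
grep cgroup /proc/mounts
```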
Why Unified Hierarchy Matters
In v1, a process could be in /cpu/app for CPU limits but /memory/web for memory limits. This created confusion and made resource accounting inconsistent.
v2's unified hierarchy means a process is in exactly one cgroup, and that cgroup can have multiple controllers enabled. This is simpler to manage and reason about.
```
# v2: Enable controllers for a cgroup
echo "+cpu +memory +io" > /sys/fs/cgroup/myapp/cgroup.subtree_control

# Now child cgroups can use these controllers
mkdir /sys/fs/cgroup/myapp/worker1
echo 268435456 > /sys/fs/cgroup/myapp/worker1/memory.max
```
Resource Controllers
CPU Controller
The CPU controller limits how much CPU time a cgroup can use. The key mechanism is CFS bandwidth throttling.
CPU Bandwidth Throttling (CFS Quota)
A process can use the CPU freely until it exhausts its quota for the current period; it is then throttled until the next period begins.
Formula:
CPU% = quota_us / period_us × 100
A quota of 50000µs with 100000µs period = 50% of one CPU core.
How Docker Uses This
- docker run --cpus=0.5 sets quota=50000, period=100000 (50% of one CPU)
- docker run --cpu-period=100000 --cpu-quota=200000 allows using 2 CPU cores' worth of time
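You can watch throttling happen by reproducing the --cpus=0.5 case by hand. A minimal sketch, assuming a v2 hierarchy at /sys/fs/cgroup, root privileges, and that the cpu controller is enabled for child cgroups; the cgroup name cputest and the counter values are illustrative:

```
# If cpu.max does not appear in the new cgroup, enable the controller first:
#   echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/cputest
echo "50000 100000" > /sys/fs/cgroup/cputest/cpu.max   # 50% of one core

yes > /dev/null &                                # any CPU-bound process will do
echo $! > /sys/fs/cgroup/cputest/cgroup.procs    # move it into the cgroup

sleep 5
cat /sys/fs/cgroup/cputest/cpu.stat
# nr_periods 50
# nr_throttled 49        <- hit the quota in nearly every period
# throttled_usec 2450000

kill $!
```

While it runs, top should show the process hovering around 50% of one CPU.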
CPU Settings
| File | Description | Example |
|---|---|---|
| cpu.max | Quota and period in µs | 50000 100000 = 50% |
| cpu.weight | Proportional share (1-10000) | 100 = default |
| cpu.pressure | PSI metrics (stall time) | Read-only |
Quota Math
```
CPU cores usable = quota_us / period_us

# Examples:
50000/100000   = 0.5 cores (50% of one core)
200000/100000  = 2.0 cores (can use 2 cores fully)
max/100000     = unlimited (the default)
```
Memory Controller
The memory controller limits RAM usage and handles memory pressure.
Memory Settings
| File | Description |
|---|---|
memory.max | Hard limit (OOM kill if exceeded) |
memory.high | Soft limit (throttle allocations) |
memory.low | Memory protection (won't reclaim unless necessary) |
memory.min | Guaranteed minimum (never reclaim) |
memory.current | Current usage |
memory.swap.max | Swap limit |
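A sketch of how memory.high and memory.max are typically combined — the memtest name and the 400M/512M values are arbitrary, and the memory controller must be enabled for child cgroups:

```
mkdir /sys/fs/cgroup/memtest

echo 400M > /sys/fs/cgroup/memtest/memory.high   # soft limit: heavy reclaim and throttling above this
echo 512M > /sys/fs/cgroup/memtest/memory.max    # hard limit: OOM kill if usage can't be kept below

echo $$ > /sys/fs/cgroup/memtest/cgroup.procs    # move the current shell (and its children) in

# Watch usage and limit events while a workload runs
cat /sys/fs/cgroup/memtest/memory.current
cat /sys/fs/cgroup/memtest/memory.events         # counters: low, high, max, oom, oom_kill
```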
The OOM Killer
When a cgroup exceeds memory.max and can't reclaim pages:
- Kernel triggers the OOM killer
- OOM killer selects a process in the cgroup to kill
- Process is sent SIGKILL
- Memory is freed
This is exactly what happens when a Docker container runs out of memory!
```
# Check if a cgroup has had OOM events
cat /sys/fs/cgroup/docker/abc123/memory.events
# oom 5
# oom_kill 5
```
I/O Controller
Limits disk bandwidth and IOPS:
| File | Description |
|---|---|
io.max | Bandwidth/IOPS limits per device |
io.weight | Proportional weight (1-10000) |
io.pressure | PSI stall metrics |
```
# Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > io.max

# Limit to 1000 read IOPS, 500 write IOPS
echo "8:0 riops=1000 wiops=500" > io.max
```
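The 8:0 prefix is the block device's major:minor number. One way to look it up (output here is illustrative):

```
# The MAJ:MIN column is the identifier io.max expects
lsblk -o NAME,MAJ:MIN,TYPE
# NAME   MAJ:MIN TYPE
# sda    8:0     disk
# └─sda1 8:1     part

# Or read it straight from the device node (major, minor appear before the date)
ls -l /dev/sda
# brw-rw---- 1 root disk 8, 0 Jan  1 00:00 /dev/sda
```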
PIDs Controller
Prevents fork bombs by limiting the number of processes:
```
# Limit to 100 processes
echo 100 > /sys/fs/cgroup/myapp/pids.max

# Check current count
cat /sys/fs/cgroup/myapp/pids.current
# 47
```
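Once pids.current reaches pids.max, further fork()/clone() calls in the cgroup fail with EAGAIN; nothing gets killed. From a shell inside the limited cgroup it looks roughly like this (the exact error text depends on the shell):

```
sleep 1 &
# bash: fork: retry: Resource temporarily unavailable

# pids.events counts how many forks were refused because of the limit
cat /sys/fs/cgroup/myapp/pids.events
# max 12
```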
How Docker Uses cgroups
When you run docker run with resource flags:
```
docker run \
  --cpus=0.5 \                         # cpu.max = "50000 100000"
  --memory=512m \                      # memory.max = 536870912
  --memory-swap=512m \                 # memory.swap.max = 0 (no swap)
  --pids-limit=100 \                   # pids.max = 100
  --device-read-bps /dev/sda:10mb \    # io.max
  nginx
```
Docker creates a cgroup at /sys/fs/cgroup/docker/<container-id>/ and configures all these limits automatically.
📋 Inspect Docker Container cgroups
```
# Find container's cgroup path
docker inspect --format '{{.HostConfig.CgroupParent}}' mycontainer

# View all limits for a container
CONTAINER_ID=$(docker inspect --format '{{.Id}}' mycontainer)
cat /sys/fs/cgroup/docker/$CONTAINER_ID/memory.max
cat /sys/fs/cgroup/docker/$CONTAINER_ID/cpu.max

# Real-time resource usage
docker stats mycontainer

# Detailed cgroup info
cat /proc/$(docker inspect --format '{{.State.Pid}}' mycontainer)/cgroup
```
systemd and cgroups
systemd uses cgroups v2 extensively for service management. Every unit gets its own cgroup:
```
# View systemd cgroup structure
systemd-cgls

# Cgroup for a service
systemctl show docker.service --property=ControlGroup
# ControlGroup=/system.slice/docker.service

# Resource usage
systemctl status docker.service
# Memory: 150.4M
# CPU: 2.341s
```
Setting Limits via systemd
```
# /etc/systemd/system/myapp.service
[Service]
ExecStart=/usr/bin/myapp
MemoryMax=512M
CPUQuota=50%
TasksMax=100
IOWeight=50
```
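The same properties work with systemd-run, which creates a transient unit (and therefore a cgroup) on the fly — handy for experimenting without writing a unit file. The command names below are placeholders for whatever you want to constrain:

```
# Run a command in its own transient scope with resource limits applied
systemd-run --scope -p MemoryMax=512M -p CPUQuota=50% -p TasksMax=100 -- mycommand --args

# Or launch it as a background transient service
systemd-run --unit=myapp-test -p MemoryMax=512M /usr/bin/myapp
```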
Pressure Stall Information (PSI)
Linux 4.20+ provides PSI metrics showing when processes are stalled waiting for resources:
```
cat /sys/fs/cgroup/docker/abc123/cpu.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=123456
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

cat /sys/fs/cgroup/docker/abc123/memory.pressure
# some avg10=5.23 avg60=3.15 avg300=1.82 total=987654321
# full avg10=2.11 avg60=1.03 avg300=0.54 total=123456789
```
| Metric | Meaning |
|---|---|
| some | Percentage of time some tasks are stalled |
| full | Percentage of time all tasks are stalled |
| avg10/60/300 | Averages over 10s, 60s, 5min |
PSI is invaluable for detecting resource contention before it becomes critical.
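One simple way to act on PSI is a polling loop that warns when the short-term average crosses a threshold — a rough sketch, with an arbitrary cgroup path and a made-up 10% threshold:

```
#!/bin/bash
# Warn when this cgroup spends >10% of the last 10s with some tasks stalled on memory
CGROUP=/sys/fs/cgroup/docker/abc123
THRESHOLD=10.0

while sleep 10; do
    # Pull the "some avg10" value out of memory.pressure
    avg10=$(awk '/^some/ { sub("avg10=", "", $2); print $2 }' "$CGROUP/memory.pressure")
    if awk -v a="$avg10" -v t="$THRESHOLD" 'BEGIN { exit !(a+0 > t+0) }'; then
        echo "memory pressure high: some avg10=${avg10}%" >&2
    fi
done
```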
Practical cgroup Management
📋 Common cgroup Commands
```
# Create a cgroup
mkdir /sys/fs/cgroup/mygroup

# Enable controllers for child cgroups
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control

# Create a child cgroup
mkdir /sys/fs/cgroup/mygroup/worker

# Set limits
echo 100000 > /sys/fs/cgroup/mygroup/worker/cpu.max
echo 268435456 > /sys/fs/cgroup/mygroup/worker/memory.max
echo 50 > /sys/fs/cgroup/mygroup/worker/pids.max

# Add current shell to the cgroup
echo $$ > /sys/fs/cgroup/mygroup/worker/cgroup.procs

# View current cgroup
cat /proc/self/cgroup
# 0::/mygroup/worker

# View processes in a cgroup
cat /sys/fs/cgroup/mygroup/worker/cgroup.procs

# View cgroup events (OOM kills, etc.)
cat /sys/fs/cgroup/mygroup/worker/memory.events

# Remove a cgroup (must be empty)
rmdir /sys/fs/cgroup/mygroup/worker
```
Delegation (Unprivileged cgroup Management)
cgroups v2 allows delegating subtrees to non-root users:
```
# Create a cgroup for user 1000
mkdir /sys/fs/cgroup/user-1000
chown -R 1000:1000 /sys/fs/cgroup/user-1000

# User can now create and manage child cgroups
su - user1000
mkdir /sys/fs/cgroup/user-1000/myapp
echo $$ > /sys/fs/cgroup/user-1000/myapp/cgroup.procs
```
This enables rootless containers (like Podman) to manage their own resource limits.
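On systemd-based systems you usually don't chown cgroup directories yourself: systemd already delegates a subtree to each logged-in user's user@.service, and rootless runtimes build on top of it. To see what has been delegated to your user (the path assumes the standard systemd layout):

```
# Controllers delegated to the current user's systemd instance
# (the set varies with systemd version and configuration)
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers

# Rootless Podman reports which cgroup version and manager it is using
podman info 2>/dev/null | grep -i cgroup
```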
Common Pitfalls
Watch Out For
- Page cache and kernel structures count toward memory limits. A container with a 512MB limit can hit OOM even though its heap is only 400MB, because buffered file I/O is also charged to the cgroup (see the memory.stat breakdown below).
- A container throttled to 10% CPU with the default 100ms period can stall for up to ~90ms at each period boundary once its quota runs out. For latency-sensitive apps, use a shorter period or a larger quota.
- I/O limits are most reliable for direct I/O. Buffered writes land in the page cache first and may temporarily exceed the limits before writeback throttling catches up.
- Don't mix v1 and v2 for the same controller. Modern systems should use v2 exclusively.
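To see how much of a cgroup's memory charge is page cache versus anonymous memory (the first pitfall above), memory.stat breaks the total down — the container ID and numbers here are illustrative:

```
# anon = heap/stack pages, file = page cache; both count toward memory.max
grep -E '^(anon|file|slab) ' /sys/fs/cgroup/docker/abc123/memory.stat
# anon 268435456
# file 134217728
# slab 8388608
```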
Essential Takeaways
- Namespaces isolate what processes can see; cgroups limit what they can use.
- cgroups v2's unified hierarchy is the modern default: each process lives in exactly one cgroup, with controllers enabled per subtree.
- The main controllers are cpu (quota/period and weight), memory (max/high/low/min), io (bandwidth and IOPS), and pids (process count).
- Docker and systemd configure these limits for you; the files under /sys/fs/cgroup show exactly what was set.
- PSI metrics (cpu/memory/io.pressure) reveal resource contention before it becomes critical.
Related Concepts
- Linux Namespaces: Isolation for process views (complementary to cgroups)
- Containers Under the Hood: How namespaces + cgroups create containers
- Memory Management: How Linux manages memory and page cache
- Process Management: Understanding processes and scheduling
