Containers Under the Hood: From Primitives to Docker

Discover how containers work by combining namespaces, cgroups, and OverlayFS. Build a mental model of Docker internals through interactive visualizations.


What is a Container, Really?

Here's a statement that surprises many developers: A container is just a Linux process.

There's no special "container" system call. No kernel module named "docker". When you run docker run nginx, you're ultimately just running the nginx process with some clever configuration:

  • Namespaces make it see an isolated system
  • cgroups limit what resources it can use
  • OverlayFS gives it its own filesystem
  • Security features restrict what it can do

That's it. A container is a regular process, wrapped in isolation and limits.
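
You can get a feel for this with nothing but the unshare tool from util-linux. A rough sketch (requires root; flag availability depends on your util-linux version):

# Give a plain shell its own PID, mount, UTS and network namespaces.
# --mount-proc remounts /proc so tools like ps only see the new PID namespace.
sudo unshare --pid --fork --mount --uts --net --mount-proc /bin/sh

# Inside that shell:
ps aux          # only this shell and ps are visible
hostname demo   # changes the hostname in the new UTS namespace only
ip link         # only a loopback interface, and it's down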

Analogy: The Escape Room

Imagine putting someone in an escape room:

  • Namespaces = The room's walls (they can't see the outside building)
  • cgroups = A time limit and item restrictions (limited resources)
  • OverlayFS = Props and furniture (a curated environment)
  • seccomp/capabilities = Rules about what they can touch

The person is still in the same building (kernel), but their experience is completely controlled.

The Container Stack

Before diving into primitives, let's understand the layers involved when you run docker run:

┌─────────────────────────────────────────┐
│            Your Application             │
├─────────────────────────────────────────┤
│         Container Image (OCI)           │
├─────────────────────────────────────────┤
│    High-level Runtime (containerd)      │  ← Manages lifecycle, images
├─────────────────────────────────────────┤
│       Low-level Runtime (runc)          │  ← Actually creates containers
├─────────────────────────────────────────┤
│   Linux Kernel (namespaces, cgroups)    │
└─────────────────────────────────────────┘
Component            Role                           Example
OCI Image            Filesystem layers + config     nginx:latest
High-level runtime   Image management, lifecycle    containerd, CRI-O
Low-level runtime    Create/run containers          runc, crun, kata
Kernel primitives    Actual isolation               namespaces, cgroups
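
You can see these layers on a machine with Docker installed. For example (output shape varies by Docker version):

# Which runtimes does this Docker engine know about?
docker info | grep -i runtime
#  Runtimes: runc ...
#  Default Runtime: runc

# containerd runs as its own daemon alongside dockerd
ps -e | grep containerd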

Building a Container from Scratch

Here is how a container runtime (like runc) creates a container from scratch, step by step. Each step uses specific Linux primitives:

  1. Clone with namespaces: create the child process with new namespaces using the clone() syscall
  2. Configure cgroups: create a cgroup and set resource limits for the container
  3. Set up the root filesystem: create an overlay filesystem from the image layers
  4. Pivot root: switch to the container's root filesystem and hide the host filesystem
  5. Set up networking: create a veth pair and connect it to the bridge
  6. Apply security: drop capabilities and apply the seccomp filter
  7. Execute the entrypoint: replace the init process with the container's application


The Simplified Flow

# What docker run actually does (simplified)
1. Pull image layers           → /var/lib/docker/overlay2/
2. Create container directory
3. fork() → clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...)
4. Set up cgroups              → /sys/fs/cgroup/docker/<id>/
5. Mount overlay filesystem    → /var/lib/docker/overlay2/<id>/merged
6. pivot_root()                → switch root to container filesystem
7. Configure network           (veth pair → bridge)
8. Apply security              (drop capabilities, seccomp)
9. execve()                    → run the application
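
You can approximate several of these steps by hand with standard tools. A rough sketch (requires root and an extracted root filesystem at ./rootfs; real runtimes use pivot_root rather than chroot and add cgroups, networking, and seccomp on top):

# Get a root filesystem to play with
mkdir rootfs
docker export $(docker create alpine) | tar -C rootfs -xf -

# New PID, mount, UTS, IPC and network namespaces; mount a fresh /proc
# inside the rootfs, then switch into it
sudo unshare --pid --fork --mount --uts --ipc --net \
     --mount-proc=rootfs/proc \
     chroot rootfs /bin/sh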

Container Image Layers

Docker images use a layered filesystem, which enables efficient storage and fast distribution. OverlayFS merges the stacked layers into a single view, and writes go to the top, writable layer (Copy-on-Write). A typical Python application image might look like this:

Container layer (writable, upperdir)
  • Empty: modifications will appear here

App layer (read-only)
  • /app/main.py (2 KB)
  • /app/requirements.txt (156 B)
  • /app/config.json (512 B)

Python layer (read-only)
  • /usr/bin/python3 (6.2 MB)
  • /usr/lib/python3.11/os.py (38 KB)
  • /usr/lib/python3.11/json/__init__.py (12 KB)

Base layer, alpine:3.18 (read-only)
  • /bin/sh (120 KB)
  • /etc/alpine-release (6 B)
  • /lib/libc.musl-x86_64.so.1 (800 KB)

Key insight: Image layers are never modified. When you "change" a file, OverlayFS copies it to the container layer first (Copy-on-Write). This is why container images can be shared between many containers - they all use the same read-only layers!
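
You can watch Copy-on-Write happen from the outside with docker diff, which lists files the container has added (A), changed (C), or deleted (D) relative to its image. The output will look something like this:

docker run -d --name demo nginx
docker exec demo sh -c 'echo "# tweak" >> /etc/nginx/nginx.conf'
docker diff demo
# C /etc/nginx
# C /etc/nginx/nginx.conf
# ...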

How OverlayFS Works

OverlayFS merges multiple directories into a single unified view:

┌─────────────────────────────────────────┐
│      Merged View (container /)          │  ← What process sees
├─────────────────────────────────────────┤
│      Upper Layer (container writes)     │  ← Writable
├─────────────────────────────────────────┤
│      Lower Layer 3 (app code)           │  ← Read-only
├─────────────────────────────────────────┤
│      Lower Layer 2 (python)             │  ← Read-only
├─────────────────────────────────────────┤
│      Lower Layer 1 (alpine)             │  ← Read-only
└─────────────────────────────────────────┘

Key concepts:

Concept          Description
Copy-on-Write    Modifying a file copies it to the upper layer first
Whiteout files   Special files that "delete" lower layer files
Layer sharing    Multiple containers share the same lower layers
Image layers     Created by each Dockerfile instruction
# Mount an overlay filesystem manually
mount -t overlay overlay \
  -o lowerdir=/layer3:/layer2:/layer1,upperdir=/container/upper,workdir=/container/work \
  /merged
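
Continuing that example (the paths and files are the hypothetical ones used above), you can see copy-up and whiteouts directly:

# Modify a file that lives in a lower layer: OverlayFS copies it up first
echo "hello" > /merged/app/config.json
ls /container/upper/app/            # config.json now exists in upperdir

# Delete a lower-layer file: OverlayFS records a "whiteout" in upperdir
rm /merged/app/main.py
ls -l /container/upper/app/main.py  # character device 0,0 = whiteout marker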

Container Networking

Containers get network isolation via network namespaces, but need connectivity. Docker uses several patterns:

Bridge Networking (Default)

┌──────────────────────────────────────────────────────────┐
│                          Host                            │
│  ┌───────────┐    ┌───────────┐        ┌─────────────┐   │
│  │ Container │    │ Container │        │    Host     │   │
│  │   eth0    │    │   eth0    │        │   Network   │   │
│  │172.17.0.2 │    │172.17.0.3 │        │192.168.1.100│   │
│  └─────┬─────┘    └─────┬─────┘        └──────┬──────┘   │
│        │ veth pair      │ veth pair           │          │
│  ┌─────┴────────────────┴──────┐              │          │
│  │       docker0 (bridge)      │     NAT      │          │
│  │         172.17.0.1          ├──(iptables)──┤          │
│  └─────────────────────────────┘              │          │
└───────────────────────────────────────────────┼──────────┘
                                                │
                                            Internet
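
A rough sketch of what Docker does when it wires a container into docker0, done by hand with iproute2 (names like demo-ns, veth-host, and veth-ctr are arbitrary; requires root and an existing docker0 bridge):

# Network namespace standing in for the container
ip netns add demo-ns

# veth pair: one end stays on the host, the other moves into the namespace
ip link add veth-host type veth peer name veth-ctr
ip link set veth-ctr netns demo-ns

# Host end: attach to the bridge and bring it up
ip link set veth-host master docker0
ip link set veth-host up

# Container end: address, link up, default route via the bridge
ip netns exec demo-ns ip addr add 172.17.0.42/16 dev veth-ctr
ip netns exec demo-ns ip link set veth-ctr up
ip netns exec demo-ns ip route add default via 172.17.0.1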

Port Mapping

When you run docker run -p 8080:80 nginx:

  1. Docker creates an iptables DNAT rule
  2. Traffic to host:8080 redirects to container:80
  3. Container responds through same path
# View Docker's iptables rules
iptables -t nat -L DOCKER -n
# DNAT  tcp  --  0.0.0.0/0  0.0.0.0/0  tcp dpt:8080 to:172.17.0.2:80

Container vs Virtual Machine

Understand when to use each technology:

Both have their place - the choice depends on your requirements.

Virtual Machine: full hardware virtualization
  • Hypervisor (VMware, KVM, Hyper-V)
  • Complete guest OS kernel
  • Emulated hardware (CPU, RAM, NIC)
  • Each VM has its own kernel

Container: OS-level virtualization
  • Container runtime (Docker, containerd)
  • Shared host kernel
  • Namespace isolation (PID, NET, MNT)
  • cgroups for resource limits

VM Architecture Stack

App A │ App B │ App C
Guest OS │ Guest OS │ Guest OS
Hypervisor (VMware/KVM/Hyper-V)
Host OS + Kernel
Physical Hardware

Container Architecture Stack

App A │ App B │ App C │ App D │ App E
Container Runtime (containerd/runc)
Host OS + Single Shared Kernel
Physical Hardware

Choose VMs When:

  • Running different OS kernels (Windows on Linux host)
  • Strong isolation required (hostile multi-tenancy)
  • Running untrusted code
  • Legacy applications with specific OS requirements
  • Compliance requires hardware-level isolation

Choose Containers When:

  • Fast startup/shutdown needed
  • High density required (many instances)
  • CI/CD pipelines and microservices
  • Development environment consistency
  • Resource efficiency is priority

Hybrid approach: Many organizations run containers inside VMs for defense-in-depth. The VM provides strong isolation between tenants, while containers provide density and speed within each tenant's environment.

The Shared Kernel Trade-off

The biggest difference is the kernel:

Aspect              VM                       Container
Kernel              Separate per VM          Shared
Kernel exploit      Affects one VM           Affects all containers
Different kernels   Yes (Windows on Linux)   No (Linux only)
Syscall filtering   Not needed               Critical (seccomp)
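
You can see the shared kernel for yourself: a container reports the host's kernel version, because there is no other kernel:

uname -r                         # kernel version on the host
docker run --rm alpine uname -r  # same version inside the container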

This is why container security requires defense-in-depth: seccomp, capabilities, AppArmor/SELinux, and careful image curation.

Security Layers

Containers need multiple security mechanisms since they share the kernel:

1. Capabilities

Linux capabilities split root's powers into ~40 individual capabilities:

# Default Docker drops these (among others):
CAP_SYS_ADMIN    # Mount, namespace operations
CAP_NET_ADMIN    # Network configuration
CAP_SYS_PTRACE   # Debug other processes
CAP_SYS_MODULE   # Load kernel modules

# Check container capabilities
docker run --rm alpine cat /proc/self/status | grep Cap
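
Beyond the defaults, you can tighten things further per container. A common hardening pattern is to drop everything and add back only what the workload needs (NET_BIND_SERVICE here is just an example):

# Drop all capabilities, then re-add only the ability to bind low ports
docker run --rm --cap-drop ALL --cap-add NET_BIND_SERVICE nginx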

2. seccomp

System call filtering - block dangerous syscalls entirely:

{ "defaultAction": "SCMP_ACT_ERRNO", "syscalls": [ { "names": ["read", "write", "exit", "futex", ...], "action": "SCMP_ACT_ALLOW" } ] }

Docker's default profile blocks ~44 syscalls including reboot, kexec_load, mount (in most cases).
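
Profiles are applied per container with --security-opt (profile.json below stands for your own profile file):

# Run with a custom seccomp profile
docker run --rm --security-opt seccomp=profile.json alpine ls

# Disable seccomp entirely (useful for debugging, risky in production)
docker run --rm --security-opt seccomp=unconfined alpine ls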

3. User Namespaces (Rootless)

Map container root to unprivileged host user:

# Inside container: UID 0 (root)
# On host:          an unprivileged UID from a subordinate range (e.g. 100000)

# With user-namespace remapping enabled on the daemon
# (daemon.json: "userns-remap": "default"), or with rootless Docker,
# container root is mapped to an unprivileged host UID:
docker run --rm alpine id
# uid=0(root) ... but the process runs as a high, unprivileged UID on the host

# Note: --userns=host opts a container OUT of remapping

4. Read-only Filesystems

docker run --read-only nginx

# Or with temporary write areas
docker run --read-only --tmpfs /tmp nginx

The OCI Specification

The Open Container Initiative standardizes container formats:

OCI Image Spec

Defines how images are structured:

  • Manifest: List of layers and config
  • Config: Runtime settings (env vars, entrypoint)
  • Layers: Filesystem tarballs
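
You can peek at this structure with docker image inspect; for instance, the layer digests of an image you have pulled (field layout can differ between Docker versions):

docker image inspect nginx --format '{{json .RootFS.Layers}}'
# ["sha256:...", "sha256:...", ...]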

OCI Runtime Spec

Defines config.json format for runc:

{ "ociVersion": "1.0.0", "process": { "terminal": false, "user": { "uid": 0, "gid": 0 }, "args": ["/bin/sh"], "env": ["PATH=/usr/bin:/bin"], "cwd": "/" }, "root": { "path": "rootfs", "readonly": false }, "linux": { "namespaces": [ { "type": "pid" }, { "type": "network" }, { "type": "mount" } ], "resources": { "memory": { "limit": 536870912 } } } }

You can run containers with just runc:

# Create OCI bundle
mkdir -p mycontainer/rootfs
cd mycontainer
docker export $(docker create alpine) | tar -C rootfs -xf -
runc spec   # Creates config.json

# Run container
sudo runc run mycontainer

Real-World Container Flow

What actually happens when you run docker run -d -p 8080:80 --memory=512m nginx:

The complete flow:

1. Docker CLI → Docker daemon (REST API)
   POST /containers/create
   POST /containers/{id}/start

2. Docker daemon → containerd (gRPC)
   - Pull image if needed (registry → local storage)
   - Create container metadata
   - Create OCI bundle (config.json + rootfs)

3. containerd → runc (exec)
   - runc init:   Set up namespaces, cgroups
   - runc create: Prepare container
   - runc start:  Execute entrypoint

4. runc performs:
   a. clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...)
   b. mkdir /sys/fs/cgroup/docker/{id}
   c. echo 536870912 > memory.max
   d. mount overlay filesystem
   e. pivot_root to new root
   f. Create veth pair, attach to docker0
   g. Drop capabilities, apply seccomp
   h. execve("/docker-entrypoint.sh")

5. Docker daemon:
   - Add iptables DNAT rule for port 8080→80
   - Register container in internal DB
   - Return container ID to CLI

6. nginx is now running:
   - PID 1 in its namespace
   - eth0 with 172.17.0.x IP
   - 512MB memory limit
   - Isolated filesystem view
   - Accepting connections on port 80
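
You can verify much of this from the host once the container is running (a sketch; cgroup path formats differ between cgroup v1 and v2 and across distros):

# Host PID of the container's init process (most recently created container)
PID=$(docker inspect --format '{{.State.Pid}}' $(docker ps -lq))

# Namespace inodes differ from your shell's
ls -l /proc/$PID/ns/

# The cgroup it was placed in
cat /proc/$PID/cgroup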

Common Container Patterns

Init Process Problem

Container PID 1 has special responsibilities:

  • Reap zombie processes
  • Handle signals properly

Many applications aren't designed to be PID 1. Solution: use a proper init:

# Use tini as init
docker run --init nginx

# Or in Dockerfile
FROM nginx
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["nginx", "-g", "daemon off;"]

Sidecar Containers

Share namespaces between containers:

# Main application
docker run -d --name app myapp

# Sidecar shares network namespace
docker run -d --network container:app logging-sidecar

Multi-stage Builds

Keep images small by separating build and runtime:

# Build stage
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
# Static binary so it runs on musl-based alpine
RUN CGO_ENABLED=0 go build -o /app/server

# Runtime stage (much smaller)
FROM alpine:3.18
COPY --from=builder /app/server /server
CMD ["/server"]

Essential Takeaways

  1. Containers are processes with namespaces (isolation) + cgroups (limits)
  2. No separate kernel - all containers share the host kernel
  3. OverlayFS stacks image layers with Copy-on-Write
  4. pivot_root switches the container's filesystem view
  5. veth pairs + bridge enable container networking
  6. Security requires layers: capabilities, seccomp, user namespaces
  7. OCI spec standardizes images and runtime configuration
  8. Container vs VM: containers = fast + dense, VMs = strong isolation

If you found this explanation helpful, consider sharing it with others.
