What is a Container, Really?
Here's a statement that surprises many developers: A container is just a Linux process.
There's no special "container" system call. No kernel module named "docker". When you run docker run nginx, you're ultimately just running the nginx process with some clever configuration:
- Namespaces make it see an isolated system
- cgroups limit what resources it can use
- OverlayFS gives it its own filesystem
- Security features restrict what it can do
That's it. A container is a regular process, wrapped in isolation and limits.
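You can see this process-level view directly: every Linux process, containerized or not, belongs to a set of namespaces, visible as symlinks under /proc/self/ns. A quick check in Python (assumes a Linux host):

```python
import os

# Every Linux process belongs to a set of namespaces; a "container"
# is just a process whose namespace links differ from the host's.
for ns in ("pid", "net", "mnt", "uts"):
    target = os.readlink(f"/proc/self/ns/{ns}")
    print(ns, "->", target)  # e.g. pid -> pid:[4026531836]
```

Two processes in the same namespace show the same inode number in brackets; a containerized process shows different ones from the host shell.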
Analogy: The Escape Room
Imagine putting someone in an escape room:
- Namespaces = The room's walls (they can't see the outside building)
- cgroups = A time limit and item restrictions (limited resources)
- OverlayFS = Props and furniture (a curated environment)
- seccomp/capabilities = Rules about what they can touch
The person is still in the same building (kernel), but their experience is completely controlled.
The Container Stack
Before diving into primitives, let's understand the layers involved when you run docker run:
```
┌─────────────────────────────────────────┐
│          Your Application               │
├─────────────────────────────────────────┤
│       Container Image (OCI)             │
├─────────────────────────────────────────┤
│   High-level Runtime (containerd)       │  ← Manages lifecycle, images
├─────────────────────────────────────────┤
│     Low-level Runtime (runc)            │  ← Actually creates containers
├─────────────────────────────────────────┤
│  Linux Kernel (namespaces, cgroups)     │
└─────────────────────────────────────────┘
```
| Component | Role | Example |
|---|---|---|
| OCI Image | Filesystem layers + config | nginx:latest |
| High-level runtime | Image management, lifecycle | containerd, CRI-O |
| Low-level runtime | Create/run containers | runc, crun, kata |
| Kernel primitives | Actual isolation | namespaces, cgroups |
Building a Container from Scratch
Watch step-by-step how a container runtime creates a container. Each step uses specific Linux syscalls:
1. Create a child process with new namespaces using the clone() syscall
2. Create a cgroup and set resource limits for the container
3. Create an overlay filesystem from the image layers
4. Switch the container's root filesystem and hide the host filesystem (pivot_root)
5. Create a veth pair and connect it to the bridge
6. Drop capabilities and apply a seccomp filter
7. Replace the init process with the container application (execve)
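The clone() call in step 1 combines namespace flags into a single bitmask. The flag values below are the actual constants from &lt;linux/sched.h&gt;; this sketch only shows how they combine, it does not invoke the syscall:

```python
# Namespace flag constants from <linux/sched.h>
CLONE_NEWNS   = 0x00020000  # mount namespace
CLONE_NEWUTS  = 0x04000000  # hostname/domainname
CLONE_NEWIPC  = 0x08000000  # System V IPC, POSIX message queues
CLONE_NEWUSER = 0x10000000  # UID/GID mappings
CLONE_NEWPID  = 0x20000000  # process IDs
CLONE_NEWNET  = 0x40000000  # network stack

# The combination a runtime might pass to clone()
flags = CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS
print(hex(flags))  # 0x60020000
```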
The Simplified Flow
```
# What docker run actually does (simplified)
1. Pull image layers        → /var/lib/docker/overlay2/
2. Create container directory
3. fork() → clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...)
4. Set up cgroups           → /sys/fs/cgroup/docker/<id>/
5. Mount overlay filesystem → /var/lib/docker/overlay2/<id>/merged
6. pivot_root()             → switch root to container filesystem
7. Configure network (veth pair → bridge)
8. Apply security (drop capabilities, seccomp)
9. execve()                 → run the application
```
Container Image Layers
Docker images use a layered filesystem: an image is a stack of read-only layers that OverlayFS merges into a single unified view, while all writes go to a writable top layer (Copy-on-Write). This enables efficient storage and fast distribution.
Key insight: Image layers are never modified. When you "change" a file, OverlayFS copies it to the container layer first (Copy-on-Write). This is why container images can be shared between many containers - they all use the same read-only layers!
How OverlayFS Works
OverlayFS merges multiple directories into a single unified view:
```
┌─────────────────────────────────────────┐
│     Merged View (container /)           │  ← What process sees
├─────────────────────────────────────────┤
│     Upper Layer (container writes)      │  ← Writable
├─────────────────────────────────────────┤
│     Lower Layer 3 (app code)            │  ← Read-only
├─────────────────────────────────────────┤
│     Lower Layer 2 (python)              │  ← Read-only
├─────────────────────────────────────────┤
│     Lower Layer 1 (alpine)              │  ← Read-only
└─────────────────────────────────────────┘
```
Key concepts:
| Concept | Description |
|---|---|
| Copy-on-Write | Modifying a file copies it to upper layer first |
| Whiteout files | Special files that "delete" lower layer files |
| Layer sharing | Multiple containers share the same lower layers |
| Image layers | Created by each Dockerfile instruction |
```shell
# Mount an overlay filesystem manually
mount -t overlay overlay \
  -o lowerdir=/layer3:/layer2:/layer1,upperdir=/container/upper,workdir=/container/work \
  /merged
```
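The merge, copy-up, and whiteout rules can be modeled in a few lines of Python. This is a toy model of OverlayFS semantics, with dicts standing in for directories, not the real filesystem:

```python
class ToyOverlay:
    """Toy model of OverlayFS: read-only lower layers, writable upper."""
    WHITEOUT = object()  # stands in for the 0/0 character device

    def __init__(self, lowers):
        self.lowers = lowers   # bottom-to-top order
        self.upper = {}        # writable layer

    def read(self, path):
        if path in self.upper:
            value = self.upper[path]
            if value is self.WHITEOUT:
                raise FileNotFoundError(path)  # "deleted" by whiteout
            return value
        for layer in reversed(self.lowers):    # topmost lower wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.upper[path] = data  # copy-up: lower layers never change

    def delete(self, path):
        self.upper[path] = self.WHITEOUT  # whiteout hides the lower file

alpine = {"/etc/os-release": "Alpine"}
app    = {"/app/main.py": "print('hi')"}
ov = ToyOverlay([alpine, app])

ov.write("/etc/os-release", "patched")
print(ov.read("/etc/os-release"))   # patched
print(alpine["/etc/os-release"])    # Alpine  (lower layer untouched)
ov.delete("/app/main.py")           # reading it now raises FileNotFoundError
```

Note how the shared lower layer is never mutated: that is exactly why many containers can safely share one image.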
Container Networking
Containers get network isolation via network namespaces, but need connectivity. Docker uses several patterns:
Bridge Networking (Default)
```
┌─────────────────────────────────────────────────────────┐
│                          Host                           │
│  ┌─────────┐      ┌─────────────┐     ┌─────────────┐   │
│  │Container│      │  Container  │     │    Host     │   │
│  │  eth0   │      │    eth0     │     │   Network   │   │
│  │172.17.0.2      │ 172.17.0.3  │     │192.168.1.100│   │
│  └────┬────┘      └──────┬──────┘     └──────┬──────┘   │
│       │ veth pair        │ veth pair         │          │
│  ┌────┴──────────────────┴────────────────────┴────┐    │
│  │                docker0 (bridge)                 │    │
│  │                   172.17.0.1                    │    │
│  └────────────────────────┬────────────────────────┘    │
│                           │ NAT (iptables)              │
└───────────────────────────┼─────────────────────────────┘
                            │
                        Internet
```
Port Mapping
When you run docker run -p 8080:80 nginx:
- Docker creates an iptables DNAT rule
- Traffic to host:8080 redirects to container:80
- Container responds through same path
```shell
# View Docker's iptables rules
iptables -t nat -L DOCKER -n
# DNAT  tcp  --  0.0.0.0/0  0.0.0.0/0  tcp dpt:8080 to:172.17.0.2:80
```
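The DNAT rule is just a rewrite of the packet's destination. A hypothetical lookup table makes the translation explicit; this sketches the rule's effect, not iptables itself:

```python
# host_port -> (container_ip, container_port), like Docker's DNAT rules
port_map = {8080: ("172.17.0.2", 80)}

def dnat(dst_ip, dst_port):
    """Rewrite the destination if a port-mapping rule matches."""
    if dst_port in port_map:
        return port_map[dst_port]
    return (dst_ip, dst_port)  # no rule: packet passes unchanged

print(dnat("192.168.1.100", 8080))  # ('172.17.0.2', 80)
print(dnat("192.168.1.100", 22))    # ('192.168.1.100', 22)
```

Return traffic is handled by connection tracking, which reverses the rewrite automatically.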
Container vs Virtual Machine
Understand when to use each technology:
Both have their place; the choice depends on your requirements.

VM Architecture Stack
- Hypervisor (VMware, KVM, Hyper-V)
- Complete guest OS kernel
- Emulated hardware (CPU, RAM, NIC)
- Each VM has its own kernel

Container Architecture Stack
- Container runtime (Docker, containerd)
- Shared host kernel
- Namespace isolation (PID, NET, MNT)
- cgroups for resource limits
Choose VMs When:
- Running different OS kernels (Windows on a Linux host)
- Strong isolation required (hostile multi-tenancy)
- Running untrusted code
- Legacy applications with specific OS requirements
- Compliance requires hardware-level isolation
Choose Containers When:
- Fast startup/shutdown needed
- High density required (many instances)
- CI/CD pipelines and microservices
- Development environment consistency
- Resource efficiency is a priority
Hybrid approach: Many organizations run containers inside VMs for defense-in-depth. The VM provides strong isolation between tenants, while containers provide density and speed within each tenant's environment.
The Shared Kernel Trade-off
The biggest difference is the kernel:
| Aspect | VM | Container |
|---|---|---|
| Kernel | Separate per VM | Shared |
| Kernel exploit | Affects one VM | Affects all containers |
| Different kernels | Yes (Windows on Linux) | No (Linux only) |
| Syscall filtering | Not needed | Critical (seccomp) |
This is why container security requires defense-in-depth: seccomp, capabilities, AppArmor/SELinux, and careful image curation.
Security Layers
Containers need multiple security mechanisms since they share the kernel:
1. Capabilities
Linux capabilities split root's powers into ~40 individual capabilities:
```shell
# Default Docker drops these (among others):
CAP_SYS_ADMIN    # Mount, namespace operations
CAP_NET_ADMIN    # Network configuration
CAP_SYS_PTRACE   # Debug other processes
CAP_SYS_MODULE   # Load kernel modules

# Check container capabilities
docker run --rm alpine cat /proc/self/status | grep Cap
```
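The CapEff value printed by that command is a bitmask; each bit position is one capability. Decoding a typical default-container value (0xa80425fb) shows which powers remain. The bit positions below come from &lt;linux/capability.h&gt;; only a handful of the ~40 are listed:

```python
# A few capability bit positions from <linux/capability.h>
CAPS = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 5: "CAP_KILL",
    7: "CAP_SETUID", 10: "CAP_NET_BIND_SERVICE", 13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT", 21: "CAP_SYS_ADMIN",
}

def decode(capeff_hex):
    """Return the known capability names set in a CapEff bitmask."""
    mask = int(capeff_hex, 16)
    return [name for bit, name in sorted(CAPS.items()) if mask & (1 << bit)]

docker_default = "00000000a80425fb"  # typical CapEff in a default container
print(decode(docker_default))
print("CAP_SYS_ADMIN" in decode(docker_default))  # False: bit 21 is dropped
```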
2. seccomp
System call filtering - block dangerous syscalls entirely:
```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit", "futex", ...],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```
Docker's default profile blocks ~44 syscalls including reboot, kexec_load, mount (in most cases).
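The decision logic behind a profile is simple: if the syscall name matches an allow rule, permit it; otherwise fall back to defaultAction. A toy model of that decision (not real seccomp, which filters in the kernel by syscall number):

```python
# Toy seccomp-profile evaluator; the profile mirrors the JSON above
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit", "futex"],
         "action": "SCMP_ACT_ALLOW"},
    ],
}

def decide(syscall):
    """Return the action a profile would take for a syscall name."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]

print(decide("write"))   # SCMP_ACT_ALLOW
print(decide("reboot"))  # SCMP_ACT_ERRNO (blocked: returns an error)
```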
3. User Namespaces (Rootless)
Map container root to unprivileged host user:
```shell
# Inside the container: UID 0 (root)
# On the host: an unprivileged high UID (e.g., 100000)

# With userns-remap enabled on the daemon (or rootless Docker):
docker run --rm alpine id
# uid=0(root) ... but this maps to an unprivileged UID on the host
```
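The kernel applies the mapping as plain range arithmetic over the entries written to /proc/&lt;pid&gt;/uid_map (inside UID, outside UID, length). A sketch of the translation, with an illustrative single-range map:

```python
# One uid_map entry: container UIDs 0..65535 -> host UIDs 100000..165535
uid_map = [(0, 100000, 65536)]  # (inside_start, outside_start, count)

def to_host_uid(container_uid):
    """Translate a container UID to the host UID it really runs as."""
    for inside, outside, count in uid_map:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    raise ValueError("UID not mapped")

print(to_host_uid(0))     # 100000 -> container "root" is unprivileged on host
print(to_host_uid(1000))  # 101000
```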
4. Read-only Filesystems
```shell
docker run --read-only nginx

# Or with temporary write areas
docker run --read-only --tmpfs /tmp nginx
```
The OCI Specification
The Open Container Initiative standardizes container formats:
OCI Image Spec
Defines how images are structured:
- Manifest: List of layers and config
- Config: Runtime settings (env vars, entrypoint)
- Layers: Filesystem tarballs
OCI Runtime Spec
Defines config.json format for runc:
```json
{
  "ociVersion": "1.0.0",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["/bin/sh"],
    "env": ["PATH=/usr/bin:/bin"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ],
    "resources": {
      "memory": { "limit": 536870912 }
    }
  }
}
```
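Because config.json is plain JSON, tools generate it programmatically. A minimal sketch with the same shape as the example above (values illustrative, not a complete spec-compliant config):

```python
import json

# Build a minimal OCI runtime config as a plain dict
config = {
    "ociVersion": "1.0.0",
    "process": {"args": ["/bin/sh"], "cwd": "/",
                "env": ["PATH=/usr/bin:/bin"]},
    "root": {"path": "rootfs", "readonly": False},
    "linux": {
        "namespaces": [{"type": t} for t in ("pid", "network", "mount")],
        "resources": {"memory": {"limit": 512 * 1024 * 1024}},
    },
}

text = json.dumps(config, indent=2)  # what you would write to config.json
print(json.loads(text)["linux"]["resources"]["memory"]["limit"])  # 536870912
```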
You can run containers with just runc:
```shell
# Create OCI bundle
mkdir -p mycontainer/rootfs
cd mycontainer
docker export $(docker create alpine) | tar -C rootfs -xf -
runc spec   # Creates config.json

# Run container
sudo runc run mycontainer
```
Real-World Container Flow
What actually happens when you run docker run -d -p 8080:80 --memory=512m nginx:
📋 Complete Flow
```
1. Docker CLI → Docker daemon (REST API)
   POST /containers/create
   POST /containers/{id}/start

2. Docker daemon → containerd (gRPC)
   - Pull image if needed (registry → local storage)
   - Create container metadata
   - Create OCI bundle (config.json + rootfs)

3. containerd → runc (exec)
   - runc init:   Set up namespaces, cgroups
   - runc create: Prepare container
   - runc start:  Execute entrypoint

4. runc performs:
   a. clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...)
   b. mkdir /sys/fs/cgroup/docker/{id}
   c. echo 536870912 > memory.max
   d. mount overlay filesystem
   e. pivot_root to new root
   f. Create veth pair, attach to docker0
   g. Drop capabilities, apply seccomp
   h. execve("/docker-entrypoint.sh")

5. Docker daemon:
   - Add iptables DNAT rule for port 8080→80
   - Register container in internal DB
   - Return container ID to CLI

6. nginx is now running:
   - PID 1 in its namespace
   - eth0 with 172.17.0.x IP
   - 512MB memory limit
   - Isolated filesystem view
   - Accepting connections on port 80
```
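The 536870912 written to memory.max is just --memory=512m converted to bytes. A hypothetical helper for Docker-style size strings shows the arithmetic (Docker treats k/m/g as binary units here):

```python
# Docker-style binary units: k/m/g are powers of 1024
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_size(spec):
    """Convert a size string like '512m' to bytes."""
    spec = spec.strip().lower()
    if spec and spec[-1] in UNITS:
        return int(spec[:-1]) * UNITS[spec[-1]]
    return int(spec)  # bare number means bytes

print(parse_size("512m"))  # 536870912
print(parse_size("1g"))    # 1073741824
```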
Common Container Patterns
Init Process Problem
Container PID 1 has special responsibilities:
- Reap zombie processes
- Handle signals properly
Many applications aren't designed to be PID 1. Solution: use a proper init:
```shell
# Use tini as init
docker run --init nginx
```

```dockerfile
# Or in a Dockerfile
FROM nginx
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["nginx", "-g", "daemon off;"]
```
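What "reaping" means in practice: the init process must collect child exit statuses with waitpid(), or dead children linger as zombies. A minimal Python sketch of the reaping loop; tini does essentially this, plus signal forwarding:

```python
import os
import time

def reap_children():
    """Collect every terminated child, PID-1 style, until none remain."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return reaped        # no children left: everything reaped
        if pid == 0:
            time.sleep(0.01)     # children alive but none exited yet; poll
            continue
        reaped.append(pid)

# Fork three children that exit immediately (zombies until reaped)
children = []
for _ in range(3):
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    children.append(pid)

reaped = reap_children()
print(sorted(reaped) == sorted(children))  # True
```

A real init would run this loop on every SIGCHLD rather than polling.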
Sidecar Containers
Share namespaces between containers:
```shell
# Main application
docker run -d --name app myapp

# Sidecar shares network namespace
docker run -d --network container:app logging-sidecar
```
Multi-stage Builds
Keep images small by separating build and runtime:
```dockerfile
# Build stage
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN go build -o /app/server

# Runtime stage (much smaller)
FROM alpine:3.18
COPY --from=builder /app/server /server
CMD ["/server"]
```
Essential Takeaways
- A container is just a Linux process: isolated by namespaces, limited by cgroups, given its own filesystem by OverlayFS
- Image layers are read-only and shared; writes go to a per-container upper layer via Copy-on-Write
- Containers share the host kernel, so security relies on defense-in-depth: capabilities, seccomp, user namespaces, read-only filesystems
- The OCI specs standardize images and runtimes; you can run a container with runc alone
Related Concepts
- Linux Namespaces: Deep dive into the seven namespace types
- Linux cgroups: Understanding resource limits and controllers
- Process Management: fork(), exec(), and process lifecycle
- Kernel Architecture: How the kernel provides these primitives
