What is a Container, Really?
Here's a statement that surprises many developers: A container is just a Linux process.
There's no special "container" system call. No kernel module named "docker". When you run docker run nginx, you're ultimately just running the nginx process with some clever configuration:
- Namespaces make it see an isolated system
- cgroups limit what resources it can use
- OverlayFS gives it its own filesystem
- Security features restrict what it can do
That's it. A container is a regular process, wrapped in isolation and limits.
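You can see this process-level view directly: every Linux process, containerized or not, belongs to a set of namespaces, visible as symlinks under /proc/self/ns. A quick check in Python (assumes a Linux host):

```python
import os

# Every Linux process belongs to a set of namespaces; a "container"
# is just a process whose namespace links differ from the host's.
for ns in ("pid", "net", "mnt", "uts"):
    target = os.readlink(f"/proc/self/ns/{ns}")
    print(ns, "->", target)  # e.g. pid -> pid:[4026531836]
```

Two processes in the same namespace show the same inode number in brackets; a containerized process shows different ones from the host shell.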
Analogy: The Escape Room
Imagine putting someone in an escape room:
- Namespaces = The room's walls (they can't see the outside building)
- cgroups = A time limit and item restrictions (limited resources)
- OverlayFS = Props and furniture (a curated environment)
- seccomp/capabilities = Rules about what they can touch
The person is still in the same building (kernel), but their experience is completely controlled.
The Container Stack
Before diving into primitives, let's understand the layers involved when you run docker run:
```
┌─────────────────────────────────────────┐
│          Your Application               │
├─────────────────────────────────────────┤
│       Container Image (OCI)             │
├─────────────────────────────────────────┤
│   High-level Runtime (containerd)       │  ← Manages lifecycle, images
├─────────────────────────────────────────┤
│     Low-level Runtime (runc)            │  ← Actually creates containers
├─────────────────────────────────────────┤
│  Linux Kernel (namespaces, cgroups)     │
└─────────────────────────────────────────┘
```
| Component | Role | Example |
|---|---|---|
| OCI Image | Filesystem layers + config | nginx:latest |
| High-level runtime | Image management, lifecycle | containerd, CRI-O |
| Low-level runtime | Create/run containers | runc, crun, kata |
| Kernel primitives | Actual isolation | namespaces, cgroups |
Building a Container from Scratch
Watch step-by-step how a container runtime creates a container. Each step uses specific Linux syscalls:
1. Create a child process with new namespaces using the clone() syscall
2. Create a cgroup and set resource limits for the container
3. Create an overlay filesystem from the image layers
4. Switch the container's root filesystem and hide the host filesystem (pivot_root)
5. Create a veth pair and connect it to the bridge
6. Drop capabilities and apply a seccomp filter
7. Replace the init process with the container application (execve)
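The clone() call in step 1 combines namespace flags into a single bitmask. The flag values below are the actual constants from &lt;linux/sched.h&gt;; this sketch only shows how they combine, it does not invoke the syscall:

```python
# Namespace flag constants from <linux/sched.h>
CLONE_NEWNS   = 0x00020000  # mount namespace
CLONE_NEWUTS  = 0x04000000  # hostname/domainname
CLONE_NEWIPC  = 0x08000000  # System V IPC, POSIX message queues
CLONE_NEWUSER = 0x10000000  # UID/GID mappings
CLONE_NEWPID  = 0x20000000  # process IDs
CLONE_NEWNET  = 0x40000000  # network stack

# The combination a runtime might pass to clone()
flags = CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS
print(hex(flags))  # 0x60020000
```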
The Simplified Flow
```
# What docker run actually does (simplified)
1. Pull image layers        → /var/lib/docker/overlay2/
2. Create container directory
3. fork() → clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...)
4. Set up cgroups           → /sys/fs/cgroup/docker/<id>/
5. Mount overlay filesystem → /var/lib/docker/overlay2/<id>/merged
6. pivot_root()             → switch root to container filesystem
7. Configure network (veth pair → bridge)
8. Apply security (drop capabilities, seccomp)
9. execve()                 → run the application
```
Container Image Layers
Docker images use a layered filesystem: an image is a stack of read-only layers that OverlayFS merges into a single unified view, while all writes go to a writable top layer (Copy-on-Write). This enables efficient storage and fast distribution.
Key insight: Image layers are never modified. When you "change" a file, OverlayFS copies it to the container layer first (Copy-on-Write). This is why container images can be shared between many containers - they all use the same read-only layers!
How OverlayFS Works
OverlayFS merges multiple directories into a single unified view:
```
┌─────────────────────────────────────────┐
│     Merged View (container /)           │  ← What process sees
├─────────────────────────────────────────┤
│     Upper Layer (container writes)      │  ← Writable
├─────────────────────────────────────────┤
│     Lower Layer 3 (app code)            │  ← Read-only
├─────────────────────────────────────────┤
│     Lower Layer 2 (python)              │  ← Read-only
├─────────────────────────────────────────┤
│     Lower Layer 1 (alpine)              │  ← Read-only
└─────────────────────────────────────────┘
```
Key concepts:
| Concept | Description |
|---|---|
| Copy-on-Write | Modifying a file copies it to upper layer first |
| Whiteout files | Special files that "delete" lower layer files |
| Layer sharing | Multiple containers share the same lower layers |
| Image layers | Created by each Dockerfile instruction |
```shell
# Mount an overlay filesystem manually
mount -t overlay overlay \
  -o lowerdir=/layer3:/layer2:/layer1,upperdir=/container/upper,workdir=/container/work \
  /merged
```
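The merge, copy-up, and whiteout rules can be modeled in a few lines of Python. This is a toy model of OverlayFS semantics, with dicts standing in for directories, not the real filesystem:

```python
class ToyOverlay:
    """Toy model of OverlayFS: read-only lower layers, writable upper."""
    WHITEOUT = object()  # stands in for the 0/0 character device

    def __init__(self, lowers):
        self.lowers = lowers   # bottom-to-top order
        self.upper = {}        # writable layer

    def read(self, path):
        if path in self.upper:
            value = self.upper[path]
            if value is self.WHITEOUT:
                raise FileNotFoundError(path)  # "deleted" by whiteout
            return value
        for layer in reversed(self.lowers):    # topmost lower wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.upper[path] = data  # copy-up: lower layers never change

    def delete(self, path):
        self.upper[path] = self.WHITEOUT  # whiteout hides the lower file

alpine = {"/etc/os-release": "Alpine"}
app    = {"/app/main.py": "print('hi')"}
ov = ToyOverlay([alpine, app])

ov.write("/etc/os-release", "patched")
print(ov.read("/etc/os-release"))   # patched
print(alpine["/etc/os-release"])    # Alpine  (lower layer untouched)
ov.delete("/app/main.py")           # reading it now raises FileNotFoundError
```

Note how the shared lower layer is never mutated: that is exactly why many containers can safely share one image.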
Container Networking
Containers get network isolation via network namespaces, but need connectivity. Docker uses several patterns:
Bridge Networking (Default)
```
┌─────────────────────────────────────────────────────────┐
│                          Host                           │
│  ┌─────────┐      ┌─────────────┐     ┌─────────────┐   │
│  │Container│      │  Container  │     │    Host     │   │
│  │  eth0   │      │    eth0     │     │   Network   │   │
│  │172.17.0.2      │ 172.17.0.3  │     │192.168.1.100│   │
│  └────┬────┘      └──────┬──────┘     └──────┬──────┘   │
│       │ veth pair        │ veth pair         │          │
│  ┌────┴──────────────────┴────────────────────┴────┐    │
│  │                docker0 (bridge)                 │    │
│  │                   172.17.0.1                    │    │
│  └────────────────────────┬────────────────────────┘    │
│                           │ NAT (iptables)              │
└───────────────────────────┼─────────────────────────────┘
                            │
                        Internet
```
Port Mapping
When you run docker run -p 8080:80 nginx:
- Docker creates an iptables DNAT rule
- Traffic to host:8080 redirects to container:80
- Container responds through same path
```shell
# View Docker's iptables rules
iptables -t nat -L DOCKER -n
# DNAT  tcp  --  0.0.0.0/0  0.0.0.0/0  tcp dpt:8080 to:172.17.0.2:80
```
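The DNAT rule is just a rewrite of the packet's destination. A hypothetical lookup table makes the translation explicit; this sketches the rule's effect, not iptables itself:

```python
# host_port -> (container_ip, container_port), like Docker's DNAT rules
port_map = {8080: ("172.17.0.2", 80)}

def dnat(dst_ip, dst_port):
    """Rewrite the destination if a port-mapping rule matches."""
    if dst_port in port_map:
        return port_map[dst_port]
    return (dst_ip, dst_port)  # no rule: packet passes unchanged

print(dnat("192.168.1.100", 8080))  # ('172.17.0.2', 80)
print(dnat("192.168.1.100", 22))    # ('192.168.1.100', 22)
```

Return traffic is handled by connection tracking, which reverses the rewrite automatically.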
Container vs Virtual Machine
Understand when to use each technology:
Both have their place; the choice depends on your requirements.

VM Architecture Stack
- Hypervisor (VMware, KVM, Hyper-V)
- Complete guest OS kernel
- Emulated hardware (CPU, RAM, NIC)
- Each VM has its own kernel

Container Architecture Stack
- Container runtime (Docker, containerd)
- Shared host kernel
- Namespace isolation (PID, NET, MNT)
- cgroups for resource limits
Choose VMs When:
- Running different OS kernels (Windows on a Linux host)
- Strong isolation required (hostile multi-tenancy)
- Running untrusted code
- Legacy applications with specific OS requirements
- Compliance requires hardware-level isolation
Choose Containers When:
- Fast startup/shutdown needed
- High density required (many instances)
- CI/CD pipelines and microservices
- Development environment consistency
- Resource efficiency is a priority
Hybrid approach: Many organizations run containers inside VMs for defense-in-depth. The VM provides strong isolation between tenants, while containers provide density and speed within each tenant's environment.
The Shared Kernel Trade-off
The biggest difference is the kernel:
| Aspect | VM | Container |
|---|---|---|
| Kernel | Separate per VM | Shared |
| Kernel exploit | Affects one VM | Affects all containers |
| Different kernels | Yes (Windows on Linux) | No (Linux only) |
| Syscall filtering | Not needed | Critical (seccomp) |
This is why container security requires defense-in-depth: seccomp, capabilities, AppArmor/SELinux, and careful image curation.
Security Layers
Containers need multiple security mechanisms since they share the kernel:
1. Capabilities
Linux capabilities split root's powers into ~40 individual capabilities:
```shell
# Default Docker drops these (among others):
CAP_SYS_ADMIN    # Mount, namespace operations
CAP_NET_ADMIN    # Network configuration
CAP_SYS_PTRACE   # Debug other processes
CAP_SYS_MODULE   # Load kernel modules

# Check container capabilities
docker run --rm alpine cat /proc/self/status | grep Cap
```
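The CapEff value printed by that command is a bitmask; each bit position is one capability. Decoding a typical default-container value (0xa80425fb) shows which powers remain. The bit positions below come from &lt;linux/capability.h&gt;; only a handful of the ~40 are listed:

```python
# A few capability bit positions from <linux/capability.h>
CAPS = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 5: "CAP_KILL",
    7: "CAP_SETUID", 10: "CAP_NET_BIND_SERVICE", 13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT", 21: "CAP_SYS_ADMIN",
}

def decode(capeff_hex):
    """Return the known capability names set in a CapEff bitmask."""
    mask = int(capeff_hex, 16)
    return [name for bit, name in sorted(CAPS.items()) if mask & (1 << bit)]

docker_default = "00000000a80425fb"  # typical CapEff in a default container
print(decode(docker_default))
print("CAP_SYS_ADMIN" in decode(docker_default))  # False: bit 21 is dropped
```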
2. seccomp
System call filtering - block dangerous syscalls entirely:
```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit", "futex", ...],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```
Docker's default profile blocks ~44 syscalls including reboot, kexec_load, mount (in most cases).
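The decision logic behind a profile is simple: if the syscall name matches an allow rule, permit it; otherwise fall back to defaultAction. A toy model of that decision (not real seccomp, which filters in the kernel by syscall number):

```python
# Toy seccomp-profile evaluator; the profile mirrors the JSON above
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit", "futex"],
         "action": "SCMP_ACT_ALLOW"},
    ],
}

def decide(syscall):
    """Return the action a profile would take for a syscall name."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]

print(decide("write"))   # SCMP_ACT_ALLOW
print(decide("reboot"))  # SCMP_ACT_ERRNO (blocked: returns an error)
```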
3. User Namespaces (Rootless)
Map container root to unprivileged host user:
```shell
# Inside the container: UID 0 (root)
# On the host: an unprivileged high UID (e.g., 100000)

# With userns-remap enabled on the daemon (or rootless Docker):
docker run --rm alpine id
# uid=0(root) ... but this maps to an unprivileged UID on the host
```
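The kernel applies the mapping as plain range arithmetic over the entries written to /proc/&lt;pid&gt;/uid_map (inside UID, outside UID, length). A sketch of the translation, with an illustrative single-range map:

```python
# One uid_map entry: container UIDs 0..65535 -> host UIDs 100000..165535
uid_map = [(0, 100000, 65536)]  # (inside_start, outside_start, count)

def to_host_uid(container_uid):
    """Translate a container UID to the host UID it really runs as."""
    for inside, outside, count in uid_map:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    raise ValueError("UID not mapped")

print(to_host_uid(0))     # 100000 -> container "root" is unprivileged on host
print(to_host_uid(1000))  # 101000
```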
4. Read-only Filesystems
```shell
docker run --read-only nginx

# Or with temporary write areas
docker run --read-only --tmpfs /tmp nginx
```
The OCI Specification
The Open Container Initiative standardizes container formats:
OCI Image Spec
Defines how images are structured:
- Manifest: List of layers and config
- Config: Runtime settings (env vars, entrypoint)
- Layers: Filesystem tarballs
OCI Runtime Spec
Defines config.json format for runc:
```json
{
  "ociVersion": "1.0.0",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["/bin/sh"],
    "env": ["PATH=/usr/bin:/bin"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ],
    "resources": {
      "memory": { "limit": 536870912 }
    }
  }
}
```
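Because config.json is plain JSON, tools generate it programmatically. A minimal sketch with the same shape as the example above (values illustrative, not a complete spec-compliant config):

```python
import json

# Build a minimal OCI runtime config as a plain dict
config = {
    "ociVersion": "1.0.0",
    "process": {"args": ["/bin/sh"], "cwd": "/",
                "env": ["PATH=/usr/bin:/bin"]},
    "root": {"path": "rootfs", "readonly": False},
    "linux": {
        "namespaces": [{"type": t} for t in ("pid", "network", "mount")],
        "resources": {"memory": {"limit": 512 * 1024 * 1024}},
    },
}

text = json.dumps(config, indent=2)  # what you would write to config.json
print(json.loads(text)["linux"]["resources"]["memory"]["limit"])  # 536870912
```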
You can run containers with just runc:
```shell
# Create OCI bundle
mkdir -p mycontainer/rootfs
cd mycontainer
docker export $(docker create alpine) | tar -C rootfs -xf -
runc spec   # Creates config.json

# Run container
sudo runc run mycontainer
```
Real-World Container Flow
What actually happens when you run docker run -d -p 8080:80 --memory=512m nginx:
📋 Complete Flow
```
1. Docker CLI → Docker daemon (REST API)
   POST /containers/create
   POST /containers/{id}/start

2. Docker daemon → containerd (gRPC)
   - Pull image if needed (registry → local storage)
   - Create container metadata
   - Create OCI bundle (config.json + rootfs)

3. containerd → runc (exec)
   - runc init:   Set up namespaces, cgroups
   - runc create: Prepare container
   - runc start:  Execute entrypoint

4. runc performs:
   a. clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...)
   b. mkdir /sys/fs/cgroup/docker/{id}
   c. echo 536870912 > memory.max
   d. mount overlay filesystem
   e. pivot_root to new root
   f. Create veth pair, attach to docker0
   g. Drop capabilities, apply seccomp
   h. execve("/docker-entrypoint.sh")

5. Docker daemon:
   - Add iptables DNAT rule for port 8080→80
   - Register container in internal DB
   - Return container ID to CLI

6. nginx is now running:
   - PID 1 in its namespace
   - eth0 with 172.17.0.x IP
   - 512MB memory limit
   - Isolated filesystem view
   - Accepting connections on port 80
```
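The 536870912 written to memory.max is just --memory=512m converted to bytes. A hypothetical helper for Docker-style size strings shows the arithmetic (Docker treats k/m/g as binary units here):

```python
# Docker-style binary units: k/m/g are powers of 1024
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_size(spec):
    """Convert a size string like '512m' to bytes."""
    spec = spec.strip().lower()
    if spec and spec[-1] in UNITS:
        return int(spec[:-1]) * UNITS[spec[-1]]
    return int(spec)  # bare number means bytes

print(parse_size("512m"))  # 536870912
print(parse_size("1g"))    # 1073741824
```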
Common Container Patterns
Init Process Problem
Container PID 1 has special responsibilities:
- Reap zombie processes
- Handle signals properly
Many applications aren't designed to be PID 1. Solution: use a proper init:
```shell
# Use tini as init
docker run --init nginx
```

```dockerfile
# Or in a Dockerfile
FROM nginx
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["nginx", "-g", "daemon off;"]
```
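What "reaping" means in practice: the init process must collect child exit statuses with waitpid(), or dead children linger as zombies. A minimal Python sketch of the reaping loop; tini does essentially this, plus signal forwarding:

```python
import os
import time

def reap_children():
    """Collect every terminated child, PID-1 style, until none remain."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return reaped        # no children left: everything reaped
        if pid == 0:
            time.sleep(0.01)     # children alive but none exited yet; poll
            continue
        reaped.append(pid)

# Fork three children that exit immediately (zombies until reaped)
children = []
for _ in range(3):
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    children.append(pid)

reaped = reap_children()
print(sorted(reaped) == sorted(children))  # True
```

A real init would run this loop on every SIGCHLD rather than polling.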
Sidecar Containers
Share namespaces between containers:
```shell
# Main application
docker run -d --name app myapp

# Sidecar shares network namespace
docker run -d --network container:app logging-sidecar
```
Multi-stage Builds
Keep images small by separating build and runtime:
```dockerfile
# Build stage
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN go build -o /app/server

# Runtime stage (much smaller)
FROM alpine:3.18
COPY --from=builder /app/server /server
CMD ["/server"]
```
Essential Takeaways
- A container is just a Linux process: isolated by namespaces, limited by cgroups, given its own filesystem by OverlayFS
- Image layers are read-only and shared; writes go to a per-container upper layer via Copy-on-Write
- Containers share the host kernel, so security relies on defense-in-depth: capabilities, seccomp, user namespaces, read-only filesystems
- The OCI specs standardize images and runtimes; you can run a container with runc alone
Related Concepts
- Linux Namespaces: Deep dive into the seven namespace types
- Linux cgroups: Understanding resource limits and controllers
- Process Management: fork(), exec(), and process lifecycle
- Kernel Architecture: How the kernel provides these primitives
