Containers Aren’t Magic
When you run docker run nginx, something remarkable happens. The nginx process gets its own hostname, its own filesystem starting from /, its own PID 1, its own network interfaces — yet there’s no hypervisor, no separate kernel, no virtual hardware. It’s still just a Linux process.
The mechanism behind this illusion is namespaces — a kernel feature that gives different processes different views of system resources. A process inside a namespace sees a constructed reality: it believes it’s the only thing running on the machine, but the host kernel knows better.
Think of it like The Truman Show. Truman lives in a complete, self-consistent world. Everything he interacts with — the sky, the buildings, the people — is real to him, but it’s actually a constructed set inside a much larger studio. Namespaces work the same way: each container lives in a constructed set of system resources, inside the much larger host.
What Makes a Container?
A container is not a single kernel feature. It's a combination of namespaces, cgroup resource limits, and security restrictions. This article covers six namespace types (Linux also has IPC and time namespaces, which are not discussed in detail here). Each namespace isolates a different aspect of the system, and they're not equally important. The mount namespace is the bedrock: without filesystem isolation, nothing else matters. The user namespace provides the critical security boundary. The others fill in the gaps.
Namespace Importance Hierarchy
Six Linux namespace types ranked by their role in container isolation:
| Rank | Flag | Role | What it isolates | Why it matters | Kernel struct field |
|---|---|---|---|---|---|
| 1 | CLONE_NEWNS | Filesystem view isolation | Mount table, filesystem hierarchy | Without filesystem isolation, containers share the host's entire filesystem tree at / | task_struct→nsproxy→mnt_ns |
| 2 | CLONE_NEWUSER | UID/GID mapping | User and group IDs, capabilities | Enables rootless containers by mapping container root (UID 0) to an unprivileged host user | task_struct→cred→user_ns |
| 3 | CLONE_NEWPID | Process tree isolation | Process ID number space | Container gets its own PID 1 init process and can't see host processes | task_struct→nsproxy→pid_ns_for_children |
| 4 | CLONE_NEWNET | Network stack isolation | Network interfaces, routing tables, iptables, sockets | Containers get their own IP addresses and independent port space | task_struct→nsproxy→net_ns |
| 5 | CLONE_NEWCGROUP | Cgroup root isolation | Cgroup filesystem view | Container sees its own cgroup as the root and can't inspect the host hierarchy | task_struct→nsproxy→cgroup_ns |
| 6 | CLONE_NEWUTS | Hostname isolation | Hostname, domain name | Each container can set and advertise its own hostname independently | task_struct→nsproxy→uts_ns |
6 namespaces + cgroups + security = container
Every process in Linux has a task_struct containing pointers to its namespace memberships through the nsproxy structure:
    struct nsproxy {                                   // abridged
        struct uts_namespace    *uts_ns;               // hostname
        struct ipc_namespace    *ipc_ns;               // IPC resources
        struct mnt_namespace    *mnt_ns;               // mount table
        struct pid_namespace    *pid_ns_for_children;  // PID number space for new children
        struct net              *net_ns;               // network stack
        struct cgroup_namespace *cgroup_ns;            // cgroup view
        // user_ns is on the credential struct (task_struct->cred->user_ns), not nsproxy
    };
When a process makes a syscall that depends on isolated resources — listing PIDs, looking up network interfaces, resolving mount points — the kernel consults these pointers to determine what the process should see.
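These namespace memberships are visible from user space as symbolic links under /proc/[pid]/ns/. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names `parse_ns_link` and `namespace_ids` are made up for illustration. Two processes that show the same inode number for an entry are in the same namespace of that type.

```python
import os
import re

# Link targets look like "mnt:[4026531840]": a type name and an inode number.
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'mnt:[4026531840]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespace_ids(pid="self"):
    """Return {entry_name: (type, inode)} for one process.

    Returns an empty dict when /proc/<pid>/ns cannot be read
    (no such process, or not a Linux system).
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result
    for name in sorted(entries):
        try:
            target = os.readlink(os.path.join(ns_dir, name))
            result[name] = parse_ns_link(target)
        except (OSError, ValueError):
            continue  # entry not readable in this environment
    return result


if __name__ == "__main__":
    for name, (ns_type, inode) in namespace_ids().items():
        print(f"{name:20s} {ns_type}:[{inode}]")
```

Comparing `namespace_ids()` for a shell on the host with `namespace_ids(pid)` for a containerized process should show different inode numbers for each namespace type the container does not share with the host.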
Mount Namespace: The Bedrock
Mount namespaces were the first namespace type, added to Linux 2.4.19 in 2002. They were originally just called “namespaces” because nobody imagined there would be others. The CLONE_NEWNS flag still reflects this — “new namespace,” not “new mount namespace.”
Why Mount Comes First
Without mount namespace isolation, a container shares the host’s filesystem. It can see /etc/shadow, write to /var/log, modify /usr/bin — there’s no container at all, just a process with a different hostname. Mount namespaces give each container its own mount table: a private list of what filesystems are mounted where. Changes to one mount table don’t affect others.
pivot_root() vs chroot()
Both change what a process sees as /, but they’re fundamentally different:
- chroot() changes the apparent root directory, but the old root is still accessible through /proc, file descriptors, and other leaks. It's a path resolution trick, not real isolation.
- pivot_root() actually moves the root mount. The old root becomes a child mount that can be unmounted entirely. After pivot_root() + umount, the old filesystem is genuinely gone from the container's view.
Container runtimes generally use pivot_root() for this reason: it gives clean filesystem isolation that chroot() alone cannot.
OverlayFS: Layered Filesystems
Container images use OverlayFS to combine read-only image layers with a writable layer. The lower layers contain the base OS and application files (read-only, shared between containers). The upper layer captures all writes using copy-on-write: modifying a file copies it from lower to upper first. The merged view is what the container sees as its root filesystem.
This is how a 200 MB base image can be shared across 50 containers without using 10 GB of disk — each container only stores its own changes.
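The lookup and copy-on-write rules can be illustrated with a toy model. This is not how OverlayFS is implemented in the kernel; it is a sketch in which dictionaries stand in for directories, and the class name `OverlayModel` and its methods are invented for this example.

```python
class OverlayModel:
    """Toy model of an overlay mount: dicts map file paths to contents."""

    def __init__(self, lower_layers):
        self.lower_layers = lower_layers  # read-only layers, topmost first
        self.upper = {}                   # this container's writable layer
        self.whiteouts = set()            # paths deleted in this container

    def read(self, path):
        """Merged view: whiteouts hide files, upper wins over lower."""
        if path in self.whiteouts:
            raise FileNotFoundError(path)
        if path in self.upper:
            return self.upper[path]
        for layer in self.lower_layers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Modify a file; the first write copies it up from lower to upper."""
        if path not in self.upper:
            try:
                base = self.read(path)
            except FileNotFoundError:
                base = ""
            self.upper[path] = base       # copy-up
        self.whiteouts.discard(path)
        self.upper[path] += data

    def delete(self, path):
        """Hide a file with a whiteout; lower layers stay untouched."""
        self.read(path)                   # raises if the file is not visible
        self.upper.pop(path, None)
        self.whiteouts.add(path)


# Two containers share one read-only base layer.
base = {"/etc/os-release": "ID=debian\n", "/usr/bin/app": "v1"}
a = OverlayModel([base])
b = OverlayModel([base])
a.append("/etc/os-release", "PATCHED=1\n")
a.delete("/usr/bin/app")
```

After these calls, container `a` sees its patched file and no `/usr/bin/app`, while `b` and the shared `base` layer are unchanged. Only `a.upper` consumes extra space, which is why 50 containers on a 200 MB image do not need 50 × 200 MB = 10,000 MB.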
Mount Propagation
When a USB drive is mounted on the host, should containers see it? Mount propagation controls this:
| Type | Behavior | Use Case |
|---|---|---|
| private | No propagation in either direction | Default for containers |
| shared | Bidirectional — mounts appear in both namespaces | Host-container shared storage |
| slave | One-way: host mounts propagate to the container, not the reverse | Container follows host mount changes without affecting the host |
| unbindable | Cannot be bind-mounted at all | Security-sensitive paths |
Containers default to private propagation. This prevents a rogue container from mounting filesystems visible to the host.
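The propagation type of each mount can be read from the optional fields of /proc/[pid]/mountinfo: `shared:N` marks a shared mount, `master:N` marks a slave, `unbindable` is spelled out, and a mount with none of these tags is private. A small sketch of that classification follows; the function name `propagation` is an assumption of this example, and the sample lines are illustrative rather than taken from a real system.

```python
def propagation(mountinfo_line):
    """Classify one /proc/<pid>/mountinfo line by its optional fields.

    Fields 1-6 are fixed (IDs, device, root, mount point, options).
    Zero or more optional fields follow, terminated by a lone "-".
    A mount can be both shared and slave, so a set is returned.
    """
    fields = mountinfo_line.split()
    separator = fields.index("-", 6)
    tags = set()
    for item in fields[6:separator]:
        if item.startswith("shared:"):
            tags.add("shared")
        elif item.startswith("master:"):
            tags.add("slave")
        elif item == "unbindable":
            tags.add("unbindable")
    return tags or {"private"}


samples = [
    "22 27 0:21 / /sys rw,nosuid shared:7 - sysfs sysfs rw",
    "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw",
    "100 99 0:50 / /data rw - tmpfs tmpfs rw",
    "101 99 0:51 / /secret rw unbindable - tmpfs tmpfs rw",
]
for line in samples:
    print(line.split()[4], sorted(propagation(line)))
```

Running the same function over the lines of /proc/self/mountinfo inside a container is one way to check which propagation mode its mounts actually received.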
Figure (interactive in the original): Mount Namespace & Filesystem Isolation. How containers get their own filesystem view using mount namespaces, pivot_root, and OverlayFS. Phase 0: shared filesystem, where the host view and the container view show an identical mount table.
PID Namespace: Process Isolation
PID namespaces give each container its own process ID number space. The first process in a new PID namespace becomes PID 1 — the container’s init process. It can only see processes in its own namespace and its children’s namespaces. The host can see everything.
The same process has different PIDs depending on who’s looking. The container sees its init as PID 1, but the host might see it as PID 4521. The kernel maintains this dual mapping transparently.
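On kernels that provide it (Linux 4.1 and later), the `NSpid:` line in /proc/[pid]/status exposes this mapping: it lists the process's PID in each nested namespace, outermost first. The sketch below parses that line; the function name `parse_nspid` and the sample text are illustrative assumptions.

```python
def parse_nspid(status_text):
    """Return the PIDs from the 'NSpid:' line, outermost namespace first.

    An empty list means the field is absent (for example, an older kernel).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(token) for token in line.split()[1:]]
    return []


# Hypothetical excerpt of /proc/4521/status as read from the host.
sample_status = "Name:\tnginx\nPid:\t4521\nNSpid:\t4521\t1\n"
host_pid, container_pid = parse_nspid(sample_status)
print(f"host sees PID {host_pid}, container sees PID {container_pid}")
```

For a process that is not in a nested PID namespace, the list has a single entry equal to its ordinary PID.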
PID 1 in a namespace has special properties:
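- Signal handling: a signal sent from inside the namespace reaches PID 1 only if PID 1 has installed a handler for it, so an unhandled SIGTERM (or even SIGKILL) from another container process has no effect. A process in an ancestor namespace can still force SIGKILL or SIGSTOP.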
- Orphan adoption: when a parent process dies, children are re-parented to the namespace’s PID 1, not the host’s init
- Namespace lifetime: when PID 1 exits, the entire namespace is destroyed and all processes in it are killed
Figure (interactive in the original): PID Namespace Hierarchy. Processes have different PIDs depending on the namespace perspective. The host namespace shows the full process tree with host PIDs; the container namespace shows only container processes with container PIDs.
    # Demonstrate PID namespace
    $ unshare --pid --fork --mount-proc bash
    $ echo $$
    1                      # We are PID 1 inside the namespace
    $ ps aux
    USER  PID  COMMAND
    root    1  bash        # Only our processes visible
Network Namespace: Stack Isolation
Each network namespace gets a complete, independent network stack: its own interfaces, routing table, iptables rules, and socket port space. This is how multiple containers can all bind to port 80 without conflicting — each has its own port space.
The bridge between namespaces is the veth pair: a virtual ethernet cable with one end in each namespace. The host end connects to a bridge (like docker0), and the container end becomes the container’s eth0. Packets traverse: container eth0 → veth pair → bridge → host eth0 → internet.
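The wiring described above can be written out as a sequence of iproute2 commands. The sketch below only builds the command list and prints it; it does not execute anything, since running these commands needs root and an existing bridge. The function name `veth_bridge_plan`, the interface names, and the addresses (chosen to resemble a typical docker0 subnet) are assumptions for illustration.

```python
import shlex


def veth_bridge_plan(netns, bridge, container_ip, gateway_ip, prefix_len=16):
    """Return the `ip` commands that connect a named netns to a host bridge."""
    host_if = f"veth-{netns}"[:15]   # Linux interface names are at most 15 chars
    peer_if = f"vp-{netns}"[:15]
    inside = ["ip", "netns", "exec", netns]
    return [
        # 1. Create the namespace and the virtual cable.
        ["ip", "netns", "add", netns],
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", peer_if],
        # 2. Push one end into the namespace, attach the other to the bridge.
        ["ip", "link", "set", peer_if, "netns", netns],
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        # 3. Inside the namespace: rename to eth0, assign an address, bring up.
        inside + ["ip", "link", "set", peer_if, "name", "eth0"],
        inside + ["ip", "addr", "add", f"{container_ip}/{prefix_len}", "dev", "eth0"],
        inside + ["ip", "link", "set", "lo", "up"],
        inside + ["ip", "link", "set", "eth0", "up"],
        inside + ["ip", "route", "add", "default", "via", gateway_ip],
    ]


plan = veth_bridge_plan("demo", "docker0", "172.17.0.2", "172.17.0.1")
for command in plan:
    print(shlex.join(command))
```

Outbound traffic to the internet additionally relies on IP forwarding and NAT rules on the host, which a runtime such as Docker configures separately.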
Figure (interactive in the original): Network Namespace Isolation. How containers isolate network stacks using namespaces, veth pairs, and bridge networking.
Container Network Modes
- Bridge (default): Container gets its own network namespace with a veth pair to the host bridge. Provides NAT and port mapping.
- Host: Container shares the host's network namespace entirely. Maximum performance, zero isolation.
- None: Container only has a loopback interface. Used for security-sensitive workloads that don't need network access.
- Container: Share another container's network namespace. This is how Kubernetes pods work: all containers in a pod share one network namespace.
User Namespace: The Security Boundary
User namespaces map UIDs and GIDs between namespaces. Inside the container, the process runs as root (UID 0) with full capabilities, but those capabilities apply only to resources owned by its user namespace. On the host, that same process runs as an unprivileged user (e.g., UID 100000). If the process escapes the container, it lands on the host as that unprivileged UID, with no more rights than any other ordinary user.
This is the foundation of rootless containers. Without user namespaces, running a container requires host root privileges, which means a container escape is a full host compromise. With user namespaces, a process that escapes the container still holds only an unprivileged host UID, so it would need a further privilege escalation to take over the host.
    $ id
    uid=1000(alice) gid=1000(alice)
    $ unshare --user --map-root-user bash
    $ id
    uid=0(root) gid=0(root)    # Root inside the namespace!
    # But on the host, still alice with no extra privileges
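The mapping itself lives in /proc/[pid]/uid_map and /proc/[pid]/gid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. The following sketch translates an in-container ID to its host ID; the function names and the sample mappings are assumptions for illustration.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length)."""
    return [
        tuple(int(value) for value in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def to_host_id(inside_id, id_map):
    """Translate an ID seen inside the namespace to the ID the host sees."""
    for inside_start, outside_start, length in id_map:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None  # unmapped: appears as the overflow ID (usually 65534)


# A rootless-style mapping: container IDs 0..65535 -> host IDs 100000..165535.
rootless_map = parse_id_map("         0     100000      65536\n")
# What `unshare --user --map-root-user` sets up for a user with UID 1000.
single_map = parse_id_map("0 1000 1\n")

print(to_host_id(0, rootless_map))     # container root as seen by the host
print(to_host_id(1000, rootless_map))  # an ordinary container user
print(to_host_id(0, single_map))       # root in the namespace is still alice
```

The kernel applies this translation whenever it checks file ownership or permissions, which is why container root ends up with only the rights of the mapped host user.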
User namespaces are also the owner of other namespaces. When you create a PID or network namespace inside a user namespace, the user namespace’s UID mapping governs who can perform privileged operations within those child namespaces.
UTS & Cgroup Namespaces
UTS namespace (CLONE_NEWUTS) isolates hostname and domain name. It’s simple but essential — without it, hostname my-container inside a container would change the host’s hostname. The name “UTS” comes from the utsname struct in the kernel (Unix Timesharing System).
Cgroup namespace (CLONE_NEWCGROUP) hides the host’s cgroup hierarchy. Without it, a container could read /sys/fs/cgroup and see the host’s resource allocation — how much memory other containers have, CPU shares, everything. Cgroup namespaces make the container’s own cgroup appear as the root of the hierarchy.
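The effect is easy to observe in /proc/[pid]/cgroup, whose lines have the form `hierarchy-ID:controller-list:path`. Here is a small parsing sketch; the function name `parse_proc_cgroup` and the sample scope name are hypothetical, and the two sample strings show roughly what a process might report without and with its own cgroup namespace on a cgroup v2 system.

```python
def parse_proc_cgroup(text):
    """Map controller list -> cgroup path from /proc/<pid>/cgroup content.

    On cgroup v2 there is a single line with hierarchy ID 0 and an empty
    controller list, so the key is the empty string.
    """
    result = {}
    for line in text.splitlines():
        if not line:
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        result[controllers] = path
    return result


# Hypothetical view without a cgroup namespace: the host path leaks through.
without_ns = "0::/system.slice/docker-abc123.scope\n"
# The same process viewed from inside its own cgroup namespace.
with_ns = "0::/\n"

print(parse_proc_cgroup(without_ns)[""])
print(parse_proc_cgroup(with_ns)[""])
```

The path is always shown relative to the root of the reader's cgroup namespace, so a process inside the namespace sees `/` for its own cgroup.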
The Syscall Interface
Three syscalls manage namespace membership:
| Syscall | Usage | Description |
|---|---|---|
| clone() | Process creation | Create a child process in new namespace(s). Pass CLONE_NEW* flags. |
| unshare() | Current process | Move the calling process into new namespace(s). No new process created. |
| setns() | Join existing | Enter an existing namespace via its file descriptor from /proc/[pid]/ns/. |
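As an illustration of unshare(), the sketch below forks a child, moves that child into new user and UTS namespaces, and changes the hostname there. It calls the C library's `unshare` through ctypes and assumes Linux with unprivileged user namespaces enabled; where that is not permitted the function returns None. The function name `hostname_in_new_uts` is invented for this example, and the flag values are copied from <linux/sched.h>.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000   # values from <linux/sched.h>
CLONE_NEWUSER = 0x10000000


def hostname_in_new_uts(new_name):
    """Set a hostname inside fresh user+UTS namespaces in a child process.

    Returns the hostname the child observed, or None if the kernel or
    sandbox refused the unshare() or sethostname() call.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: everything here happens in the new namespaces only.
        os.close(read_fd)
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)
                os.write(write_fd, socket.gethostname().encode())
        except Exception:
            pass  # report failure by writing nothing
        finally:
            os._exit(0)
    # Parent: read whatever the child reported, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        seen = reader.read().decode()
    os.waitpid(pid, 0)
    return seen or None


before = socket.gethostname()
seen = hostname_in_new_uts("demo-container")
print("child saw:", seen, "| parent still sees:", socket.gethostname())
```

The user namespace is requested alongside the UTS namespace because creating most namespace types needs CAP_SYS_ADMIN, and a new user namespace grants that capability within its own scope. The fork keeps the calling process in its original namespaces.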
Namespace Commands Reference
- lsns: List all namespaces on the system with their type, PID, and command.
- unshare --pid --fork --mount-proc bash: Create new PID namespace and start a shell.
- nsenter --target 1234 --pid --net bash: Enter the PID and network namespaces of process 1234.
- ls -la /proc/$$/ns/: See which namespaces the current shell belongs to.
- ip netns add myns: Create a persistent network namespace.
Building a Container from Scratch
All the pieces come together when you build a container manually. Docker, containerd, and other runtimes automate this sequence, but underneath it’s the same steps: create namespaces, set up the filesystem, configure networking, apply limits, drop privileges, and exec the entrypoint.
Figure (interactive in the original): step-by-step container creation showing each syscall and mechanism.
The Starting Point: host namespaces. Every process belongs to the host's default namespaces. We can see all host processes, the host filesystem, and the host network.
Security: Isolation ≠ Security
Namespaces provide isolation — they control what a process can see. But isolation alone isn’t security. A determined attacker who can exploit a kernel vulnerability has access to the shared kernel, and namespaces can’t prevent that.
Complete container security requires multiple layers:
- Namespaces — Resource isolation (what you can see)
- Cgroups — Resource limits (how much you can use)
- Seccomp — Syscall filtering (what you can do)
- Capabilities — Privilege restriction (which root powers you have)
- AppArmor/SELinux — Mandatory access control (file and network access policies)
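The capabilities layer can be inspected directly: the `CapEff:` line in /proc/[pid]/status is a hexadecimal bitmask of the capabilities a process currently holds. Below is a minimal decoding sketch. The bit numbers come from <linux/capability.h>, the table is deliberately partial, and the sample mask is one commonly reported for default Docker containers, which should be treated as an illustrative assumption rather than a guarantee about any particular runtime.

```python
# Partial table of capability bit numbers from <linux/capability.h>.
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
    27: "CAP_MKNOD",
}


def decode_caps(hex_mask):
    """Turn a CapEff-style hex string into a list of capability names."""
    mask = int(hex_mask, 16)
    return [
        CAP_NAMES.get(bit, f"cap_bit_{bit}")
        for bit in range(mask.bit_length())
        if (mask >> bit) & 1
    ]


# Assumed sample: a restricted, default-style container mask.
restricted = decode_caps("00000000a80425fb")
print(len(restricted), "capabilities;",
      "CAP_SYS_ADMIN held:", "CAP_SYS_ADMIN" in restricted)
```

A restricted mask like this one lacks CAP_SYS_ADMIN and CAP_NET_ADMIN, so even a process running as UID 0 cannot mount filesystems or reconfigure interfaces.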
Running containers with --privileged disables most of these protections. It gives the container nearly full host access and should only be used when absolutely necessary (and never in production with untrusted workloads).
Key Takeaways
Mount namespace is the bedrock — without filesystem isolation, nothing else matters. It was the first namespace type for a reason.
Containers are just processes in separate namespaces with cgroup limits and security restrictions. No hypervisor, no separate kernel.
User namespaces enable rootless containers: mapping container root to an unprivileged host UID greatly limits what an escaped process can do.
PID namespaces create private process trees with their own PID 1 that handles orphan adoption and signal delivery.
Network namespaces + veth pairs give containers isolated network stacks connected to the host via virtual bridges.
clone(), unshare(), and setns() are the three syscalls for creating namespaces along with a new process, creating them for the current process, and joining existing ones.
OverlayFS + pivot_root() give containers their own root filesystem with copy-on-write efficiency.
Namespaces isolate but don’t secure — real container security needs cgroups, seccomp, capabilities, and MAC policies.
Related Concepts
- cgroups: Resource limits for processes (CPU, memory, I/O)
- Containers Under the Hood: How namespaces + cgroups combine to create containers
- Process Management: Understanding fork, exec, and process trees
- Kernel Architecture: How the kernel manages these abstractions
