The Illusion of Isolation
When you run a Docker container, it feels like a lightweight virtual machine - it has its own hostname, its own process tree starting from PID 1, its own network interfaces, and its own filesystem. But there's no hypervisor, no separate kernel. How does Linux create this illusion?
The answer is namespaces - a kernel feature that partitions system resources so that different processes see different views of the system.
Analogy: The Truman Show
Think of namespaces like the town in The Truman Show. Truman believes he lives in a normal world, but everything he sees - the sky, the buildings, the people - is actually a constructed set. The "real world" exists outside, but Truman can't see or interact with it.
Similarly, a process in a namespace sees a constructed view of system resources. It believes it's PID 1 with full control, but the "real" system exists in the parent namespace.
The Seven Namespace Types
Linux has seven long-established namespace types (mount, UTS, IPC, PID, network, user, and cgroup), each isolating a different aspect of the system. Kernel 5.6 added an eighth, the time namespace, which is not covered here.
Process ID Namespace
Isolates the process ID number space. Processes in different PID namespaces can have the same PID.
What gets isolated:
- Process ID numbering (PIDs start from 1 inside)
- Process visibility (/proc filesystem view)
- Signal delivery between namespaces
- Parent-child process relationships
Key insight: Namespaces don't provide resource limits (that's cgroups) or security (that's capabilities/seccomp). They only provide isolation - making resources invisible between namespace boundaries.
How Namespaces Work
The Kernel's Perspective
Every process in Linux has a task_struct containing pointers to its namespace memberships. When a process makes a syscall that depends on isolated resources (like listing PIDs or network interfaces), the kernel consults these pointers to determine what the process should see.
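```c
// Simplified view of task_struct namespace membership
struct task_struct {
    // ... other fields ...
    struct nsproxy *nsproxy;      // Points to the task's namespace set
    const struct cred *cred;      // Credentials; cred->user_ns is the user namespace
};

struct nsproxy {
    struct uts_namespace    *uts_ns;
    struct ipc_namespace    *ipc_ns;
    struct mnt_namespace    *mnt_ns;
    struct pid_namespace    *pid_ns_for_children;  // PID namespace for new children
    struct net              *net_ns;
    struct cgroup_namespace *cgroup_ns;
    // The user namespace is not stored here; it is reached via task->cred->user_ns
};
```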
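From user space, the namespace set the kernel tracks for a process is visible as symbolic links under /proc/[pid]/ns/. The following is a minimal Python sketch, assuming a Linux host with /proc mounted. The helper names parse_ns_link and namespaces_of are illustrative and not part of any library.

```python
import os
import re

# Link targets look like "pid:[4026531836]": namespace type, then inode number.
_NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/[pid]/ns/* link target into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("type"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Return {entry name: inode} for a process, or {} if /proc/[pid]/ns is unreadable."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # not Linux, no such process, or no permission
    for name in entries:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or do not recognise
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in sorted(namespaces_of().items()):
        print(f"{name:<20} {inode}")
```

Two processes that report the same inode for an entry are members of the same namespace of that type.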
Creating Namespaces
There are three ways for processes to enter namespaces:
| Syscall | Usage | Description |
|---|---|---|
| clone() | Process creation | Create child in new namespace(s) |
| unshare() | Current process | Move calling process to new namespace(s); for a PID namespace, only its subsequent children enter it |
| setns() | Existing namespace | Join an existing namespace by fd |
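Each of these syscalls selects namespace types through CLONE_NEW* flag bits. As a rough sketch of how the flags combine, the Python below ORs them together and wraps unshare(2) through ctypes. The flag values are those defined in the kernel's sched.h header; the dictionary keys and the function names flags_for and unshare are assumptions made for this example. Calling unshare usually requires root or a user namespace, so the demo only prints a flag mask.

```python
import ctypes
import os

# CLONE_NEW* flag values from <linux/sched.h>.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def flags_for(ns_types):
    """OR together the CLONE_NEW* flags for the given namespace type names."""
    flags = 0
    for ns_type in ns_types:
        flags |= CLONE_FLAGS[ns_type]
    return flags


def unshare(ns_types):
    """Call unshare(2) through libc; raises OSError on failure (for example EPERM)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags_for(ns_types)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    # Only build the mask here; actually unsharing needs privileges.
    print(hex(flags_for(["uts", "pid"])))
```

With sufficient privileges, unshare(["uts"]) followed by a hostname change would affect only the calling process and its children.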
Namespace Commands

```bash
# List all namespaces on the system
lsns

# Create new namespace and run command
unshare --pid --fork --mount-proc bash

# Enter existing namespace (requires namespace fd or PID)
nsenter --target 1234 --pid --net bash

# See what namespaces a process belongs to
ls -la /proc/$$/ns/

# Create persistent namespace
ip netns add mynetns
ip netns exec mynetns ip addr
```
PID Namespace Deep Dive
The PID namespace is perhaps the most iconic - it's what makes container processes appear to start from PID 1. The same process has a different PID depending on which namespace observes it.
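Why PID 1 matters: In a PID namespace, PID 1 is special. A signal sent from inside the namespace reaches it only if it has installed a handler for that signal, so it cannot be killed by accident; only SIGKILL and SIGSTOP sent from an ancestor namespace are delivered unconditionally. Orphaned processes are re-parented to it, and if it exits, the kernel kills every other process in the namespace. This is why containers need a proper init process!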
Key PID Namespace Properties
- Hierarchical structure: PID namespaces form a tree. Processes in a parent namespace can see processes in all descendant namespaces, but not vice versa.
- PID translation: Each process has a PID in its own namespace and in every ancestor namespace up to the root. The kernel maintains this mapping.
- Init process (PID 1): The first process in a PID namespace becomes its init. It has special signal handling: only signals for which it has installed handlers are delivered to it (except SIGKILL and SIGSTOP sent from an ancestor namespace).
- Orphan adoption: When a parent dies, its children are re-parented to the namespace's init process (PID 1), not the host's init.
```bash
# Demonstrate PID namespace (requires root)
$ unshare --pid --fork --mount-proc bash
# Now inside new PID namespace
$ echo $$
1                  # We are PID 1!
$ ps aux
USER  PID  COMMAND
root    1  bash    # Only our processes visible
```
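The hierarchy and PID translation rules can be illustrated with a toy model. This is not how the kernel stores PIDs; it is a small Python sketch in which every class and method name (PidNamespace, Process, pid_seen_from) is invented for the example. It only shows that a process receives one PID per namespace level and is invisible from namespaces outside its ancestry.

```python
import itertools


class PidNamespace:
    """Toy PID namespace: a node in a tree with its own PID counter."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self._next_pid = itertools.count(1)  # PIDs start at 1 in every namespace

    def ancestry(self):
        """Yield this namespace, then each ancestor up to the root."""
        ns = self
        while ns is not None:
            yield ns
            ns = ns.parent

    def allocate(self):
        """Hand out the next free PID in this namespace."""
        return next(self._next_pid)


class Process:
    """A process holds one PID per namespace, from its own up to the root."""

    def __init__(self, name, ns):
        self.name = name
        self.ns = ns
        self.pids = {level: level.allocate() for level in ns.ancestry()}

    def pid_seen_from(self, observer_ns):
        """PID as seen from observer_ns, or None if the process is invisible there."""
        return self.pids.get(observer_ns)


if __name__ == "__main__":
    host = PidNamespace("host")
    systemd = Process("systemd", host)          # first process on the host
    container = PidNamespace("container", parent=host)
    app = Process("app", container)             # first process in the container
    for proc in (systemd, app):
        print(proc.name, proc.pid_seen_from(host), proc.pid_seen_from(container))
```

In the demo, the app process is PID 1 from inside the container while the host sees it under a different, host-level PID; the host's own init is not visible from the container at all.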
Network Namespaces
Network namespaces provide isolated network stacks. This is how containers get their own IP addresses and port bindings. A container-like topology is built from namespaces, a bridge, and veth pairs that connect each namespace to the bridge.
How Docker does it: Each container gets its own network namespace. A bridge (docker0) connects containers via veth pairs. Port mapping uses iptables NAT rules to forward traffic from the host to the container's namespace.
Container Networking Patterns
Docker and other container runtimes use network namespaces in several configurations:
| Mode | Description | Use Case |
|---|---|---|
| Bridge | Container connects to bridge via veth | Default, provides NAT |
| Host | Container shares host's network namespace | Maximum performance |
| None | Only loopback interface | Security isolation |
| Container | Share another container's network namespace | Pod-like sharing (Kubernetes) |
Mount Namespace: Filesystem Views
The mount namespace was the first namespace type (2002), originally just called "namespace". It isolates the mount table, allowing different processes to see different filesystem hierarchies.
Key Use Cases
- Container root filesystem: Each container has its own `/` using pivot_root() or chroot()
- Bind mounts: Mount host directories into container without affecting host view
- tmpfs for /tmp: Give each container isolated temporary storage
- Hiding sensitive paths: Don't mount /etc/shadow into containers
Mount Propagation
Mounts can be configured to propagate (or not) between namespaces:
| Type | Behavior |
|---|---|
| shared | Mounts propagate bidirectionally |
| slave | Mounts propagate from master to slave only |
| private | No propagation |
| unbindable | Cannot be bind-mounted |
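The propagation type of each mount can be read from the optional fields of /proc/[pid]/mountinfo, where shared:N marks a shared mount, master:N marks a slave, and no tag means private. The sketch below is one way to classify a line, assuming the field layout documented in proc(5); the function name propagation_of and the "shared+slave" label for mounts carrying both tags are choices made for this example.

```python
def propagation_of(mountinfo_line):
    """Return (mount_point, propagation) for one /proc/[pid]/mountinfo line."""
    fields = mountinfo_line.split()
    separator = fields.index("-", 6)       # a lone "-" ends the optional fields
    mount_point = fields[4]
    optional = fields[6:separator]         # e.g. ["shared:1"], ["master:2"], []
    tags = {field.split(":", 1)[0] for field in optional}
    if "unbindable" in tags:
        return mount_point, "unbindable"
    if "shared" in tags and "master" in tags:
        return mount_point, "shared+slave"
    if "shared" in tags:
        return mount_point, "shared"
    if "master" in tags:
        return mount_point, "slave"
    return mount_point, "private"


if __name__ == "__main__":
    try:
        with open("/proc/self/mountinfo") as handle:
            for line in handle:
                print(*propagation_of(line))
    except OSError:
        print("/proc/self/mountinfo is not available on this system")
```

Running it inside and outside a container is a quick way to compare how the runtime configured propagation for each mount.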
User Namespace: Unprivileged Containers
User namespaces became fully usable in kernel 3.8 and are arguably the most powerful namespace type. They map UIDs/GIDs between namespaces, enabling rootless containers.
The Magic of User Namespaces
Inside the namespace: Process runs as root (UID 0) with full capabilities over resources owned by that namespace
On the host: Process runs as an unprivileged user (e.g., UID 100000)
Even if the container process escapes, it holds only that unprivileged user's privileges on the host!
```bash
# Run as root inside the namespace, as an unprivileged user outside
$ id
uid=1000(alice) gid=1000(alice)
$ unshare --user --map-root-user bash
$ id
uid=0(root) gid=0(root)
# But on the host, still alice!
```
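The mapping itself lives in /proc/[pid]/uid_map and /proc/[pid]/gid_map, where each line holds three numbers: the first ID inside the namespace, the first ID in the parent namespace, and the length of the range. The Python below is a small sketch of the translation arithmetic under that format; parse_id_map and to_outside are names made up for this example.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(part) for part in line.split())
            entries.append((inside, outside, length))
    return entries


def to_outside(inside_id, id_map):
    """Translate an ID inside the namespace to the parent's ID, or None if unmapped."""
    for inside_start, outside_start, length in id_map:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/uid_map") as handle:
            current_map = parse_id_map(handle.read())
    except OSError:
        current_map = []  # not Linux, or /proc not mounted
    print("uid_map entries:", current_map)
    print("UID 0 maps to:", to_outside(0, current_map))
```

For a map of "0 100000 65536", root inside the namespace corresponds to UID 100000 in the parent, and any ID at or above 65536 has no mapping at all.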
Namespace Lifecycle
Creation
Namespaces are created implicitly when the first process enters them (via clone() or unshare()). They're reference-counted kernel objects.
Persistence
A namespace persists as long as at least one of the following holds:
- At least one process is a member
- A bind mount of /proc/[pid]/ns/[type] holds a reference
- An open file descriptor refers to it
Destruction
When the last reference is released, the namespace and its resources are cleaned up. For network namespaces, this means all virtual interfaces are destroyed and any physical interfaces are moved back to the initial network namespace.
```bash
# Create persistent network namespace
ip netns add myns
# Creates /var/run/netns/myns bind mount
# Namespace persists even with no processes
ip netns exec myns ip link   # lo only
# Delete when done
ip netns del myns
```
Security Considerations
Important Security Notes
Namespaces provide isolation, not security. They hide resources but don't prevent access if the boundary is breached.
Complete container security requires multiple layers:
- Namespaces (isolation)
- cgroups (resource limits)
- Seccomp (syscall filtering)
- Capabilities (privilege restriction)
- SELinux/AppArmor (mandatory access control)
Common Escape Vectors
- Shared kernel: All containers share the host kernel - kernel exploits affect everyone
- Privileged containers: --privileged disables most isolation
- Sensitive mounts: Mounting /proc, /sys, or device files can provide escape paths
- CAP_SYS_ADMIN: This capability enables many namespace-breaking operations
Practical Examples
Manual Container-like Isolation
```bash
# Create all namespaces except user (requires root)
unshare --pid --net --mount --uts --ipc --fork bash

# Set hostname (UTS namespace)
hostname my-container

# Mount new proc (after PID namespace)
mount -t proc proc /proc

# Now we have basic container-like isolation!
```
Inspecting Container Namespaces
```bash
# Find container's init process
docker inspect --format '{{.State.Pid}}' mycontainer
# Returns: 12345

# List its namespaces
ls -la /proc/12345/ns/
# lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026532583]'
# lrwxrwxrwx 1 root root 0 ... ipc -> 'ipc:[4026532517]'
# ...

# Enter the container's namespaces manually
nsenter --target 12345 --all bash
```
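To check which namespaces a container actually shares with the host (for example in host network mode), the namespace files of two processes can be compared: two processes are in the same namespace when the device and inode numbers of their /proc/[pid]/ns entries match. The following Python sketch does that comparison; ns_identity and shared_namespaces are hypothetical helper names, and reading another user's process usually requires root.

```python
import os

NS_TYPES = ("cgroup", "ipc", "mnt", "net", "pid", "user", "uts")


def ns_identity(pid, ns_type):
    """Return (device, inode) identifying a process's namespace, or None if unreadable."""
    try:
        info = os.stat(f"/proc/{pid}/ns/{ns_type}")
    except OSError:
        return None  # no such process, no permission, or not Linux
    return info.st_dev, info.st_ino


def shared_namespaces(pid_a, pid_b, ns_types=NS_TYPES):
    """List the namespace types of which both processes are members."""
    shared = []
    for ns_type in ns_types:
        a = ns_identity(pid_a, ns_type)
        b = ns_identity(pid_b, ns_type)
        if a is not None and a == b:
            shared.append(ns_type)
    return shared


if __name__ == "__main__":
    # Compare this process with its parent; both normally share every namespace.
    print(shared_namespaces(os.getpid(), os.getppid()))
```

Passing the PID returned by docker inspect as one argument and 1 as the other would show which namespace types the container shares with the host's init.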
Related Concepts
- cgroups: Resource limits for processes (CPU, memory, I/O)
- Containers Under the Hood: How namespaces + cgroups combine to create containers
- Process Management: Understanding fork, exec, and process trees
- Kernel Architecture: How the kernel manages these abstractions
