Linux Kernel Architecture: How Your OS Actually Works

Linux kernel architecture explained. Learn syscalls, protection rings, user vs kernel space, and what happens when you run a command.


The Invisible City That Runs Your Computer

Every time you press a key, open a file, or browse the web, you're relying on the Linux kernel—a 30+ million line program that most users never think about. It's like a city's infrastructure: invisible when working perfectly, essential always.

What happens when you type ls?

  1. Your shell (user space) asks the kernel for directory contents
  2. The kernel checks if you're allowed to read that directory
  3. It asks the filesystem layer for the data
  4. The filesystem asks the disk driver for the actual bytes
  5. Data flows back up through each layer to your terminal

All of this happens in microseconds, thousands of times per second. The kernel orchestrates this dance between your programs and your hardware, ensuring that each program gets fair access to resources, security boundaries are enforced, and the whole system doesn't collapse when one program misbehaves.
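
To make step 1 concrete, here is a minimal C sketch of what a tool like ls does in user space: it calls the glibc wrappers opendir() and readdir(), which in turn issue openat and getdents64 syscalls to the kernel.

    /* list_dir.c - a minimal sketch of step 1: asking the kernel for
     * directory contents via glibc, which issues openat/getdents64 syscalls. */
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : ".";

        DIR *dir = opendir(path);               /* openat() under the hood */
        if (!dir) {
            perror("opendir");
            return 1;
        }

        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL)  /* getdents64() in batches */
            printf("%s\n", entry->d_name);

        closedir(dir);                          /* close() syscall */
        return 0;
    }

Running it under strace shows the openat, getdents64, write, and close syscalls crossing the user/kernel boundary.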

Why Understanding This Matters

If you've ever wondered:

  • Why does my program slow down when another process is busy?
  • Why can't my app directly read from disk?
  • What actually happens during a "segfault"?
  • Why does Docker use "namespaces" and "cgroups"?

The answers all live in kernel architecture. Understanding these concepts transforms debugging from guesswork into systematic investigation.


The Big Picture: How Linux Is Organized

The Great Debate: Monolithic vs Microkernel

Before we dive into Linux's design, let's understand why it's built this way.

The core problem: How do you organize millions of lines of operating system code?

The microkernel approach (used by Minix, QNX, some embedded systems):

  • Keep the kernel minimal: just IPC, basic scheduling, and memory management
  • Run everything else (filesystems, drivers, networking) as user-space services
  • Services communicate via message passing

The upside: Clean separation, easier to reason about, one buggy driver can't crash the kernel. The downside: Message passing is slow. Every filesystem operation requires multiple context switches.

Linux's monolithic approach:

  • Everything runs in kernel space with shared memory
  • Components call each other directly, like functions in a single program
  • Direct hardware access, no message passing overhead

The upside: Fast. Really fast. No IPC overhead for internal operations. The downside: Complex. A bug in a driver can take down the entire system.

Linux's clever compromise: Loadable kernel modules (.ko files) give you modularity without the performance hit. You can add or remove drivers at runtime, but they still run in kernel space with full privileges.
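
To give a feel for what a module looks like, here is the classic hello-world module sketch (the hello_init/hello_exit names are arbitrary). Compiled against the kernel headers it becomes hello.ko, which insmod or modprobe can load at runtime and rmmod can remove.

    /* hello.c - minimal loadable kernel module sketch.
     * Build against the kernel headers to get hello.ko, then load with insmod. */
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal example module");

    static int __init hello_init(void)
    {
        pr_info("hello: loaded into kernel space\n");
        return 0;                       /* 0 = successful initialization */
    }

    static void __exit hello_exit(void)
    {
        pr_info("hello: unloaded\n");
    }

    module_init(hello_init);            /* called on insmod/modprobe */
    module_exit(hello_exit);            /* called on rmmod */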

The two designs at a glance:

  Monolithic (Linux)
  • Kernel space: VFS → ext4 → block layer → NVMe driver, alongside the scheduler, memory management, and the network stack
  • User space: application → glibc
  • ✓ Fast: direct function calls
  • ⚠ Risk: a driver bug can crash the kernel

  Microkernel (Minix, QNX)
  • Kernel space: IPC and a minimal scheduler only
  • User space: app → FS server → IPC → block server → IPC → driver, plus network and memory servers
  • ✓ Safe: isolated servers, fault-tolerant
  • ⚠ Slow: IPC overhead on every operation

Protection Rings: Security Clearances for Code

Think of protection rings like security clearances in a government building:

Ring 3 (User Mode) — "Public Access"

  • This is where your applications run: browsers, editors, games
  • Can't touch hardware directly—must ask the kernel for everything
  • If your program crashes, only your program dies
  • Like a visitor who can only access the lobby

Ring 0 (Kernel Mode) — "Top Secret Clearance"

  • Full hardware access: can execute any CPU instruction
  • Direct memory manipulation, device control
  • If kernel code crashes, the whole system goes down
  • Like a security officer with keys to every room

The syscall is the checkpoint: When your program needs something privileged (reading a file, sending network data), it makes a system call. This triggers a mode switch from Ring 3 to Ring 0—the CPU literally changes its operating mode. The kernel performs the operation and returns control to your program.

x86 provides four rings (0-3), but Linux only uses two. Rings 1 and 2 were intended for device drivers but are unused—Linux keeps things simple with just "user" and "kernel."
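
To see the checkpoint in action, you can skip the friendly glibc wrappers and issue a system call yourself via the syscall(2) helper. A minimal sketch:

    /* raw_syscall.c - invoking a system call explicitly.
     * syscall() places the number and arguments in registers and executes
     * the syscall instruction, switching the CPU from Ring 3 to Ring 0. */
    #define _GNU_SOURCE
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "hello from ring 3\n";

        /* SYS_write is 1 on x86-64; fd 1 is stdout. */
        long written = syscall(SYS_write, 1, msg, sizeof msg - 1);

        return written == (long)(sizeof msg - 1) ? 0 : 1;
    }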


The Subsystems: What Each Layer Does

Rather than listing features, let's understand what problem each subsystem solves.

Process Management: Sharing One CPU Among Many

The problem: You have 8 CPU cores but 200 running processes. How do you give each one the illusion of having its own processor?

The solution: The scheduler rapidly switches between processes (typically every 1-10 milliseconds), saving and restoring their state so quickly that each process thinks it's running continuously.

Every process is tracked by a task_struct—a data structure containing everything the kernel needs to know: process ID, memory mappings, open files, scheduling priority, and more. When the scheduler decides it's time to switch, it saves the current process's registers to its task_struct and loads another's.

Why it matters: Understanding scheduling explains why CPU-bound tasks slow each other down, why nice values affect performance, and why real-time systems need different schedulers.
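
As a small illustration of nice values, this sketch raises its own nice value before doing CPU-bound work, so the scheduler favors other runnable processes when the CPU is contended:

    /* be_nice.c - sketch of adjusting scheduling priority (nice value).
     * A higher nice value means lower priority under CPU contention. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
        /* Raise our nice value to 10: be more polite to other processes. */
        if (setpriority(PRIO_PROCESS, 0, 10) != 0) {
            perror("setpriority");
            return 1;
        }
        printf("pid %d now runs at nice %d\n",
               getpid(), getpriority(PRIO_PROCESS, 0));

        /* CPU-bound busy loop: the scheduler preempts this regularly. */
        for (volatile unsigned long i = 0; i < 2000000000UL; i++)
            ;
        return 0;
    }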

See Process Management for the complete picture of fork, exec, and scheduling algorithms.

Memory Management: Giving Each Process Its Own Universe

The problem: Multiple processes all want memory, but they can't be allowed to read or corrupt each other's data. And you might have less physical RAM than processes request.

The solution: Virtual memory. Each process gets its own address space—a private view of memory where address 0x400000 in one process maps to completely different physical RAM than the same address in another process.

The kernel maintains page tables that translate virtual addresses to physical addresses. When a process accesses memory, the CPU's MMU (Memory Management Unit) performs this translation in hardware. If a process tries to access memory it shouldn't, the MMU triggers a fault, and the kernel terminates the offender (the infamous "segmentation fault").

Key insight: Virtual memory enables overcommitment. You can run programs that collectively "use" more memory than you physically have, because the kernel can swap unused pages to disk and bring them back when needed.
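
A quick way to observe private address spaces is to fork: parent and child print the same virtual address for a variable, yet a write in one is invisible to the other because the kernel maps that address to different physical pages (copy-on-write). A minimal sketch:

    /* private_memory.c - same virtual address, different physical pages.
     * After fork(), parent and child share the address numerically, but
     * copy-on-write gives each its own physical copy once either writes. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int value = 1;

        pid_t pid = fork();
        if (pid == 0) {                 /* child */
            value = 42;                 /* triggers copy-on-write */
            printf("child : &value=%p value=%d\n", (void *)&value, value);
            return 0;
        }

        wait(NULL);                     /* let the child finish first */
        printf("parent: &value=%p value=%d\n", (void *)&value, value);
        return 0;                       /* parent still sees value == 1 */
    }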

See Memory Management for paging, TLB, and memory zones.

Virtual File System: The Universal Translator

The problem: There are dozens of filesystems—ext4, XFS, NTFS, NFS, tmpfs, procfs—but applications just want to open() a file. How do you support all of them without rewriting every program?

The solution: VFS provides a unified interface. Applications call generic functions (open, read, write), and VFS translates these into filesystem-specific operations.

VFS maintains four key data structures:

  • Superblock: Filesystem-wide metadata (block size, free space, mount options)
  • Inode: Per-file metadata (permissions, size, block locations)
  • Dentry: Directory entry cache (maps names like "config.txt" to inodes)
  • File: Open file state (current position, access mode)

When you open("/etc/passwd"), VFS walks the dentry cache, finds the inode, and creates a file structure. When you read(), VFS calls the filesystem's read implementation (e.g., ext4's block-reading code).

Why this matters: VFS is why cat /proc/cpuinfo works the same as cat /etc/passwd. The procfs filesystem generates "file" contents dynamically, but your tools don't need to know that.
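
The sketch below reads /etc/passwd and /proc/cpuinfo with the identical open/read/close sequence; VFS dispatches the same calls to a disk filesystem in one case and to procfs's generated content in the other:

    /* vfs_demo.c - one code path, two very different filesystems.
     * VFS routes the same open/read calls to ext4 and to procfs. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void dump_first_line(const char *path)
    {
        char buf[256];
        int fd = open(path, O_RDONLY);              /* VFS: dentry walk -> inode */
        if (fd < 0) {
            perror(path);
            return;
        }
        ssize_t n = read(fd, buf, sizeof buf - 1);  /* filesystem's read op */
        if (n > 0) {
            buf[n] = '\0';
            printf("%s: %.*s\n", path, (int)strcspn(buf, "\n"), buf);
        }
        close(fd);
    }

    int main(void)
    {
        dump_first_line("/etc/passwd");     /* backed by a disk filesystem */
        dump_first_line("/proc/cpuinfo");   /* generated on the fly by procfs */
        return 0;
    }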

See Filesystems Overview for more on how this abstraction works.

Network Stack: TCP/IP in the Kernel

The problem: Network protocols are complex state machines. TCP alone handles connection setup, reliable delivery, congestion control, and teardown. Should every application implement this?

The solution: The kernel implements the entire TCP/IP stack. Applications just send and receive bytes through sockets; the kernel handles everything else.

The stack follows the familiar layers:

  • Socket layer: Application interface (send, recv, connect)
  • Transport layer: TCP, UDP, SCTP protocol implementation
  • Network layer: IP routing, ICMP, packet forwarding
  • Link layer: Ethernet frames, ARP, driver interface

Netfilter provides hooks throughout the stack where packet filtering (iptables/nftables), NAT, and connection tracking plug in.
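
Here is a minimal TCP client sketch showing the division of labor: the application only calls socket, connect, and send, while the kernel handles the handshake, segmentation, and routing (the 127.0.0.1:8080 endpoint is just a placeholder):

    /* tcp_client.c - minimal socket sketch: the kernel does the TCP work.
     * The address and port below are placeholders for illustration. */
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);   /* socket layer */
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(8080),
        };
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        /* The kernel performs the TCP three-way handshake here. */
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("connect");
            return 1;
        }

        const char msg[] = "hello\n";
        send(fd, msg, sizeof msg - 1, 0);   /* TCP segments, IP packets, frames */
        close(fd);
        return 0;
    }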

See Networking Stack for the complete journey of a packet.

Device Drivers: Translating Hardware Diversity

The problem: Thousands of hardware devices exist, each with different registers, protocols, and quirks. How does the kernel support them all without becoming unmaintainable?

The solution: The driver model. Each driver translates between the kernel's generic interface and device-specific operations.

Driver types:

  • Character devices: Byte streams—serial ports, terminals, /dev/random
  • Block devices: Random access to fixed-size blocks—SSDs, HDDs
  • Network devices: Packet-oriented interfaces—eth0, wlan0

Drivers register with the kernel, saying "I can handle devices with this ID." When hardware is detected, the kernel matches it to a driver and calls initialization functions. From then on, the driver handles all communication with that device.

Loadable modules (.ko files) allow adding drivers without recompiling the kernel. Run lsmod to see what's currently loaded; modprobe to load new ones.
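
A character device looks like an ordinary file from user space. This sketch reads a few bytes from /dev/urandom; behind the open/read calls, the kernel routes the request to the random-number driver rather than to a filesystem:

    /* chardev_read.c - reading from a character device (/dev/urandom).
     * The familiar open/read interface is served by a driver here. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char bytes[8];

        int fd = open("/dev/urandom", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        if (read(fd, bytes, sizeof bytes) != sizeof bytes) {
            perror("read");
            close(fd);
            return 1;
        }
        close(fd);

        for (size_t i = 0; i < sizeof bytes; i++)
            printf("%02x", bytes[i]);
        printf("\n");
        return 0;
    }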


A Syscall's Journey: Following read() Through the Kernel

Let's trace what happens when your program calls read(fd, buffer, 4096):

1. User Space (Ring 3): Your C program calls read(). The C library (glibc) prepares the syscall: it puts the syscall number (0 for read on x86-64), file descriptor, buffer address, and count into specific CPU registers.

2. The Syscall Instruction: glibc executes the syscall instruction. This is the magic moment—the CPU:

  • Saves the current instruction pointer and stack pointer
  • Switches from Ring 3 to Ring 0
  • Jumps to the kernel's syscall entry point

This mode switch costs roughly 100-1000 CPU cycles. It's fast, but not free.

3. Kernel Entry: The kernel's syscall handler looks up the syscall number in a table and calls sys_read(). It validates arguments: Is this file descriptor valid? Is the buffer address accessible?

4. VFS Layer: VFS finds the file structure for this descriptor and calls the filesystem's read operation. For ext4, this means translating the file offset to block numbers.

5. Block Layer: The block layer checks if the requested data is already in the page cache (RAM). If yes, it copies directly to the user's buffer. If not, it schedules I/O.

6. Device Driver: For a cache miss, the driver sends a read command to the hardware controller. The request may complete immediately (fast SSD) or take milliseconds (spinning disk). The process is put to sleep while waiting.

7. Return Path: When data arrives, it flows back up: driver → block layer → page cache → VFS → sys_read → syscall return. The syscall instruction's partner, sysret, switches back to Ring 3, restoring the user program's state.

8. Back in User Space: read() returns with data in your buffer, completely unaware of the journey it just took.

Performance insight: This is why buffered I/O matters. Reading 1 byte 1000 times means 1000 mode switches. Reading 1000 bytes once means just one. The stdio library's buffering exists precisely to amortize syscall overhead.
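
The sketch below makes the contrast concrete: the first loop performs one read() syscall per byte, while the second lets stdio pull a whole block into a user-space buffer and issue far fewer syscalls:

    /* buffering.c - why batching I/O matters.
     * First loop: one syscall per byte. Second: stdio buffers a block per syscall. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char c;

        /* Unbuffered: each read() is a full Ring 3 -> Ring 0 round trip. */
        int fd = open("/etc/passwd", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        long slow_bytes = 0;
        while (read(fd, &c, 1) == 1)
            slow_bytes++;
        close(fd);

        /* Buffered: fgetc() usually hits stdio's user-space buffer;
         * the kernel only sees occasional large read() calls. */
        FILE *f = fopen("/etc/passwd", "r");
        if (!f) { perror("fopen"); return 1; }
        long fast_bytes = 0;
        while (fgetc(f) != EOF)
            fast_bytes++;
        fclose(f);

        printf("unbuffered: %ld bytes, buffered: %ld bytes (same data, far fewer syscalls)\n",
               slow_bytes, fast_bytes);
        return 0;
    }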


Modern Kernel Features

cgroups: Resource Limits and Accounting

cgroups (control groups) let you limit how much CPU, memory, and I/O a group of processes can use. Docker containers? They're cgroups + namespaces.

    # Limit a process group to 50% of one CPU core
    echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
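
Because cgroups are exposed as a filesystem, the same knob can be turned programmatically. This sketch assumes a v1 cpu controller mounted at /sys/fs/cgroup/cpu and an existing group named mygroup:

    /* cgroup_limit.c - cgroups are configured through ordinary file writes.
     * Assumes a v1 cpu controller and an existing group "mygroup". */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path  = "/sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us";
        const char *quota = "50000";    /* 50% of one core (period 100000) */

        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror(path); return 1; }

        if (write(fd, quota, strlen(quota)) < 0) {
            perror("write");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }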

See cgroups Deep Dive for resource controllers, v1 vs v2, and container usage.

Namespaces: Isolated Views of the System

Namespaces give processes their own isolated view of system resources:

  • PID namespace: Process 1 inside the container isn't the real init
  • Network namespace: Separate network stack, interfaces, routing tables
  • Mount namespace: Different filesystem view
  • User namespace: UID 0 inside maps to unprivileged user outside

Containers combine these to create isolated environments without the overhead of virtual machines.
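
For a tiny taste, the following sketch unshares the UTS namespace and changes the hostname; the new name is visible only inside the process's namespace, while the host keeps its own. It needs root or CAP_SYS_ADMIN:

    /* uts_namespace.c - isolating the hostname with a UTS namespace.
     * Run as root (or with CAP_SYS_ADMIN); the host's hostname is untouched. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Give this process its own copy of the hostname/domainname. */
        if (unshare(CLONE_NEWUTS) != 0) {
            perror("unshare(CLONE_NEWUTS)");
            return 1;
        }

        const char name[] = "container-demo";
        if (sethostname(name, strlen(name)) != 0) {
            perror("sethostname");
            return 1;
        }

        char buf[64];
        gethostname(buf, sizeof buf);
        printf("inside namespace: hostname is now %s\n", buf);
        return 0;   /* outside, `hostname` still shows the original name */
    }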

See Linux Namespaces for all seven namespace types and how they enable containers.

eBPF: Programmable Kernel Extensions

eBPF lets you run sandboxed programs inside the kernel without modifying kernel code. Use cases:

  • Tracing: Attach to any kernel function, collect data
  • Networking: High-performance packet processing, load balancing
  • Security: Runtime security monitoring

The eBPF verifier ensures programs can't crash the kernel or access unauthorized memory.

Putting it together for a single container:

  • PID namespace: inside, processes are numbered from PID 1; the host sees the same processes with their real PIDs (e.g., systemd as PID 1, nginx as PID 12345, its worker as PID 12346).
  • Network namespace: the container gets its own network stack, interfaces, and routing tables, separate from the host's eth0 (192.168.1.100), docker0 bridge (172.17.0.1), and loopback (127.0.0.1); the host manages the bridge network.
  • cgroup CPU limit (50%): the container cannot exceed its allocated CPU quota.

Key Takeaways

  1. The kernel is the mediator between your programs and hardware—it enforces security, shares resources fairly, and abstracts hardware differences.

  2. Protection rings create security boundaries. User programs (Ring 3) can't directly touch hardware; they must ask the kernel (Ring 0) through syscalls.

  3. Linux is monolithic but modular. All kernel code runs in Ring 0 for performance, but loadable modules provide flexibility.

  4. VFS is the universal translator that lets the same code work with any filesystem.

  5. Syscalls have overhead. The mode switch costs cycles—batch your I/O when performance matters.

  6. Modern containers use kernel features (cgroups, namespaces) rather than full virtualization.

