Linux Kernel Architecture: How Your OS Actually Works

Linux kernel architecture explained. Learn syscalls, protection rings, user vs kernel space, and what happens when you run a command.


The Invisible City That Runs Your Computer

Every time you press a key, open a file, or browse the web, you're relying on the Linux kernel—a 30+ million line program that most users never think about. It's like a city's infrastructure: invisible when working perfectly, essential always.

What happens when you type ls?

  1. Your shell (user space) asks the kernel for directory contents
  2. The kernel checks if you're allowed to read that directory
  3. It asks the filesystem layer for the data
  4. The filesystem asks the disk driver for the actual bytes
  5. Data flows back up through each layer to your terminal

All of this happens in microseconds, thousands of times per second. The kernel orchestrates this dance between your programs and your hardware, ensuring that each program gets fair access to resources, security boundaries are enforced, and the whole system doesn't collapse when one program misbehaves.
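
To make step 1 concrete, here is a minimal C sketch of what a tool like ls does in user space: it calls the glibc wrappers opendir() and readdir(), which in turn issue openat and getdents64 syscalls to the kernel.

    /* list_dir.c - a minimal sketch of step 1: asking the kernel for
     * directory contents via glibc, which issues openat/getdents64 syscalls. */
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : ".";

        DIR *dir = opendir(path);               /* openat() under the hood */
        if (!dir) {
            perror("opendir");
            return 1;
        }

        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL)  /* getdents64() in batches */
            printf("%s\n", entry->d_name);

        closedir(dir);                          /* close() syscall */
        return 0;
    }

Running it under strace shows the openat, getdents64, write, and close syscalls crossing the user/kernel boundary.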

Why Understanding This Matters

If you've ever wondered:

  • Why does my program slow down when another process is busy?
  • Why can't my app directly read from disk?
  • What actually happens during a "segfault"?
  • Why does Docker use "namespaces" and "cgroups"?

The answers all live in kernel architecture. Understanding these concepts transforms debugging from guesswork into systematic investigation.


The Big Picture: How Linux Is Organized

The Great Debate: Monolithic vs Microkernel

Before we dive into Linux's design, let's understand why it's built this way.

The core problem: How do you organize millions of lines of operating system code?

The microkernel approach (used by Minix, QNX, some embedded systems):

  • Keep the kernel minimal: just IPC, basic scheduling, and memory management
  • Run everything else (filesystems, drivers, networking) as user-space services
  • Services communicate via message passing

The upside: Clean separation, easier to reason about, one buggy driver can't crash the kernel. The downside: Message passing is slow. Every filesystem operation requires multiple context switches.

Linux's monolithic approach:

  • Everything runs in kernel space with shared memory
  • Components call each other directly, like functions in a single program
  • Direct hardware access, no message passing overhead

The upside: Fast. Really fast. No IPC overhead for internal operations. The downside: Complex. A bug in a driver can take down the entire system.

Linux's clever compromise: Loadable kernel modules (.ko files) give you modularity without the performance hit. You can add or remove drivers at runtime, but they still run in kernel space with full privileges.
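
To give a feel for what a module looks like, here is the classic hello-world module sketch (the hello_init/hello_exit names are arbitrary). Compiled against the kernel headers it becomes hello.ko, which insmod or modprobe can load at runtime and rmmod can remove.

    /* hello.c - minimal loadable kernel module sketch.
     * Build against the kernel headers to get hello.ko, then load with insmod. */
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal example module");

    static int __init hello_init(void)
    {
        pr_info("hello: loaded into kernel space\n");
        return 0;                       /* 0 = successful initialization */
    }

    static void __exit hello_exit(void)
    {
        pr_info("hello: unloaded\n");
    }

    module_init(hello_init);            /* called on insmod/modprobe */
    module_exit(hello_exit);            /* called on rmmod */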

The two designs at a glance:

  Monolithic (Linux)
  • Kernel space: VFS → ext4 → block layer → NVMe driver, alongside the scheduler, memory management, and the network stack
  • User space: application → glibc
  • ✓ Fast: direct function calls
  • ⚠ Risk: a driver bug can crash the kernel

  Microkernel (Minix, QNX)
  • Kernel space: IPC and a minimal scheduler only
  • User space: app → FS server → IPC → block server → IPC → driver, plus network and memory servers
  • ✓ Safe: isolated servers, fault-tolerant
  • ⚠ Slow: IPC overhead on every operation

Protection Rings: Security Clearances for Code

Think of protection rings like security clearances in a government building:

Ring 3 (User Mode) — "Public Access"

  • This is where your applications run: browsers, editors, games
  • Can't touch hardware directly—must ask the kernel for everything
  • If your program crashes, only your program dies
  • Like a visitor who can only access the lobby

Ring 0 (Kernel Mode) — "Top Secret Clearance"

  • Full hardware access: can execute any CPU instruction
  • Direct memory manipulation, device control
  • If kernel code crashes, the whole system goes down
  • Like a security officer with keys to every room

The syscall is the checkpoint: When your program needs something privileged (reading a file, sending network data), it makes a system call. This triggers a mode switch from Ring 3 to Ring 0—the CPU literally changes its operating mode. The kernel performs the operation and returns control to your program.

x86 provides four rings (0-3), but Linux only uses two. Rings 1 and 2 were intended for device drivers but are unused—Linux keeps things simple with just "user" and "kernel."
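
To see the checkpoint in action, you can skip the friendly glibc wrappers and issue a system call yourself via the syscall(2) helper. A minimal sketch:

    /* raw_syscall.c - invoking a system call explicitly.
     * syscall() places the number and arguments in registers and executes
     * the syscall instruction, switching the CPU from Ring 3 to Ring 0. */
    #define _GNU_SOURCE
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "hello from ring 3\n";

        /* SYS_write is 1 on x86-64; fd 1 is stdout. */
        long written = syscall(SYS_write, 1, msg, sizeof msg - 1);

        return written == (long)(sizeof msg - 1) ? 0 : 1;
    }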


The Subsystems: What Each Layer Does

Rather than listing features, let's understand what problem each subsystem solves.

Process Management: Sharing One CPU Among Many

The problem: You have 8 CPU cores but 200 running processes. How do you give each one the illusion of having its own processor?

The solution: The scheduler rapidly switches between processes (typically every 1-10 milliseconds), saving and restoring their state so quickly that each process thinks it's running continuously.

Every process is tracked by a task_struct—a data structure containing everything the kernel needs to know: process ID, memory mappings, open files, scheduling priority, and more. When the scheduler decides it's time to switch, it saves the current process's registers to its task_struct and loads another's.

Why it matters: Understanding scheduling explains why CPU-bound tasks slow each other down, why nice values affect performance, and why real-time systems need different schedulers.
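
As a small illustration of nice values, this sketch raises its own nice value before doing CPU-bound work, so the scheduler favors other runnable processes when the CPU is contended:

    /* be_nice.c - sketch of adjusting scheduling priority (nice value).
     * A higher nice value means lower priority under CPU contention. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
        /* Raise our nice value to 10: be more polite to other processes. */
        if (setpriority(PRIO_PROCESS, 0, 10) != 0) {
            perror("setpriority");
            return 1;
        }
        printf("pid %d now runs at nice %d\n",
               getpid(), getpriority(PRIO_PROCESS, 0));

        /* CPU-bound busy loop: the scheduler preempts this regularly. */
        for (volatile unsigned long i = 0; i < 2000000000UL; i++)
            ;
        return 0;
    }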

See Process Management for the complete picture of fork, exec, and scheduling algorithms.

Memory Management: Giving Each Process Its Own Universe

The problem: Multiple processes all want memory, but they can't be allowed to read or corrupt each other's data. And you might have less physical RAM than processes request.

The solution: Virtual memory. Each process gets its own address space—a private view of memory where address 0x400000 in one process maps to completely different physical RAM than the same address in another process.

The kernel maintains page tables that translate virtual addresses to physical addresses. When a process accesses memory, the CPU's MMU (Memory Management Unit) performs this translation in hardware. If a process tries to access memory it shouldn't, the MMU triggers a fault, and the kernel terminates the offender (the infamous "segmentation fault").

Key insight: Virtual memory enables overcommitment. You can run programs that collectively "use" more memory than you physically have, because the kernel can swap unused pages to disk and bring them back when needed.
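
A quick way to observe private address spaces is to fork: parent and child print the same virtual address for a variable, yet a write in one is invisible to the other because the kernel maps that address to different physical pages (copy-on-write). A minimal sketch:

    /* private_memory.c - same virtual address, different physical pages.
     * After fork(), parent and child share the address numerically, but
     * copy-on-write gives each its own physical copy once either writes. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int value = 1;

        pid_t pid = fork();
        if (pid == 0) {                 /* child */
            value = 42;                 /* triggers copy-on-write */
            printf("child : &value=%p value=%d\n", (void *)&value, value);
            return 0;
        }

        wait(NULL);                     /* let the child finish first */
        printf("parent: &value=%p value=%d\n", (void *)&value, value);
        return 0;                       /* parent still sees value == 1 */
    }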

See Memory Management for paging, TLB, and memory zones.

Virtual File System: The Universal Translator

The problem: There are dozens of filesystems—ext4, XFS, NTFS, NFS, tmpfs, procfs—but applications just want to open() a file. How do you support all of them without rewriting every program?

The solution: VFS provides a unified interface. Applications call generic functions (open, read, write), and VFS translates these into filesystem-specific operations.

VFS maintains four key data structures:

  • Superblock: Filesystem-wide metadata (block size, free space, mount options)
  • Inode: Per-file metadata (permissions, size, block locations)
  • Dentry: Directory entry cache (maps names like "config.txt" to inodes)
  • File: Open file state (current position, access mode)

When you open("/etc/passwd"), VFS walks the dentry cache, finds the inode, and creates a file structure. When you read(), VFS calls the filesystem's read implementation (e.g., ext4's block-reading code).

Why this matters: VFS is why cat /proc/cpuinfo works the same as cat /etc/passwd. The procfs filesystem generates "file" contents dynamically, but your tools don't need to know that.
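
The sketch below reads /etc/passwd and /proc/cpuinfo with the identical open/read/close sequence; VFS dispatches the same calls to a disk filesystem in one case and to procfs's generated content in the other:

    /* vfs_demo.c - one code path, two very different filesystems.
     * VFS routes the same open/read calls to ext4 and to procfs. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void dump_first_line(const char *path)
    {
        char buf[256];
        int fd = open(path, O_RDONLY);              /* VFS: dentry walk -> inode */
        if (fd < 0) {
            perror(path);
            return;
        }
        ssize_t n = read(fd, buf, sizeof buf - 1);  /* filesystem's read op */
        if (n > 0) {
            buf[n] = '\0';
            printf("%s: %.*s\n", path, (int)strcspn(buf, "\n"), buf);
        }
        close(fd);
    }

    int main(void)
    {
        dump_first_line("/etc/passwd");     /* backed by a disk filesystem */
        dump_first_line("/proc/cpuinfo");   /* generated on the fly by procfs */
        return 0;
    }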

See Filesystems Overview for more on how this abstraction works.

Network Stack: TCP/IP in the Kernel

The problem: Network protocols are complex state machines. TCP alone handles connection setup, reliable delivery, congestion control, and teardown. Should every application implement this?

The solution: The kernel implements the entire TCP/IP stack. Applications just send and receive bytes through sockets; the kernel handles everything else.

The stack follows the familiar layers:

  • Socket layer: Application interface (send, recv, connect)
  • Transport layer: TCP, UDP, SCTP protocol implementation
  • Network layer: IP routing, ICMP, packet forwarding
  • Link layer: Ethernet frames, ARP, driver interface

Netfilter provides hooks throughout the stack where packet filtering (iptables/nftables), NAT, and connection tracking plug in.
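
Here is a minimal TCP client sketch showing the division of labor: the application only calls socket, connect, and send, while the kernel handles the handshake, segmentation, and routing (the 127.0.0.1:8080 endpoint is just a placeholder):

    /* tcp_client.c - minimal socket sketch: the kernel does the TCP work.
     * The address and port below are placeholders for illustration. */
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);   /* socket layer */
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(8080),
        };
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        /* The kernel performs the TCP three-way handshake here. */
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("connect");
            return 1;
        }

        const char msg[] = "hello\n";
        send(fd, msg, sizeof msg - 1, 0);   /* TCP segments, IP packets, frames */
        close(fd);
        return 0;
    }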

See Networking Stack for the complete journey of a packet.

Device Drivers: Translating Hardware Diversity

The problem: Thousands of hardware devices exist, each with different registers, protocols, and quirks. How does the kernel support them all without becoming unmaintainable?

The solution: The driver model. Each driver translates between the kernel's generic interface and device-specific operations.

Driver types:

  • Character devices: Byte streams—serial ports, terminals, /dev/random
  • Block devices: Random access to fixed-size blocks—SSDs, HDDs
  • Network devices: Packet-oriented interfaces—eth0, wlan0

Drivers register with the kernel, saying "I can handle devices with this ID." When hardware is detected, the kernel matches it to a driver and calls initialization functions. From then on, the driver handles all communication with that device.

Loadable modules (.ko files) allow adding drivers without recompiling the kernel. Run lsmod to see what's currently loaded; modprobe to load new ones.
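
A character device looks like an ordinary file from user space. This sketch reads a few bytes from /dev/urandom; behind the open/read calls, the kernel routes the request to the random-number driver rather than to a filesystem:

    /* chardev_read.c - reading from a character device (/dev/urandom).
     * The familiar open/read interface is served by a driver here. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char bytes[8];

        int fd = open("/dev/urandom", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        if (read(fd, bytes, sizeof bytes) != sizeof bytes) {
            perror("read");
            close(fd);
            return 1;
        }
        close(fd);

        for (size_t i = 0; i < sizeof bytes; i++)
            printf("%02x", bytes[i]);
        printf("\n");
        return 0;
    }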


A Syscall's Journey: Following read() Through the Kernel

Let's trace what happens when your program calls read(fd, buffer, 4096):

1. User Space (Ring 3): Your C program calls read(). The C library (glibc) prepares the syscall: it puts the syscall number (0 for read on x86-64), file descriptor, buffer address, and count into specific CPU registers.

2. The Syscall Instruction: glibc executes the syscall instruction. This is the magic moment—the CPU:

  • Saves the current instruction pointer and stack pointer
  • Switches from Ring 3 to Ring 0
  • Jumps to the kernel's syscall entry point

This mode switch costs roughly 100-1000 CPU cycles. It's fast, but not free.

3. Kernel Entry: The kernel's syscall handler looks up the syscall number in a table and calls sys_read(). It validates arguments: Is this file descriptor valid? Is the buffer address accessible?

4. VFS Layer: VFS finds the file structure for this descriptor and calls the filesystem's read operation. For ext4, this means translating the file offset to block numbers.

5. Block Layer: The block layer checks if the requested data is already in the page cache (RAM). If yes, it copies directly to the user's buffer. If not, it schedules I/O.

6. Device Driver: For a cache miss, the driver sends a read command to the hardware controller. The request may complete immediately (fast SSD) or take milliseconds (spinning disk). The process is put to sleep while waiting.

7. Return Path: When data arrives, it flows back up: driver → block layer → page cache → VFS → sys_read → syscall return. The syscall instruction's partner, sysret, switches back to Ring 3, restoring the user program's state.

8. Back in User Space: read() returns with data in your buffer, completely unaware of the journey it just took.

Performance insight: This is why buffered I/O matters. Reading 1 byte 1000 times means 1000 mode switches. Reading 1000 bytes once means just one. The stdio library's buffering exists precisely to amortize syscall overhead.
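
The sketch below makes the contrast concrete: the first loop performs one read() syscall per byte, while the second lets stdio pull a whole block into a user-space buffer and issue far fewer syscalls:

    /* buffering.c - why batching I/O matters.
     * First loop: one syscall per byte. Second: stdio buffers a block per syscall. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char c;

        /* Unbuffered: each read() is a full Ring 3 -> Ring 0 round trip. */
        int fd = open("/etc/passwd", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        long slow_bytes = 0;
        while (read(fd, &c, 1) == 1)
            slow_bytes++;
        close(fd);

        /* Buffered: fgetc() usually hits stdio's user-space buffer;
         * the kernel only sees occasional large read() calls. */
        FILE *f = fopen("/etc/passwd", "r");
        if (!f) { perror("fopen"); return 1; }
        long fast_bytes = 0;
        while (fgetc(f) != EOF)
            fast_bytes++;
        fclose(f);

        printf("unbuffered: %ld bytes, buffered: %ld bytes (same data, far fewer syscalls)\n",
               slow_bytes, fast_bytes);
        return 0;
    }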


Modern Kernel Features

cgroups: Resource Limits and Accounting

cgroups (control groups) let you limit how much CPU, memory, and I/O a group of processes can use. Docker containers? They're cgroups + namespaces.

    # Limit a process group to 50% of one CPU core
    echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
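
Because cgroups are exposed as a filesystem, the same knob can be turned programmatically. This sketch assumes a v1 cpu controller mounted at /sys/fs/cgroup/cpu and an existing group named mygroup:

    /* cgroup_limit.c - cgroups are configured through ordinary file writes.
     * Assumes a v1 cpu controller and an existing group "mygroup". */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path  = "/sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us";
        const char *quota = "50000";    /* 50% of one core (period 100000) */

        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror(path); return 1; }

        if (write(fd, quota, strlen(quota)) < 0) {
            perror("write");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }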

See cgroups Deep Dive for resource controllers, v1 vs v2, and container usage.

Namespaces: Isolated Views of the System

Namespaces give processes their own isolated view of system resources:

  • PID namespace: Process 1 inside the container isn't the real init
  • Network namespace: Separate network stack, interfaces, routing tables
  • Mount namespace: Different filesystem view
  • User namespace: UID 0 inside maps to unprivileged user outside

Containers combine these to create isolated environments without the overhead of virtual machines.
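
For a tiny taste, the following sketch unshares the UTS namespace and changes the hostname; the new name is visible only inside the process's namespace, while the host keeps its own. It needs root or CAP_SYS_ADMIN:

    /* uts_namespace.c - isolating the hostname with a UTS namespace.
     * Run as root (or with CAP_SYS_ADMIN); the host's hostname is untouched. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Give this process its own copy of the hostname/domainname. */
        if (unshare(CLONE_NEWUTS) != 0) {
            perror("unshare(CLONE_NEWUTS)");
            return 1;
        }

        const char name[] = "container-demo";
        if (sethostname(name, strlen(name)) != 0) {
            perror("sethostname");
            return 1;
        }

        char buf[64];
        gethostname(buf, sizeof buf);
        printf("inside namespace: hostname is now %s\n", buf);
        return 0;   /* outside, `hostname` still shows the original name */
    }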

See Linux Namespaces for all seven namespace types and how they enable containers.

eBPF: Programmable Kernel Extensions

eBPF lets you run sandboxed programs inside the kernel without modifying kernel code. Use cases:

  • Tracing: Attach to any kernel function, collect data
  • Networking: High-performance packet processing, load balancing
  • Security: Runtime security monitoring

The eBPF verifier ensures programs can't crash the kernel or access unauthorized memory.

Putting it together for a single container:

  • PID namespace: inside, processes are numbered from PID 1; the host sees the same processes with their real PIDs (e.g., systemd as PID 1, nginx as PID 12345, its worker as PID 12346).
  • Network namespace: the container gets its own network stack, interfaces, and routing tables, separate from the host's eth0 (192.168.1.100), docker0 bridge (172.17.0.1), and loopback (127.0.0.1); the host manages the bridge network.
  • cgroup CPU limit (50%): the container cannot exceed its allocated CPU quota.

Key Takeaways

  1. The kernel is the mediator between your programs and hardware—it enforces security, shares resources fairly, and abstracts hardware differences.

  2. Protection rings create security boundaries. User programs (Ring 3) can't directly touch hardware; they must ask the kernel (Ring 0) through syscalls.

  3. Linux is monolithic but modular. All kernel code runs in Ring 0 for performance, but loadable modules provide flexibility.

  4. VFS is the universal translator that lets the same code work with any filesystem.

  5. Syscalls have overhead. The mode switch costs cycles—batch your I/O when performance matters.

  6. Modern containers use kernel features (cgroups, namespaces) rather than full virtualization.

