The Internet Post Office
Every time you browse a website, your computer performs an intricate dance involving multiple layers of wrapping, addressing, and routing. Think of it like a multinational postal system:
The Network as a Postal System
- Your message: the letter content (HTTP request, file data)
- TCP envelope: tracking number + delivery confirmation (ensures nothing gets lost)
- IP envelope: street address (tells routers where to send it)
- Ethernet envelope: local mailroom routing (MAC addresses for the local network)
- Security checkpoint: iptables/netfilter (inspects and filters every package)
Let's open each envelope and see how your data actually travels from application to wire.
Packet Encapsulation: The Russian Nesting Doll
When you send data, it gets wrapped in headers at each layer, like putting a letter in progressively larger envelopes. Watch the encapsulation process:
Packet Encapsulation Journey
Watch how your data gets wrapped in headers as it travels through the network stack. Each layer adds its own "envelope" with routing and control information.
Your application writes "Hello World" to a socket
Each header contains essential routing and control information.
Layer Responsibilities
Each layer has a specific job:
| Layer | Protocol | What It Adds | Why |
|---|---|---|---|
| Application | HTTP | Request/response | Your actual data |
| Transport | TCP | Ports + sequencing | Which app, reliable delivery |
| Network | IP | IP addresses | Where to route globally |
| Data Link | Ethernet | MAC addresses | Where on local network |
For a 12-byte "Hello World", you add 54 bytes of headers: that's 82% overhead! This is why small packets are inefficient, and why protocols like HTTP/2 batch multiple requests together. The overhead is the same whether you send 12 bytes or 1,400 bytes.
The TCP Handshake: Establishing Trust
Before any data flows, TCP performs a "three-way handshake" to establish a reliable connection. This is one of the most asked-about networking concepts:
TCP Three-Way Handshake
Before any data can flow, TCP establishes a reliable connection through this handshake. Watch the sequence numbers; they're how TCP tracks and orders every byte.
Why Three Packets?
The handshake proves that both sides can send AND receive:
```
Client                                      Server
  |                                            |
  | -------- SYN (seq=1000) ---------------->  |   Client proves: "I can send"
  |                                            |
  | <------- SYN-ACK (seq=2000, ack=1001) ---  |   Server proves: "I can receive AND send"
  |                                            |
  | -------- ACK (ack=2001) ---------------->  |   Client proves: "I can receive"
  |                                            |
  |            Connection ESTABLISHED          |
```
The sequence numbers (1000, 2000, etc.) are how TCP tracks every byte; they're essential for detecting lost or reordered packets.
Socket Programming: The Application Interface
Sockets are the API between applications and the network stack. Under the hood, every socket operation is a system call that transitions from user space into the kernel.
Socket Lifecycle Demo
Sockets are just file descriptors: integers that reference kernel data structures. Watch the system calls that create, configure, and use them.
Every socket is a file descriptor: just an integer that indexes into the kernel's per-process file table. This is Unix's "everything is a file" philosophy in action:
- fd 0, 1, 2 = stdin, stdout, stderr
- fd 3+ = your sockets, files, pipes
- accept() returns a new fd for each client
Socket Types
```c
// TCP Socket (reliable, ordered delivery)
int tcp_sock = socket(AF_INET, SOCK_STREAM, 0);

// UDP Socket (fast, no guarantees)
int udp_sock = socket(AF_INET, SOCK_DGRAM, 0);

// Raw Socket (direct IP access, requires root)
int raw_sock = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);

// Unix Domain Socket (local IPC, no network overhead)
int unix_sock = socket(AF_UNIX, SOCK_STREAM, 0);
```
TCP Server Pattern
```c
#include <sys/socket.h>
#include <netinet/in.h>

int create_server(int port) {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);

    // Allow port reuse (avoid "Address already in use")
    int opt = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = INADDR_ANY,
        .sin_port = htons(port)
    };
    bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(server_fd, 128);  // Backlog queue

    while (1) {
        int client = accept(server_fd, NULL, NULL);
        handle_client(client);
        close(client);
    }
}
```
High-Performance I/O with epoll
For handling thousands of connections:
```c
int epfd = epoll_create1(0);

struct epoll_event ev = {
    .events = EPOLLIN | EPOLLET,  // Edge-triggered
    .data.fd = socket_fd
};
epoll_ctl(epfd, EPOLL_CTL_ADD, socket_fd, &ev);

// Event loop
struct epoll_event events[MAX_EVENTS];
while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < n; i++) {
        handle_event(events[i].data.fd);
    }
}
```
Routing and Forwarding
The kernel decides where to send each packet based on its routing table. Understanding how routes are selected is crucial for network debugging.
Routing Decision Demo
The kernel uses longest prefix matching to select routes. Watch how it evaluates each route and picks the most specific match.
When multiple routes match, the kernel picks the most specific one (longest prefix). This is why:
- /32 host routes override network routes
- /24 beats /16 beats /8
- 0.0.0.0/0 is the default: it matches everything but loses to any more specific route
Viewing Routes
```bash
# Modern way
ip route show

# Output:
# default via 192.168.1.1 dev eth0
# 192.168.1.0/24 dev eth0 scope link
# 172.17.0.0/16 dev docker0 scope link
```
Adding Routes
```bash
# Static route to a network
ip route add 10.0.0.0/8 via 192.168.1.1 dev eth0

# Default gateway
ip route add default via 192.168.1.1

# Source-based routing (advanced)
ip rule add from 192.168.2.0/24 table custom
ip route add default via 10.0.0.1 table custom
```
Enable IP Forwarding (Router Mode)
```bash
# Temporarily
echo 1 > /proc/sys/net/ipv4/ip_forward

# Permanently
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
sysctl -p
```
Netfilter: The Kernel's Security Checkpoint
Every packet entering or leaving your system passes through netfilter, a core part of the Linux kernel architecture. This is what iptables (and its successor nftables) uses for packet filtering.
Netfilter Packet Flow
Watch packets traverse the netfilter hooks. Different packet types take different paths; understanding this is key to writing correct iptables rules.
iptables processes rules top to bottom; the first matching rule wins. This is why you should:
- Put the ESTABLISHED,RELATED rule first (most traffic matches this)
- Add specific ACCEPT rules for allowed services
- Set default policy to DROP (deny by default)
Understanding Netfilter Hooks
Packets traverse different hooks depending on their destination:
```
                               +---------------+
                               | Local Process |
                               +---------------+
                                  ^         |
                                  |         v
Network --> PREROUTING --> INPUT          OUTPUT --> POSTROUTING --> Network
                |      (local destination)               ^
                |                                        |
                +---------> FORWARD ---------------------+
                            (forwarded packets)
```
Common iptables Rules
```bash
# Allow established connections (most efficient rule first!)
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow SSH from specific network
iptables -A INPUT -p tcp --dport 22 -s 192.168.1.0/24 -j ACCEPT

# Allow HTTP/HTTPS
iptables -A INPUT -p tcp -m multiport --dports 80,443 -j ACCEPT

# Drop everything else (default deny)
iptables -P INPUT DROP
```
iptables processes rules top to bottom; the first match wins. If you put -j DROP before -j ACCEPT, packets will be dropped before ever reaching the accept rule. Always put your most-matched rules (like ESTABLISHED) first for performance.
Performance Tuning
Network performance depends heavily on memory management: the kernel allocates buffers (sk_buff structures) for every packet in flight. The most critical tuning parameter is the TCP buffer size.
TCP Buffer Tuning Simulator
TCP buffers must be sized correctly for your network conditions. The Bandwidth-Delay Product (BDP) tells you the optimal buffer size.
```bash
# Set maximum buffer sizes
sysctl -w net.core.rmem_max=131072
sysctl -w net.core.wmem_max=131072

# Set TCP buffer auto-tuning range
sysctl -w net.ipv4.tcp_rmem="4096 87380 131072"
sysctl -w net.ipv4.tcp_wmem="4096 65536 131072"
```
TCP can only have one buffer's worth of data in flight before waiting for ACKs. On high-latency links (like satellite), this becomes the bottleneck:
- 64KB buffer + 600ms RTT = max 853 Kbps (throttled!)
- BDP-sized buffer = full bandwidth utilization
- Too large = wasted memory, possible bufferbloat
TCP Buffer Tuning Commands
```bash
# Increase buffer sizes for high-bandwidth links
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

# Use BBR congestion control (Google's algorithm)
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
```
Network Interface Tuning
```bash
# Increase ring buffer
ethtool -G eth0 rx 4096 tx 4096

# Enable offloading
ethtool -K eth0 gso on gro on tso on

# Receive Packet Steering (multi-core)
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
```
Network Namespaces
Linux can create isolated network environments; this is how containers get their own network stack. Network namespaces work alongside other namespace types (PID, mount, user namespaces) to provide full container isolation.
```bash
# Create namespace
ip netns add mycontainer

# Create virtual ethernet pair
ip link add veth0 type veth peer name veth1

# Move one end into namespace
ip link set veth1 netns mycontainer

# Configure host side
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up

# Configure container side
ip netns exec mycontainer ip addr add 10.0.0.2/24 dev veth1
ip netns exec mycontainer ip link set veth1 up
ip netns exec mycontainer ip link set lo up

# Test connectivity
ping 10.0.0.2
```
Debugging Tools
Essential Commands
```bash
# Active connections
ss -tunap

# Packet capture
tcpdump -i eth0 'tcp port 80'

# Route tracing
traceroute google.com
mtr google.com  # Better interactive version

# Performance testing
iperf3 -s       # Server
iperf3 -c host  # Client
```
Common Pitfalls
Common Networking Mistakes
- Undersized buffers: Default 64KB buffers throttle satellite/VPN connections. Calculate the BDP and size buffers accordingly.
- Wrong rule order: Putting DROP before ACCEPT, or putting expensive rules before ESTABLISHED. First match wins!
- Forwarding disabled: Linux won't route packets between interfaces unless net.ipv4.ip_forward=1.
- Blocking I/O: Use epoll/kqueue for thousands of connections. Blocking I/O means one thread per connection.
- Connection tracking overhead: conntrack uses memory per connection. High-traffic servers may need to tune nf_conntrack_max.
Key Takeaways
What You Should Remember
- Encapsulation: Data gets wrapped in headers at each layer. Each layer only understands its own header; that's the beauty of layering.
- TCP handshake: Three packets establish that both sides can send AND receive. Sequence numbers track every byte.
- Sockets: Just integers indexing into kernel tables. accept() returns a NEW fd per client while the server keeps listening.
- Routing: Longest prefix matching wins. /32 beats /24 beats /16. The default route (0.0.0.0/0) is the fallback.
- Netfilter: Local traffic: PREROUTING → INPUT. Forwarded traffic: PREROUTING → FORWARD → POSTROUTING. Outgoing: OUTPUT → POSTROUTING.
- Buffer tuning: TCP buffers should match the Bandwidth-Delay Product for full throughput on high-latency links.
Understanding the networking stack empowers you to debug connectivity issues, build high-performance servers, and secure your systems. Every web request, every SSH session, every API call traverses this intricate system.
Related Concepts
- Linux Namespaces: Network namespaces and the six other types
- Containers Under the Hood: How veth pairs and bridges enable container networking
- TCP/IP Model: Deep dive into the protocol layers
- WebSockets: Persistent bidirectional connections over TCP
