
Transparent Huge Pages (THP): Reducing TLB Pressure




Why THP Matters

Every time your program accesses memory, the CPU must translate a virtual address into a physical one. This translation relies on the Translation Lookaside Buffer (TLB) -- a tiny, fast cache that stores recent address mappings. When the TLB has the answer, translation takes 1-2 cycles. When it does not (a TLB miss), the CPU must walk through up to four levels of page tables in memory, costing 10-20 cycles or more.

Here is the problem: with standard 4 KB pages, even a generous TLB with 1,536 entries only covers about 6 MB of memory. A database with a 10 GB working set will miss the TLB constantly -- over 99.9% of its address space has no TLB entry at any given moment. Those misses add up to a serious performance tax on memory-intensive workloads.

Transparent Huge Pages solve this by automatically using 2 MB pages instead of 4 KB pages. A single TLB entry now covers 512 times more memory, and the page table walk is one level shorter. The "transparent" part means the kernel handles this promotion behind the scenes -- applications do not need to be rewritten. But this convenience comes with real trade-offs around memory fragmentation, latency spikes, and wasted memory that make THP a feature you need to understand, not just enable.

The Page Table Walk Problem

4KB Page Translation: 4-Level Page Walk (diagram)

  • A 64-bit virtual address (top 16 bits unused) splits into four 9-bit table indices -- PGD, PUD, PMD, PTE -- plus a 12-bit page offset.
  • Every translation miss costs 4 memory accesses: PGD (Page Global Directory) → PUD (Page Upper Directory) → PMD (Page Middle Directory) → PTE (Page Table Entry).
  • 512 entries per table level: 512⁴ ≈ 68 billion possible pages, spanning the 256 TB (48-bit) address space.
  • Each 1 GB of RAM needs ~2 MB of page tables (0.2% overhead).

Modern x86-64 systems use a four-level page table structure. When a TLB miss occurs, the hardware must chase pointers through four successive tables -- PGD, PUD, PMD, and PTE -- each requiring a memory access. With huge pages, the walk stops one level earlier (at the PMD), because the final 21 bits of the address serve as the offset within the 2 MB page rather than indexing into a PTE table.

This shorter walk means fewer memory accesses per miss, and the dramatically improved TLB coverage means fewer misses overall. The combined effect can be substantial for workloads that touch large amounts of memory.

How TLB Coverage Changes

TLB Coverage Comparison (diagram)

  • Standard 4KB pages: L1 DTLB (64 entries × 4KB = 256KB) plus L2 STLB (1536 entries × 4KB = 6MB) cover 6MB total. For a 10GB process, 6MB is cached and 9.994GB requires page walks.
  • 2MB huge pages: L1 DTLB (32 entries × 2MB = 64MB) plus L2 STLB (1536 entries × 2MB = 3GB) cover 3GB total. For the same 10GB process, 3GB is cached and only 7GB requires page walks.
  • 512× more coverage per TLB entry; 30-70% fewer TLB misses.

The arithmetic is striking. With 4 KB pages and 1,536 TLB entries, you cover 6 MB. With 2 MB pages and the same number of entries, you cover 3 GB -- a 512x improvement from the same hardware. For a server running a database with tens of gigabytes in memory, this transforms the TLB from a resource under constant pressure into one that comfortably covers the working set.

Compare the page walk paths side by side:

Page walk comparison (diagram)

  • 4KB page (4-level walk): PGD → PUD → PMD → PTE → 4KB page. 4 memory accesses.
  • 2MB huge page (3-level walk): PGD → PUD → PMD with the PSE bit set; the PTE level is skipped and the PMD entry maps 512 contiguous 4KB frames directly. 3 memory accesses -- one level and 25% of the walk eliminated.

How the Kernel Creates Huge Pages

Linux provides two mechanisms for creating huge pages transparently, each with different performance characteristics.

Synchronous Allocation: The Fast Path

When a process faults on an address within a 2 MB-aligned, sufficiently large virtual region, the kernel checks whether a contiguous 2 MB physical region is available. If one is free, the kernel allocates a huge page immediately. This fast path adds only 0.1-0.5 milliseconds of latency beyond a normal page fault.

If no contiguous region is available, the kernel faces a choice depending on its defragmentation settings: it can either fall back to a standard 4 KB page (fast but forfeits the THP benefit) or attempt to compact memory on the spot to create a contiguous region (which can stall the application for 10-100 milliseconds or more).

Asynchronous Promotion: khugepaged

The kernel runs a background daemon called khugepaged that periodically scans memory looking for opportunities to collapse 512 contiguous 4 KB pages into a single 2 MB huge page. This is the gentler path -- it happens in the background without stalling the application, but it takes time. A region might exist as small pages for seconds or minutes before khugepaged gets around to promoting it.

khugepaged Page Collapse Operation (diagram)

Before: the PMD points to a page table of 512 PTEs, each mapping a scattered 4KB physical page. khugepaged scans the region, copies its contents into a newly allocated contiguous 2MB region, and rewrites the mapping.

After: the PMD points directly to one contiguous 2MB physical page (PSE=1).

Benefits:

  • ✓ Page table memory freed: 4KB saved (512 PTEs × 8 bytes)
  • ✓ TLB coverage: 512× increase (1 TLB entry instead of 512)
  • ✓ Page walk cost: 25% reduction (3 levels instead of 4)
  • ⚠ Migration harder: must move 2MB contiguous block

The daemon is configurable: you can control how frequently it scans, how many pages it examines per scan, and how aggressively it works. Conservative settings (scanning every 10 seconds, examining 4,096 pages per pass) minimize background CPU overhead. Aggressive settings (scanning every second, examining 8,192 pages) achieve higher THP coverage at the cost of more background work.

The Fragmentation Challenge

Physical memory fragmentation is the central obstacle to THP. Over time, as processes allocate and free memory in varying sizes, the physical address space becomes a patchwork of used and free regions. Finding 512 contiguous free pages (2 MB) in a fragmented address space can be difficult or impossible without moving existing pages around.

Memory Fragmentation and THP Allocation (diagram)

Before defragmentation, pages belonging to Process A and Process B are scattered across physical memory and no 2MB contiguous region is available, so THP allocation fails. After defragmentation, a 2MB contiguous region exists and the THP allocation for Process A succeeds. The kcompactd and kswapd daemons migrate pages to create contiguous regions, which may require reclaiming page cache or moving active pages.

Defragmentation Modes

The kernel offers several strategies for handling the situation when a contiguous region is not available. Each represents a different trade-off between THP coverage and latency predictability.

| Mode | Behavior | Latency Risk | Best For |
| --- | --- | --- | --- |
| always | Synchronously compacts memory on every THP allocation attempt | Severe (10-100ms stalls) | Benchmarking only |
| defer | Hands off to a background daemon (kcompactd) | None for the application | General-purpose servers |
| defer+madvise | Background for most allocations; synchronous for regions the application explicitly requests | Low, controlled | Databases using madvise |
| madvise | Only attempts THP for regions the application explicitly opts into | None for non-opted regions | Production (recommended) |
| never | No defragmentation at all | Zero | Real-time or latency-critical systems |

The "always" mode sounds appealing but is dangerous in production. When the kernel stalls an application thread for 50-100 milliseconds to compact memory, database query latencies spike, web requests time out, and tail latencies become unpredictable. The "madvise" mode is the widely recommended production choice because it lets informed applications opt in to THP for their large, long-lived allocations while protecting everything else from unexpected stalls.

Where THP Helps Most

THP Performance Impact by Workload Type (chart)

  • Speedups: Memcpy +50%, Database +30%, Analytics +25%, Video Encoding +18%, ML Training +12%
  • Slowdowns: Fork-heavy web servers -28%, Sparse access patterns -22%

THP delivers its greatest benefits to workloads that combine large memory footprints with dense, sequential access patterns. These are workloads where TLB pressure is a genuine bottleneck rather than a minor overhead.

Databases like PostgreSQL, MongoDB, and Redis maintain large buffer pools (often 10 GB or more) and perform sequential scans over huge tables. THP can reduce TLB misses by 60-70% and improve throughput by 15-35%.

Machine learning training involves large, contiguous tensor allocations that are accessed sequentially during forward and backward passes. THP typically provides a 5-15% training speedup by improving cache line utilization and reducing page walk overhead.

In-memory analytics engines like Spark and ClickHouse process billions of rows in columnar data structures. The combination of large allocations and sequential scans makes them ideal THP candidates, with measured query speedups of 20-40%.

| Workload | Typical Improvement | TLB Miss Reduction | Recommended Mode |
| --- | --- | --- | --- |
| PostgreSQL | +30% throughput | 70% | madvise |
| MongoDB | +25% throughput | 65% | madvise |
| Redis | +31% throughput | 68% | madvise |
| ML Training (PyTorch) | +9% throughput | 45% | defer |
| Spark Analytics | +35% throughput | 72% | madvise |
| Video Encoding | +18% throughput | 50% | defer |

The Dark Side: Memory Bloat

THP's biggest hidden cost is internal fragmentation. When an application allocates a small amount of memory -- say, 100 bytes -- the kernel may back that allocation with an entire 2 MB huge page. The remaining 2,097,052 bytes are wasted.

Internal Fragmentation: Memory Bloat with THP (diagram)

Without THP (4KB pages): a process allocating 100 × 5KB objects uses 500KB of data. Each 5KB object spans 2 × 4KB pages (8KB), so total RSS is 800KB -- 300KB (37.5%) wasted.

With THP enabled (2MB pages): the same 100 × 5KB objects are backed by a full 2MB huge page. Only 500KB of the 2048KB RSS is used -- 1548KB (75.6%) wasted, a 2.5× memory bloat.

Memory Bloat Risk: workloads with many small allocations can see a 2-4× memory usage increase with THP.

This is particularly damaging for applications that allocate many small objects. Redis, for example, stores millions of small key-value pairs (50-100 bytes each). With THP in "always" mode, each small allocation can trigger a 2 MB huge page, inflating memory usage by 2-3x. Node.js applications with their many small JavaScript objects see similar bloat. Java applications using the G1 garbage collector, which works with 1 MB regions, waste roughly 50% of each 2 MB huge page.

The bloat is not just about wasted RAM. On memory-constrained systems, it can trigger out-of-memory kills. And when the system needs to swap, it must move entire 2 MB chunks to disk rather than surgical 4 KB pages, amplifying I/O pressure.

When THP Hurts Performance

Beyond memory bloat, THP actively degrades performance in several scenarios:

Fork-heavy workloads suffer because fork() must handle huge pages during copy-on-write. Each 2 MB page that gets written after a fork must be copied in full, making fork operations 2-3x slower. This matters for web servers using prefork models (like Apache) and for Redis, which forks for background persistence.

Sparse memory access patterns waste the effort of loading 2 MB pages when only 4 KB within each page is ever touched. The page fault brings in a full 2 MB from physical memory, consuming bandwidth and cache space for data that will never be read.

Rapid allocation and deallocation cycles (common in memory allocators handling many short-lived objects) prevent khugepaged from ever successfully promoting pages. The overhead of scanning and attempting promotion is spent with no benefit.

Choosing the Right Configuration

The right THP configuration depends on your workload characteristics:

| Workload | THP Enabled | Defrag Mode | khugepaged |
| --- | --- | --- | --- |
| Database (large buffer pool) | madvise | madvise | Aggressive |
| Web server (fork-based) | never | never | Disabled |
| ML training | always | defer | Moderate |
| Redis / Memcached | madvise | madvise | Moderate |
| Java (large heap, G1GC) | madvise | defer | Conservative |
| Real-time / latency-critical | never | never | Disabled |

For most production systems, the safest starting point is madvise mode for both the enabled and defrag settings. This gives applications that understand their memory patterns the ability to opt in (by calling madvise(MADV_HUGEPAGE) on their large, long-lived allocations) while preventing the kernel from making autonomous decisions that cause bloat or latency spikes.

Applications that want to opt out of THP for specific regions (such as small-object heaps) can use madvise(MADV_NOHUGEPAGE) to explicitly prevent promotion.

Key Takeaways

  1. THP exists to solve TLB pressure. With 4 KB pages, a 1,536-entry TLB covers only 6 MB. With 2 MB pages, the same TLB covers 3 GB -- a 512x improvement that dramatically reduces expensive page table walks.

  2. The kernel offers two promotion paths. Synchronous allocation is fast when contiguous memory is available; the khugepaged daemon promotes pages in the background over time. Both are transparent to the application.

  3. Fragmentation is the fundamental challenge. Finding 512 contiguous physical pages becomes harder as memory ages. Defragmentation can solve this but introduces latency spikes if done synchronously.

  4. Memory bloat is the hidden cost. Small allocations backed by 2 MB pages waste enormous amounts of memory. Applications with many small objects (Redis, Node.js, Java with G1GC) are especially vulnerable.

  5. Use madvise mode in production. Let applications explicitly opt in to THP for regions that benefit (large buffer pools, tensor allocations) while protecting the rest of the system from unexpected side effects.

  6. Always measure. THP can deliver 10-35% improvements for the right workloads or cause 2-3x memory bloat and latency spikes for the wrong ones. Benchmark with and without THP, and monitor TLB miss rates and memory usage continuously.
