
Transparent Huge Pages (THP): Reducing TLB Pressure




Why THP Matters

Every time your program accesses memory, the CPU must translate a virtual address into a physical one. This translation relies on the Translation Lookaside Buffer (TLB) -- a tiny, fast cache that stores recent address mappings. When the TLB has the answer, translation takes 1-2 cycles. When it does not (a TLB miss), the CPU must walk through up to four levels of page tables in memory, costing 10-20 cycles or more.

Here is the problem: with standard 4 KB pages, even a generous TLB with 1,536 entries only covers about 6 MB of memory. A database with a 10 GB working set will miss the TLB constantly -- over 99.9% of its address space has no TLB entry at any given moment. Those misses add up to a serious performance tax on memory-intensive workloads.

Transparent Huge Pages solve this by automatically using 2 MB pages instead of 4 KB pages. A single TLB entry now covers 512 times more memory, and the page table walk is one level shorter. The "transparent" part means the kernel handles this promotion behind the scenes -- applications do not need to be rewritten. But this convenience comes with real trade-offs around memory fragmentation, latency spikes, and wasted memory that make THP a feature you need to understand, not just enable.

The Page Table Walk Problem

4KB Page Translation: 4-Level Page Walk (diagram)

  • A 64-bit virtual address (top 16 bits unused) splits into four 9-bit table indices -- PGD, PUD, PMD, PTE -- plus a 12-bit page offset.
  • Every translation miss costs 4 memory accesses: PGD (Page Global Directory) → PUD (Page Upper Directory) → PMD (Page Middle Directory) → PTE (Page Table Entry).
  • 512 entries per table level: 512⁴ ≈ 68 billion possible pages, spanning the 256 TB (48-bit) address space.
  • Each 1 GB of RAM needs ~2 MB of page tables (0.2% overhead).

Modern x86-64 systems use a four-level page table structure. When a TLB miss occurs, the hardware must chase pointers through four successive tables -- PGD, PUD, PMD, and PTE -- each requiring a memory access. With huge pages, the walk stops one level earlier (at the PMD), because the final 21 bits of the address serve as the offset within the 2 MB page rather than indexing into a PTE table.

This shorter walk means fewer memory accesses per miss, and the dramatically improved TLB coverage means fewer misses overall. The combined effect can be substantial for workloads that touch large amounts of memory.

How TLB Coverage Changes

TLB Coverage Comparison (diagram)

  • Standard 4KB pages: L1 DTLB (64 entries × 4KB = 256KB) plus L2 STLB (1536 entries × 4KB = 6MB) cover 6MB total. For a 10GB process, 6MB is cached and 9.994GB requires page walks.
  • 2MB huge pages: L1 DTLB (32 entries × 2MB = 64MB) plus L2 STLB (1536 entries × 2MB = 3GB) cover 3GB total. For the same 10GB process, 3GB is cached and only 7GB requires page walks.
  • 512× more coverage per TLB entry; 30-70% fewer TLB misses.

The arithmetic is striking. With 4 KB pages and 1,536 TLB entries, you cover 6 MB. With 2 MB pages and the same number of entries, you cover 3 GB -- a 512x improvement from the same hardware. For a server running a database with tens of gigabytes in memory, this transforms the TLB from a resource under constant pressure into one that comfortably covers the working set.

Compare the page walk paths side by side:

Page walk comparison (diagram)

  • 4KB page (4-level walk): PGD → PUD → PMD → PTE → 4KB page. 4 memory accesses.
  • 2MB huge page (3-level walk): PGD → PUD → PMD with the PSE bit set; the PTE level is skipped and the PMD entry maps 512 contiguous 4KB frames directly. 3 memory accesses -- one level and 25% of the walk eliminated.

How the Kernel Creates Huge Pages

Linux provides two mechanisms for creating huge pages transparently, each with different performance characteristics.

Synchronous Allocation: The Fast Path

When a process faults on an address within a 2 MB-aligned, sufficiently large virtual region, the kernel checks whether a contiguous 2 MB physical region is available. If one is free, the kernel allocates a huge page immediately. This fast path adds only 0.1-0.5 milliseconds of latency beyond a normal page fault.

If no contiguous region is available, the kernel faces a choice depending on its defragmentation settings: it can either fall back to a standard 4 KB page (fast but forfeits the THP benefit) or attempt to compact memory on the spot to create a contiguous region (which can stall the application for 10-100 milliseconds or more).

Asynchronous Promotion: khugepaged

The kernel runs a background daemon called khugepaged that periodically scans memory looking for opportunities to collapse 512 contiguous 4 KB pages into a single 2 MB huge page. This is the gentler path -- it happens in the background without stalling the application, but it takes time. A region might exist as small pages for seconds or minutes before khugepaged gets around to promoting it.

khugepaged Page Collapse Operation (diagram)

Before: the PMD points to a page table of 512 PTEs, each mapping a scattered 4KB physical page. khugepaged scans the region, copies its contents into a newly allocated contiguous 2MB region, and rewrites the mapping.

After: the PMD points directly to one contiguous 2MB physical page (PSE=1).

Benefits:

  • ✓ Page table memory freed: 4KB saved (512 PTEs × 8 bytes)
  • ✓ TLB coverage: 512× increase (1 TLB entry instead of 512)
  • ✓ Page walk cost: 25% reduction (3 levels instead of 4)
  • ⚠ Migration harder: must move 2MB contiguous block

The daemon is configurable: you can control how frequently it scans, how many pages it examines per scan, and how aggressively it works. Conservative settings (scanning every 10 seconds, examining 4,096 pages per pass) minimize background CPU overhead. Aggressive settings (scanning every second, examining 8,192 pages) achieve higher THP coverage at the cost of more background work.

The Fragmentation Challenge

Physical memory fragmentation is the central obstacle to THP. Over time, as processes allocate and free memory in varying sizes, the physical address space becomes a patchwork of used and free regions. Finding 512 contiguous free pages (2 MB) in a fragmented address space can be difficult or impossible without moving existing pages around.

Memory Fragmentation and THP Allocation (diagram)

Before defragmentation, pages belonging to Process A and Process B are scattered across physical memory and no 2MB contiguous region is available, so THP allocation fails. After defragmentation, a 2MB contiguous region exists and the THP allocation for Process A succeeds. The kcompactd and kswapd daemons migrate pages to create contiguous regions, which may require reclaiming page cache or moving active pages.

Defragmentation Modes

The kernel offers several strategies for handling the situation when a contiguous region is not available. Each represents a different trade-off between THP coverage and latency predictability.

| Mode | Behavior | Latency Risk | Best For |
| --- | --- | --- | --- |
| always | Synchronously compacts memory on every THP allocation attempt | Severe (10-100ms stalls) | Benchmarking only |
| defer | Hands off to a background daemon (kcompactd) | None for the application | General-purpose servers |
| defer+madvise | Background for most allocations; synchronous for regions the application explicitly requests | Low, controlled | Databases using madvise |
| madvise | Only attempts THP for regions the application explicitly opts into | None for non-opted regions | Production (recommended) |
| never | No defragmentation at all | Zero | Real-time or latency-critical systems |

The "always" mode sounds appealing but is dangerous in production. When the kernel stalls an application thread for 50-100 milliseconds to compact memory, database query latencies spike, web requests time out, and tail latencies become unpredictable. The "madvise" mode is the widely recommended production choice because it lets informed applications opt in to THP for their large, long-lived allocations while protecting everything else from unexpected stalls.

Where THP Helps Most

THP Performance Impact by Workload Type (chart)

  • Speedups: Memcpy +50%, Database +30%, Analytics +25%, Video Encoding +18%, ML Training +12%
  • Slowdowns: Fork-heavy web servers -28%, Sparse access patterns -22%

THP delivers its greatest benefits to workloads that combine large memory footprints with dense, sequential access patterns. These are workloads where TLB pressure is a genuine bottleneck rather than a minor overhead.

Databases like PostgreSQL, MongoDB, and Redis maintain large buffer pools (often 10 GB or more) and perform sequential scans over huge tables. THP can reduce TLB misses by 60-70% and improve throughput by 15-35%.

Machine learning training involves large, contiguous tensor allocations that are accessed sequentially during forward and backward passes. THP typically provides a 5-15% training speedup by improving cache line utilization and reducing page walk overhead.

In-memory analytics engines like Spark and ClickHouse process billions of rows in columnar data structures. The combination of large allocations and sequential scans makes them ideal THP candidates, with measured query speedups of 20-40%.

| Workload | Typical Improvement | TLB Miss Reduction | Recommended Mode |
| --- | --- | --- | --- |
| PostgreSQL | +30% throughput | 70% | madvise |
| MongoDB | +25% throughput | 65% | madvise |
| Redis | +31% throughput | 68% | madvise |
| ML Training (PyTorch) | +9% throughput | 45% | defer |
| Spark Analytics | +35% throughput | 72% | madvise |
| Video Encoding | +18% throughput | 50% | defer |

The Dark Side: Memory Bloat

THP's biggest hidden cost is internal fragmentation. When an application allocates a small amount of memory -- say, 100 bytes -- the kernel may back that allocation with an entire 2 MB huge page. The remaining 2,097,052 bytes are wasted.

Internal Fragmentation: Memory Bloat with THP (diagram)

Without THP (4KB pages): a process allocating 100 × 5KB objects uses 500KB of data. Each 5KB object spans 2 × 4KB pages (8KB), so total RSS is 800KB -- 300KB (37.5%) wasted.

With THP enabled (2MB pages): the same 100 × 5KB objects are backed by a full 2MB huge page. Only 500KB of the 2048KB RSS is used -- 1548KB (75.6%) wasted, a 2.5× memory bloat.

Memory Bloat Risk: workloads with many small allocations can see a 2-4× memory usage increase with THP.

This is particularly damaging for applications that allocate many small objects. Redis, for example, stores millions of small key-value pairs (50-100 bytes each). With THP in "always" mode, each small allocation can trigger a 2 MB huge page, inflating memory usage by 2-3x. Node.js applications with their many small JavaScript objects see similar bloat. Java applications using the G1 garbage collector, which works with 1 MB regions, waste roughly 50% of each 2 MB huge page.

The bloat is not just about wasted RAM. On memory-constrained systems, it can trigger out-of-memory kills. And when the system needs to swap, it must move entire 2 MB chunks to disk rather than surgical 4 KB pages, amplifying I/O pressure.

When THP Hurts Performance

Beyond memory bloat, THP actively degrades performance in several scenarios:

Fork-heavy workloads suffer because fork() must handle huge pages during copy-on-write. Each 2 MB page that gets written after a fork must be copied in full, making fork operations 2-3x slower. This matters for web servers using prefork models (like Apache) and for Redis, which forks for background persistence.

Sparse memory access patterns waste the effort of loading 2 MB pages when only 4 KB within each page is ever touched. The page fault brings in a full 2 MB from physical memory, consuming bandwidth and cache space for data that will never be read.

Rapid allocation and deallocation cycles (common in memory allocators handling many short-lived objects) prevent khugepaged from ever successfully promoting pages. The overhead of scanning and attempting promotion is spent with no benefit.

Choosing the Right Configuration

The right THP configuration depends on your workload characteristics:

| Workload | THP Enabled | Defrag Mode | khugepaged |
| --- | --- | --- | --- |
| Database (large buffer pool) | madvise | madvise | Aggressive |
| Web server (fork-based) | never | never | Disabled |
| ML training | always | defer | Moderate |
| Redis / Memcached | madvise | madvise | Moderate |
| Java (large heap, G1GC) | madvise | defer | Conservative |
| Real-time / latency-critical | never | never | Disabled |

For most production systems, the safest starting point is madvise mode for both the enabled and defrag settings. This gives applications that understand their memory patterns the ability to opt in (by calling madvise(MADV_HUGEPAGE) on their large, long-lived allocations) while preventing the kernel from making autonomous decisions that cause bloat or latency spikes.

Applications that want to opt out of THP for specific regions (such as small-object heaps) can use madvise(MADV_NOHUGEPAGE) to explicitly prevent promotion.

Key Takeaways

  1. THP exists to solve TLB pressure. With 4 KB pages, a 1,536-entry TLB covers only 6 MB. With 2 MB pages, the same TLB covers 3 GB -- a 512x improvement that dramatically reduces expensive page table walks.

  2. The kernel offers two promotion paths. Synchronous allocation is fast when contiguous memory is available; the khugepaged daemon promotes pages in the background over time. Both are transparent to the application.

  3. Fragmentation is the fundamental challenge. Finding 512 contiguous physical pages becomes harder as memory ages. Defragmentation can solve this but introduces latency spikes if done synchronously.

  4. Memory bloat is the hidden cost. Small allocations backed by 2 MB pages waste enormous amounts of memory. Applications with many small objects (Redis, Node.js, Java with G1GC) are especially vulnerable.

  5. Use madvise mode in production. Let applications explicitly opt in to THP for regions that benefit (large buffer pools, tensor allocations) while protecting the rest of the system from unexpected side effects.

  6. Always measure. THP can deliver 10-35% improvements for the right workloads or cause 2-3x memory bloat and latency spikes for the wrong ones. Benchmark with and without THP, and monitor TLB miss rates and memory usage continuously.
