
CPU Cache Lines: The Complete Guide with Interactive Simulator

Deep dive into CPU cache lines — an interactive cache simulator with configurable associativity and replacement policies, false sharing and MESI protocol visualizations, access pattern benchmarks, and optimization techniques.

22 min | memory · cache · performance · hardware
Best viewed on desktop for optimal interactive experience

A cache line is the fundamental unit of data transfer between main memory and the CPU cache. When your CPU needs data from memory, it doesn’t fetch just the bytes you requested—it fetches an entire cache line, typically 64 bytes on modern processors.

This design exploits spatial locality: if you access a memory location, you’re likely to access nearby locations soon.

Performance Impact: The difference between cache-friendly and cache-unfriendly access patterns can be 10-100x!

Interactive Cache Line Visualization

See exactly how different memory access patterns interact with 64-byte cache lines:

How Caches Work

Sets, Ways, and Tags

A cache is organized as a table of sets, each containing one or more ways (slots). When the CPU accesses an address, the cache hardware splits it into three fields:

  • Offset (lowest bits): which byte within the cache line (for a 64-byte line, bits 0–5)
  • Index (middle bits): which set to look in
  • Tag (highest bits): identifies which memory block is stored in this slot
```
Address: [   Tag    |  Index   | Offset  ]
          high bits   mid bits   low bits
```
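As a concrete sketch, the field extraction is just shifts and masks (or, equivalently, division and modulo). The parameters below assume 64-byte lines and a hypothetical 64-set cache; real caches differ only in the constants:

```cpp
#include <cstdint>

// Field extraction for a cache with 64-byte lines and 64 sets:
// offset = bits 0-5, index = bits 6-11, tag = everything above.
constexpr uint64_t LINE_BYTES = 64;   // offset field: log2(64) = 6 bits
constexpr uint64_t NUM_SETS   = 64;   // index field:  log2(64) = 6 bits

uint64_t offset_of(uint64_t addr) { return addr % LINE_BYTES; }
uint64_t index_of(uint64_t addr)  { return (addr / LINE_BYTES) % NUM_SETS; }
uint64_t tag_of(uint64_t addr)    { return addr / (LINE_BYTES * NUM_SETS); }
```

Note that with these parameters, two addresses exactly 4 KB apart (64 lines × 64 bytes) get the same index but different tags, so they compete for the same set.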

Associativity

  • Direct-mapped (1-way): each memory block maps to exactly one set. Fast but high conflict miss rate.
  • 2-way set associative: each block can go in either of 2 slots. Reduces conflicts.
  • 4/8-way: more flexibility, fewer conflicts, more tag comparisons.
  • Fully associative: any block can go anywhere. Best hit rate, impractical for large caches.

Replacement Policies

When a set is full and a new block must be loaded:

  • LRU (Least Recently Used): evict the block unused longest. Best hit rate, expensive to track.
  • FIFO (First In, First Out): evict the oldest. Simpler, can evict hot blocks.
  • Random: evict randomly. Surprisingly competitive with LRU for large caches.
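LRU can be sketched with a recency-ordered list. This is a toy model of my own for counting misses in one set, not how hardware tracks recency (real caches typically use cheaper pseudo-LRU bits):

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Count misses for a block-address trace in one set under LRU.
// The deque keeps blocks ordered front = most recently used.
int lru_misses(const std::vector<int>& trace, size_t ways) {
    std::deque<int> set;
    int misses = 0;
    for (int block : trace) {
        auto it = std::find(set.begin(), set.end(), block);
        if (it != set.end()) {
            set.erase(it);                           // hit: refresh recency
        } else {
            ++misses;
            if (set.size() == ways) set.pop_back();  // evict LRU (back)
        }
        set.push_front(block);
    }
    return misses;
}
```

One instructive case: a cyclic trace over 4 blocks in a 3-way set misses on every access under LRU, because each block is evicted just before it is needed again.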

Write Policies

  • Write-back: writes update only cache. Dirty lines written to memory on eviction. Fewer writes, needs dirty bits.
  • Write-through: every write updates both cache and memory. Simpler, more traffic.
  • Write-allocate: on write miss, load block first then write. Pairs with write-back.
  • No-write-allocate: on write miss, write to memory without loading. Pairs with write-through.
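To make the traffic difference concrete, here is a toy model (my own illustration, not from the article): N back-to-back stores to one line cost N memory writes under write-through, but a single writeback under write-back:

```cpp
// Write-back: stores only set the dirty bit; memory sees one write,
// at eviction, and only if the line is dirty.
struct WriteBackLine {
    bool dirty = false;
    int mem_writes = 0;                            // traffic to memory
    void store() { dirty = true; }                 // update cache only
    void evict() { if (dirty) { ++mem_writes; dirty = false; } }
};

int writethrough_traffic(int stores) { return stores; }  // one write per store

int writeback_traffic(int stores) {
    WriteBackLine line;
    for (int i = 0; i < stores; ++i) line.store();
    line.evict();
    return line.mem_writes;                        // 1 if any store happened
}
```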

Cache Simulator

Configure cache parameters and step through memory operations to see hits, misses, evictions, and writebacks. Toggle multi-level to watch misses cascade through L1 → L2 → L3 → Memory.

Multi-Level Cache Hierarchy

Modern CPUs use three cache levels, trading speed for capacity:

| Level | Size | Latency | Associativity | Shared? |
|-------|------|---------|---------------|---------|
| L1 | 32–64 KB | ~1 ns (4 cycles) | 8-way | Per core |
| L2 | 256 KB–1 MB | ~5 ns (12 cycles) | 8-way | Per core |
| L3 | 8–64 MB | ~20 ns (40 cycles) | 16-way | All cores |
| Memory | 16–256 GB | ~100 ns (200 cycles) | N/A | All cores |

Inclusion Policies

  • Inclusive: L2 contains a superset of L1. Simple coherency, wastes L2 space.
  • Exclusive: L1 and L2 hold different data. More effective capacity.
  • NINE (Non-Inclusive Non-Exclusive): Intel since Skylake. No strict relationship.

Why Cache Lines Exist

The Memory Hierarchy Gap

Modern CPUs execute instructions in less than a nanosecond, but accessing main memory takes 60–100 nanoseconds. That’s a 100x difference!

Cache lines help bridge this gap by:

  1. Amortizing Memory Access Cost: Fetching 64 bytes takes barely more time than 8 bytes
  2. Exploiting Spatial Locality: Programs often access nearby data
  3. Enabling Prefetching: Hardware can predict and load future cache lines
  4. Maximizing Memory Bandwidth: Efficient use of memory bus width

Cache Line Size: Why 64 Bytes?

Modern x86-64 processors use 64-byte cache lines:

  • 8 × 64-bit integers or doubles
  • 16 × 32-bit integers or floats
  • 64 × 8-bit characters

This size balances:

  • Transfer efficiency: Larger lines amortize memory latency
  • Cache pollution: Smaller lines waste less space on unused data
  • False sharing: Larger lines increase the chance that unrelated variables end up sharing a line

Access Pattern Impact

Sequential Access (Best Case)

Efficiency: Near 100% of transferred bytes are used

  • Uses all 8 elements per cache line
  • Hardware prefetchers excel with sequential patterns
  • Maximum memory bandwidth utilization

Strided Access (Poor)

Efficiency: Only 12.5% with stride-8

  • Uses only 1 element per 64-byte cache line
  • Prefetcher struggles with large strides
  • 8x more memory traffic than necessary
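A hypothetical micro-benchmark sketch of this effect: sum the same number of int64_t elements sequentially (stride 1) and with stride 8. Since 8 × 8 bytes = 64 bytes, the strided loop touches a new cache line on every access:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sum `count` elements starting at index 0, stepping by `stride`.
int64_t sum_stride(const std::vector<int64_t>& a, size_t stride, size_t count) {
    int64_t sum = 0;
    for (size_t i = 0, idx = 0; i < count; ++i, idx += stride)
        sum += a[idx % a.size()];   // wrap so both loops read `count` elements
    return sum;
}

// Time both patterns over an array larger than the LLC.
void bench() {
    std::vector<int64_t> data(1 << 24, 1);   // 128 MB of ones
    using clk = std::chrono::steady_clock;
    for (size_t stride : {size_t(1), size_t(8)}) {
        auto t0 = clk::now();
        int64_t s = sum_stride(data, stride, data.size() / 8);
        double ms = std::chrono::duration<double, std::milli>(clk::now() - t0).count();
        std::printf("stride %zu: sum=%lld, %.1f ms\n", stride, (long long)s, ms);
    }
}
```

Both loops do identical arithmetic on the same number of elements; on typical hardware the stride-8 loop is several times slower purely from the extra memory traffic. Exact ratios depend on the prefetcher and DRAM bandwidth.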

Random Access (Worst Case)

Efficiency: Typically 1–2 elements per cache line

  • Defeats spatial locality entirely
  • Prefetcher cannot predict random patterns
  • Maximum cache miss rate

Loop Tiling

When processing a 2D array column-by-column, each access hits a different cache line. Loop tiling processes data in cache-sized blocks:

```c
// Bad: column-major traversal of a row-major array
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j];   // stride = N * sizeof(int)

// Good: tiled traversal
const int TILE = 64;
for (int ii = 0; ii < N; ii += TILE)
    for (int jj = 0; jj < N; jj += TILE)
        for (int i = ii; i < min(ii + TILE, N); i++)
            for (int j = jj; j < min(jj + TILE, N); j++)
                sum += matrix[i][j];   // stays in L1
```

False Sharing

False sharing is the most insidious cache performance bug. Two threads access different variables that share the same cache line. Each write invalidates the other core’s copy, causing constant cache-to-cache transfers.
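A minimal reproduction sketch (my own illustration): two threads each hammer their own counter. When the counters sit in one struct they share a line; with `alignas(64)` padding they do not. Timing the two variants exposes the coherency traffic:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

struct Shared {                       // a and b share one 64-byte line
    std::atomic<int64_t> a{0}, b{0};
};

struct Padded {                       // each counter gets its own line
    alignas(64) std::atomic<int64_t> a{0};
    alignas(64) std::atomic<int64_t> b{0};
};

// Two threads, two logically independent counters, same total work.
template <class Counters>
void hammer(Counters& c, int64_t iters) {
    std::thread t1([&] { for (int64_t i = 0; i < iters; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (int64_t i = 0; i < iters; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}
```

The results are identical either way; only the time differs. On typical hardware `hammer` on `Padded` runs severalfold faster than on `Shared` for large iteration counts.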

Detecting False Sharing

```bash
# Linux perf c2c: look for HITM events
perf c2c record ./my_program
perf c2c report
# High HITM count = false sharing
# The report shows contended cache lines and the accessing code
```

Fixing False Sharing

```cpp
// alignas forces each counter to its own cache line
struct alignas(64) PaddedCounter {
    std::atomic<int> value;
};
PaddedCounter counters[NUM_THREADS];
```

Data Structure Layout

Array of Structs (AoS): Poor when accessing single fields

  • Each element may span multiple cache lines
  • Wastes bandwidth on unused fields

Struct of Arrays (SoA): Good for single field access

  • Sequential access to each field
  • Full cache line utilization
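The two layouts side by side, as a sketch (the `Particle` fields are hypothetical). Summing only `x` over AoS loads 16 bytes per 4 used; over SoA every byte fetched is used:

```cpp
#include <vector>

// AoS: each element interleaves all fields.
struct Particle { float x, y, z, w; };

float sum_x_aos(const std::vector<Particle>& ps) {
    float s = 0;
    for (const auto& p : ps) s += p.x;   // 4 of every 16 bytes used
    return s;
}

// SoA: each field is its own contiguous array.
struct Particles { std::vector<float> x, y, z, w; };

float sum_x_soa(const Particles& ps) {
    float s = 0;
    for (float v : ps.x) s += v;         // full cache line utilization
    return s;
}
```

SoA also vectorizes more readily, since the values a SIMD load needs are already adjacent.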

Measuring Cache Performance

perf stat

```
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads,LLC-load-misses,LLC-loads ./program

 Performance counter stats for './program':

     1,234,567   L1-dcache-load-misses   # 2.5% of all L1 loads
    49,382,716   L1-dcache-loads
        12,345   LLC-load-misses         # 0.8% of all LLC loads
     1,543,210   LLC-loads
```

cachegrind (Valgrind)

```bash
$ valgrind --tool=cachegrind ./program
$ cg_annotate cachegrind.out.12345
# Shows per-line cache miss counts in your source code
```

Best Practices

  1. Access Memory Sequentially: Design algorithms for linear access patterns
  2. Keep Working Sets Small: Fit hot data in L1/L2 cache
  3. Align Data Structures: Respect cache line boundaries
  4. Avoid False Sharing: Pad shared data structures to 64 bytes
  5. Profile Real Workloads: Cache behavior varies by data

Key Takeaways

  1. Memory moves in 64-byte cache lines — accessing one byte loads 63 neighbors for free. Design data layouts to exploit this.

  2. Sequential access is 10-100x faster than random — spatial locality gives >90% hit rate even for huge arrays.

  3. False sharing kills multi-threaded performance — pad shared data to 64-byte boundaries. Use perf c2c to detect.

  4. Cache associativity reduces conflict misses — direct-mapped caches have pathological conflicts. 4-8 way eliminates most.

  4. Profile before optimizing — perf stat shows L1/LLC miss rates. If the L1 miss rate is under 5%, cache isn’t your bottleneck.
