A cache line is the fundamental unit of data transfer between main memory and the CPU cache. When your CPU needs data from memory, it doesn’t fetch just the bytes you requested—it fetches an entire cache line, typically 64 bytes on modern processors.
This design exploits spatial locality: if you access a memory location, you’re likely to access nearby locations soon.
Performance Impact: The difference between cache-friendly and cache-unfriendly access patterns can be 10-100x!
How Caches Work
Sets, Ways, and Tags
A cache is organized as a table of sets, each containing one or more ways (slots). When the CPU accesses an address, the cache hardware splits it into three fields:
- Offset (lowest bits): which byte within the cache line (for a 64-byte line, bits 0–5)
- Index (middle bits): which set to look in
- Tag (highest bits): identifies which memory block is stored in this slot
```
Address:  [   Tag    |   Index   |  Offset  ]
            high bits   mid bits    low bits
```
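As a sketch, here is how that split can be computed for a hypothetical cache with 64-byte lines and 64 sets (the field widths are assumptions; real caches vary):

```cpp
#include <cstdint>

// Hypothetical geometry: 64-byte lines (6 offset bits), 64 sets (6 index bits).
constexpr uint64_t OFFSET_BITS = 6;
constexpr uint64_t INDEX_BITS  = 6;

struct AddrFields { uint64_t tag, index, offset; };

// Split an address into the three cache fields.
AddrFields split(uint64_t addr) {
    return {
        addr >> (OFFSET_BITS + INDEX_BITS),                // tag: remaining high bits
        (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1),  // index: middle bits
        addr & ((1u << OFFSET_BITS) - 1),                  // offset: low 6 bits
    };
}
```

For example, address `0x12345` splits into tag `0x12`, set index `0xD`, and byte offset `0x5`.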
Associativity
- Direct-mapped (1-way): each memory block maps to exactly one slot. Fast lookup, but high conflict miss rate.
- 2-way set associative: each block can go in either of 2 slots. Reduces conflicts.
- 4/8-way: more flexibility, fewer conflicts, more tag comparisons.
- Fully associative: any block can go anywhere. Best hit rate, impractical for large caches.
Replacement Policies
When a set is full and a new block must be loaded:
- LRU (Least Recently Used): evict the block unused longest. Best hit rate, expensive to track.
- FIFO (First In, First Out): evict the oldest. Simpler, can evict hot blocks.
- Random: evict randomly. Surprisingly competitive with LRU for large caches.
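A minimal sketch of LRU tracking for a single set (real hardware approximates this with pseudo-LRU bits rather than a true recency list; all names here are illustrative):

```cpp
#include <deque>
#include <algorithm>
#include <cstdint>
#include <cstddef>

// One cache set: ways kept in recency order, front = most recently used.
struct LruSet {
    std::deque<uint64_t> ways;   // stored tags, MRU first
    std::size_t capacity;        // associativity (number of ways)

    // Returns true on hit; on a miss, inserts the tag and evicts the
    // least recently used entry if the set is full.
    bool access(uint64_t tag) {
        auto it = std::find(ways.begin(), ways.end(), tag);
        if (it != ways.end()) {              // hit: promote to MRU
            ways.erase(it);
            ways.push_front(tag);
            return true;
        }
        if (ways.size() == capacity)         // miss in a full set:
            ways.pop_back();                 // evict the LRU entry
        ways.push_front(tag);
        return false;
    }
};
```

With a 2-way set, accessing tags 1, 2, 1, 3 evicts tag 2 (the least recently used), so a later access to 2 misses again.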
Write Policies
- Write-back: writes update only cache. Dirty lines written to memory on eviction. Fewer writes, needs dirty bits.
- Write-through: every write updates both cache and memory. Simpler, more traffic.
- Write-allocate: on write miss, load block first then write. Pairs with write-back.
- No-write-allocate: on write miss, write to memory without loading. Pairs with write-through.
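The write-back + write-allocate pairing can be sketched with a single-line toy cache; `writebacks` counts how often a dirty eviction actually reaches memory (the structure is illustrative, not how hardware is organized):

```cpp
#include <cstdint>

// One cache line with the metadata write-back needs: a valid bit and a dirty bit.
struct Line { uint64_t tag = 0; bool valid = false, dirty = false; };

struct WriteBackCache {
    Line line;            // toy cache: a single line
    int writebacks = 0;   // dirty evictions that reached memory

    // Write-allocate: a write miss brings the block in, then dirties it.
    void write(uint64_t tag) {
        if (line.valid && line.tag != tag && line.dirty)
            ++writebacks;              // evicting a dirty line: write it back
        line = {tag, true, true};      // line now holds this block, dirtied
    }
};
```

Repeated writes to the same block cost no memory traffic at all; only replacing a dirty block does.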
Multi-Level Cache Hierarchy
Modern CPUs use three cache levels, trading speed for capacity:
| Level | Size | Latency | Associativity | Shared? |
|---|---|---|---|---|
| L1 | 32–64 KB | ~1 ns (4 cycles) | 8-way | Per core |
| L2 | 256 KB–1 MB | ~5 ns (12 cycles) | 8-way | Per core |
| L3 | 8–64 MB | ~20 ns (40 cycles) | 16-way | All cores |
| Memory | 16–256 GB | ~100 ns (200 cycles) | N/A | All cores |
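These latencies are commonly measured with a pointer chase: following a randomly shuffled chain defeats the prefetcher, and the data dependency serializes the loads, so each hop pays roughly the latency of whichever level the working set fits in. A sketch of the setup (timing code omitted):

```cpp
#include <vector>
#include <numeric>
#include <random>
#include <utility>
#include <cstddef>

// Build a random single-cycle permutation of length n (Sattolo's algorithm),
// so a chase visits every element exactly once per lap.
std::vector<std::size_t> make_chain(std::size_t n) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937 rng(42);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    return next;
}

// Follow the chain for `hops` steps; each load depends on the previous one,
// so hops cannot overlap and total time / hops approximates one load latency.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t hops) {
    std::size_t p = 0;
    for (std::size_t i = 0; i < hops; ++i) p = next[p];
    return p;
}
```

Wrap `chase` in a timer and sweep `n` from a few KB to hundreds of MB to see the per-hop latency step up at each capacity boundary in the table above.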
Inclusion Policies
- Inclusive: L2 contains a superset of L1. Simple coherency, wastes L2 space.
- Exclusive: L1 and L2 hold different data. More effective capacity.
- NINE (Non-Inclusive Non-Exclusive): no strict relationship between levels; used for the L3 in Intel server parts since Skylake-SP.
Why Cache Lines Exist
The Memory Hierarchy Gap
Modern CPUs execute instructions in less than a nanosecond, but accessing main memory takes 60–100 nanoseconds. That’s a 100x difference!
Cache lines help bridge this gap by:
- Amortizing Memory Access Cost: Fetching 64 bytes takes barely more time than 8 bytes
- Exploiting Spatial Locality: Programs often access nearby data
- Enabling Prefetching: Hardware can predict and load future cache lines
- Maximizing Memory Bandwidth: Efficient use of memory bus width
Cache Line Size: Why 64 Bytes?
Modern x86-64 processors use 64-byte cache lines:
- 8 × 64-bit integers or doubles
- 16 × 32-bit integers or floats
- 64 × 8-bit characters
This size balances:
- Transfer efficiency: Larger lines amortize memory latency
- Cache pollution: Smaller lines waste less space on unused data
- False sharing: Larger lines increase conflict probability
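The arithmetic behind that list can be sanity-checked at compile time (the 64-byte figure is the common x86-64 value, not a guarantee; other architectures use 32- or 128-byte lines):

```cpp
#include <cstdint>
#include <cstddef>

// Typical x86-64 cache line size (assumption; query the CPU to be sure).
constexpr std::size_t CACHE_LINE = 64;

static_assert(CACHE_LINE / sizeof(std::uint64_t) == 8,  "8 x 64-bit integers or doubles");
static_assert(CACHE_LINE / sizeof(float)         == 16, "16 x 32-bit integers or floats");
static_assert(CACHE_LINE / sizeof(char)          == 64, "64 x 8-bit characters");
```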
Access Pattern Impact
Sequential Access (Best Case)
Efficiency: Near 100% of transferred bytes are used
- Uses all 8 elements per cache line
- Hardware prefetchers excel with sequential patterns
- Maximum memory bandwidth utilization
Strided Access (Poor)
Efficiency: Only 12.5% with stride-8
- Uses only 1 element per 64-byte cache line
- Prefetcher struggles with large strides
- 8x more memory traffic than necessary
Random Access (Worst Case)
Efficiency: Typically 1–2 elements per cache line
- Defeats spatial locality entirely
- Prefetcher cannot predict random patterns
- Maximum cache miss rate
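The efficiency figures above follow from simple arithmetic. A sketch, assuming 8-byte elements and a stride that divides the 8 elements per line:

```cpp
#include <cstddef>

// Fraction of each 64-byte line actually consumed when reading
// fixed-size elements at a given stride (stride counted in elements).
// Assumes the stride divides the elements-per-line count evenly.
double line_utilization(std::size_t stride_elems, std::size_t elem_size = 8) {
    std::size_t elems_per_line = 64 / elem_size;       // 8 for int64/double
    if (stride_elems >= elems_per_line)
        return 1.0 / elems_per_line;                   // one element per line
    return 1.0 / stride_elems;                         // every stride-th element used
}
```

Stride 1 gives 100% utilization, stride 2 gives 50%, and stride 8 (or anything larger) gives 12.5%: the 8x extra memory traffic mentioned above.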
Loop Tiling
When processing a 2D array column-by-column, each access hits a different cache line. Loop tiling processes data in cache-sized blocks:
```cpp
// Bad: column-major traversal of a row-major array
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j];   // stride = N * sizeof(int)

// Good: tiled traversal keeps each block hot in cache
const int TILE = 64;
for (int ii = 0; ii < N; ii += TILE)
    for (int jj = 0; jj < N; jj += TILE)
        for (int i = ii; i < std::min(ii + TILE, N); i++)
            for (int j = jj; j < std::min(jj + TILE, N); j++)
                sum += matrix[i][j];   // stays in L1
```
False Sharing
False sharing is the most insidious cache performance bug. Two threads access different variables that share the same cache line. Each write invalidates the other core’s copy, causing constant cache-to-cache transfers.
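A sketch of the problematic layout: two 8-byte atomics packed into one line, hammered by two threads. The counts come out correct either way; only the throughput suffers (names and iteration counts are illustrative):

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Two atomics packed into the same 64-byte cache line (8 bytes each).
struct SharedLine { std::atomic<long> a{0}, b{0}; };

// Each thread increments its own counter, yet every write invalidates
// the other core's copy of the shared line, ping-ponging it between L1s.
std::pair<long, long> run_pair(long n) {
    SharedLine s;
    std::thread t1([&]{ for (long k = 0; k < n; ++k)
        s.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&]{ for (long k = 0; k < n; ++k)
        s.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return {s.a.load(), s.b.load()};
}
```

The results are always correct; the cost is invisible in the program logic and shows up only in profiles, which is why this bug is so insidious.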
Detecting False Sharing
```shell
# Linux perf c2c: look for HITM events
perf c2c record ./my_program
perf c2c report
# High HITM count = false sharing
# The report shows contended cache lines and the accessing code
```
Fixing False Sharing
```cpp
// alignas forces each counter onto its own cache line
struct alignas(64) PaddedCounter {
    std::atomic<int> value;
};
PaddedCounter counters[NUM_THREADS];
```
Data Structure Layout
Array of Structs (AoS): Poor when accessing single fields
- Each element may span multiple cache lines
- Wastes bandwidth on unused fields
Struct of Arrays (SoA): Good for single field access
- Sequential access to each field
- Full cache line utilization
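The two layouts, sketched for a particle system (type and field names are illustrative):

```cpp
#include <vector>
#include <cstddef>

// AoS: each Particle spans 24 bytes, so summing only x wastes two thirds
// of every fetched cache line on the unused y and z fields.
struct Particle { double x, y, z; };

double sum_x_aos(const std::vector<Particle>& ps) {
    double s = 0;
    for (const Particle& p : ps) s += p.x;   // 24-byte stride
    return s;
}

// SoA: all x values are contiguous, so every byte of every line is used
// and the hardware prefetcher sees a pure sequential stream.
struct Particles { std::vector<double> x, y, z; };

double sum_x_soa(const Particles& ps) {
    double s = 0;
    for (double v : ps.x) s += v;            // 8-byte sequential stride
    return s;
}
```

Both functions compute the same sum; only the memory traffic differs.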
Measuring Cache Performance
perf stat
```shell
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads,LLC-load-misses,LLC-loads ./program

 Performance counter stats for './program':

     1,234,567  L1-dcache-load-misses   # 2.5% of all L1 loads
    49,382,716  L1-dcache-loads
        12,345  LLC-load-misses         # 0.8% of all LLC loads
     1,543,210  LLC-loads
```
cachegrind (Valgrind)
```shell
$ valgrind --tool=cachegrind ./program
$ cg_annotate cachegrind.out.12345
# Shows per-line cache miss counts in your source code
```
Best Practices
- Access Memory Sequentially: Design algorithms for linear access patterns
- Keep Working Sets Small: Fit hot data in L1/L2 cache
- Align Data Structures: Respect cache line boundaries
- Avoid False Sharing: Pad shared data structures to 64 bytes
- Profile Real Workloads: Cache behavior varies by data
Key Takeaways
- Memory moves in 64-byte cache lines: accessing one byte loads its 63 neighbors for free. Design data layouts to exploit this.
- Sequential access is 10-100x faster than random: spatial locality gives >90% hit rates even for huge arrays.
- False sharing kills multi-threaded performance: pad shared data to 64-byte boundaries, and use `perf c2c` to detect it.
- Cache associativity reduces conflict misses: direct-mapped caches have pathological conflicts; 4-8 ways eliminate most of them.
- Profile before optimizing: `perf stat` shows L1/LLC miss rates. If the L1 miss rate is under 5%, cache isn't your bottleneck.
Related Concepts
- Memory Access Patterns: How patterns affect cache efficiency
- Virtual Memory: Page-level memory management
- Memory Interleaving: Address mapping to banks
- NUMA Architecture: Cache coherency across sockets
Further Reading
- UW CSE 351 Cache Simulator - Interactive simulator where you configure cache capacity, block size, associativity, and replacement policy, then step through memory accesses to see hits and misses in real time
- What Every Programmer Should Know About Memory - Ulrich Drepper’s comprehensive guide covering cache architecture, NUMA, and memory optimization
- Gallery of Processor Cache Effects - Visual demonstrations of cache behavior with benchmark code