What Are Memory Access Patterns?
Memory access patterns are one of the most critical factors affecting application performance. The way your code accesses memory determines cache efficiency, memory bandwidth utilization, and whether hardware optimizations like prefetching can help.
Key Insight: The difference between optimal and suboptimal patterns can be 10x or more in performance!
Why Access Patterns Matter
The Memory Hierarchy Gap
Modern computers have a multi-level memory hierarchy:
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 Cache | 32-64 KB | 1-4 cycles | 3+ TB/s |
| L2 Cache | 256-512 KB | 10-20 cycles | 1+ TB/s |
| L3 Cache | 8-32 MB | 30-70 cycles | 500+ GB/s |
| Main Memory | 8-64 GB | 100-300 cycles | 50-100 GB/s |
The Gap: An L1 cache hit (1-4 cycles) is roughly 100x faster than a trip to main memory (100-300 cycles)!
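To see how quickly that gap erodes average performance, here is a minimal sketch of a two-level latency model. The cycle counts are assumptions picked from within the ranges in the table above, and the model ignores L2 and L3 entirely, so treat the numbers as illustrative rather than measured.

```python
# Assumed latencies in cycles, chosen from within the table's ranges.
L1_HIT_CYCLES = 4
DRAM_CYCLES = 200


def average_latency(hit_rate: float,
                    hit_cycles: float = L1_HIT_CYCLES,
                    miss_cycles: float = DRAM_CYCLES) -> float:
    """Expected cycles per access when each access is either a cache hit or a DRAM miss."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


for rate in (0.99, 0.90, 0.50):
    print(f"hit rate {rate:.0%}: {average_latency(rate):.1f} cycles per access")
```

Under these assumptions, dropping from a 99% to a 90% hit rate roughly quadruples the average cost of an access, which is why small changes in locality show up as large changes in run time.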
Cache Lines: The Unit of Transfer
- Memory moves between levels in cache lines, typically 64 bytes each
- Loading one byte loads the entire 64-byte line
- Spatial locality determines whether those 64 bytes are useful
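The mapping from byte addresses to cache lines can be sketched in a few lines. This assumes the 64-byte line size mentioned above; `line_index` and `lines_spanned` are hypothetical helper names used only for illustration.

```python
CACHE_LINE_BYTES = 64  # assumed line size; some CPUs use a different value


def line_index(address: int) -> int:
    """Index of the cache line that contains a given byte address."""
    return address // CACHE_LINE_BYTES


def lines_spanned(address: int, size: int) -> int:
    """Number of cache lines touched by `size` bytes starting at `address`."""
    if size <= 0:
        return 0
    return line_index(address + size - 1) - line_index(address) + 1


print(lines_spanned(0, 1))    # a single byte still pulls in one whole line
print(lines_spanned(0, 64))   # an aligned 64-byte object fits in one line
print(lines_spanned(60, 8))   # an unaligned 8-byte value straddles two lines
```

The last call shows why alignment matters: an 8-byte value that starts 4 bytes before a line boundary costs two line transfers instead of one.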
Sequential vs Strided Access
The same loop body can run 10x slower purely because of how it walks memory. Sequential access uses every byte the cache fetches; strided access throws most of it away.
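The waste can be estimated with a small model that counts how many cache lines a walk touches. It assumes a cold cache, 64-byte lines, an array starting at address 0, and that every touched line is fetched exactly once; `walk_stats` is a hypothetical helper, and Python is used only to do the arithmetic, not to reproduce the timing.

```python
CACHE_LINE_BYTES = 64  # assumed line size


def walk_stats(num_elements: int, element_bytes: int, stride_elements: int):
    """Return (lines_fetched, utilization) for reading `num_elements` elements,
    starting at address 0 and advancing `stride_elements` elements per step."""
    touched_lines = set()
    for i in range(num_elements):
        first = i * stride_elements * element_bytes
        last = first + element_bytes - 1
        touched_lines.update(
            range(first // CACHE_LINE_BYTES, last // CACHE_LINE_BYTES + 1)
        )
    useful_bytes = num_elements * element_bytes
    fetched_bytes = len(touched_lines) * CACHE_LINE_BYTES
    return len(touched_lines), useful_bytes / fetched_bytes


for stride in (1, 2, 4, 16):
    lines, util = walk_stats(1024, 4, stride)
    print(f"stride {stride:2d}: {lines:4d} lines fetched, {util:.1%} of fetched bytes used")
```

For 4-byte elements, a stride of 16 elements places every access on its own cache line, so the same 1024 reads fetch 16 times as many lines as the sequential walk.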
Common Patterns
| Scenario | Sequential-friendly layout | Strided / poor layout |
|---|---|---|
| Matrix traversal (row-major storage) | Row-by-row traversal: consecutive in memory | Column-by-column traversal: strided by the row width |
| Struct access (single field) | Struct of Arrays (SoA): sequential per field | Array of Structs (AoS): strided by the struct size |
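The strides in both rows of the table fall out of simple offset arithmetic. The sketch below assumes a 1024-column matrix of 8-byte elements and a hypothetical particle record with four 4-byte fields; none of these sizes come from the text above.

```python
ELEMENT_BYTES = 8   # assumed: one 8-byte double per matrix element
NUM_COLS = 1024     # assumed matrix width


def row_major_offset(row: int, col: int, num_cols: int = NUM_COLS) -> int:
    """Byte offset of element (row, col) in a row-major matrix."""
    return (row * num_cols + col) * ELEMENT_BYTES


# Moving along a row: neighbors are one element apart.
row_step = row_major_offset(0, 1) - row_major_offset(0, 0)
# Moving down a column: neighbors are a full row apart.
col_step = row_major_offset(1, 0) - row_major_offset(0, 0)
print(f"row step: {row_step} bytes, column step: {col_step} bytes")

# Hypothetical record with four 4-byte float fields.
FIELD_BYTES = 4
FIELDS = ("x", "y", "z", "mass")


def aos_offset(index: int, field: str) -> int:
    """Byte offset of one field in an Array of Structs layout."""
    return index * len(FIELDS) * FIELD_BYTES + FIELDS.index(field) * FIELD_BYTES


def soa_offset(index: int, field: str, count: int) -> int:
    """Byte offset of one field in a Struct of Arrays layout holding `count` records."""
    return FIELDS.index(field) * count * FIELD_BYTES + index * FIELD_BYTES


aos_step = aos_offset(1, "x") - aos_offset(0, "x")
soa_step = soa_offset(1, "x", 1000) - soa_offset(0, "x", 1000)
print(f"AoS step for x: {aos_step} bytes, SoA step for x: {soa_step} bytes")
```

With these sizes, a column walk jumps 8192 bytes per element, and reading only `x` from the AoS layout uses 4 of every 16 bytes, while the SoA layout reads `x` values back to back.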
Hardware Prefetching
Modern CPUs include sophisticated prefetchers:
What They Do:
- Detect access patterns (sequential, stride, stream)
- Load data into cache before it's needed
- Multiple prefetch units (L1, L2, L3)
- Adaptive learning of patterns
| Prefetcher-friendly | Prefetcher-unfriendly |
|---|---|
| Sequential access | Random access |
| Fixed stride (if not too large) | Large irregular strides |
| Stream processing | Pointer chasing |
| Linear traversal | Hash-table lookups |
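Why the left column is friendly and the right is not can be illustrated with a toy stride detector. This is a deliberate simplification and not a model of any specific CPU: it predicts the next address only after it has seen the same stride twice in a row, and `predicted_fraction` is a hypothetical name.

```python
import random


def predicted_fraction(addresses, confirmations: int = 2) -> float:
    """Fraction of accesses a toy stride detector would have predicted.

    The detector predicts `last + stride` once the same stride has been
    observed `confirmations` times in a row.
    """
    if not addresses:
        return 0.0
    predicted = 0
    last = None
    stride = None
    streak = 0
    for addr in addresses:
        if last is not None:
            # Would the detector have guessed this address before seeing it?
            if stride is not None and streak >= confirmations and addr == last + stride:
                predicted += 1
            # Update the detector with the stride that actually occurred.
            delta = addr - last
            if delta == stride:
                streak += 1
            else:
                stride, streak = delta, 1
        last = addr
    return predicted / len(addresses)


sequential = [i * 64 for i in range(100)]          # one cache line per step
fixed_stride = [i * 4096 for i in range(100)]      # constant large stride
irregular = [i * i * 64 for i in range(100)]       # stride changes on every step
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)                 # stands in for pointer chasing

for name, trace in [("sequential", sequential), ("fixed stride", fixed_stride),
                    ("irregular", irregular), ("shuffled", shuffled)]:
    print(f"{name:12s}: {predicted_fraction(trace):.0%} predicted")
```

In this model the sequential and fixed-stride traces are predicted after a short warm-up, while the trace whose stride keeps changing is never predicted. Real prefetchers also have limits on stride size and page boundaries that this sketch does not capture.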
Optimization Strategies
| Area | Techniques |
|---|---|
| Data structure design | Contiguous arrays; SoA for partial field access; align critical data to cache-line boundaries |
| Algorithm design | Cache-friendly traversal order; block/tile matrix algorithms; minimize working-set size |
| Loop optimization | Interchange loops for sequential access; tile/block for locality; manual prefetch for irregular patterns |
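The effect of tiling can be sketched by counting misses in a simulated cache rather than timing real hardware. The model below is an assumption-heavy illustration: a fully associative, 32-line LRU cache with write-allocate, 64-byte lines, 8-byte elements, and a 64 x 64 matrix transpose as the workload. `LRUCache` and `transpose_misses` are hypothetical names.

```python
from collections import OrderedDict

LINE_BYTES = 64   # assumed cache line size
ELEM_BYTES = 8    # assumed element size (one double)


class LRUCache:
    """Fully associative cache holding `capacity` lines with LRU replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, address: int) -> None:
        line = address // LINE_BYTES
        if line in self.lines:
            self.lines.move_to_end(line)       # mark as most recently used
            return
        self.misses += 1
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)     # evict the least recently used line


def transpose_misses(n: int, tile: int, cache_lines: int = 32) -> int:
    """Count misses for B[j][i] = A[i][j] on row-major n x n matrices.

    Passing tile == n gives the plain, untiled loop nest.
    """
    cache = LRUCache(cache_lines)
    b_base = n * n * ELEM_BYTES                # B is placed directly after A
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    cache.access((i * n + j) * ELEM_BYTES)           # read A[i][j]
                    cache.access(b_base + (j * n + i) * ELEM_BYTES)  # write B[j][i]
    return cache.misses


naive = transpose_misses(64, tile=64)
tiled = transpose_misses(64, tile=8)
print(f"untiled: {naive} misses, 8x8 tiles: {tiled} misses")
```

With these parameters an 8 x 8 tile touches 16 distinct lines, which fit in the 32-line cache, so each line of A and B is fetched only once. The untiled loop writes B with a stride of one full row and evicts each line before it is reused.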
Measuring Performance
Key Metrics
| Metric | Definition |
|---|---|
| Cache hit rate | Hits / Total accesses × 100 |
| Memory bandwidth | Bytes transferred per second |
| Cache-line utilization | Useful bytes / 64 bytes |
| Prefetch accuracy | Useful prefetches / total prefetches |
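The definitions in the table translate directly into small helper functions. The counter values below are made up for illustration and are not measurements; the function names are hypothetical.

```python
def cache_hit_rate(loads: int, load_misses: int) -> float:
    """Hit rate in percent, derived from total loads and load misses."""
    return 100.0 * (loads - load_misses) / loads


def memory_bandwidth_gb_s(bytes_transferred: int, seconds: float) -> float:
    """Bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return bytes_transferred / seconds / 1e9


def cache_line_utilization(useful_bytes: int, line_bytes: int = 64) -> float:
    """Fraction of each fetched cache line that the program actually uses."""
    return useful_bytes / line_bytes


def prefetch_accuracy(useful_prefetches: int, total_prefetches: int) -> float:
    """Fraction of issued prefetches that were used before eviction."""
    return useful_prefetches / total_prefetches


# Hypothetical counter values, for illustration only.
print(f"hit rate:          {cache_hit_rate(1_000_000, 25_000):.1f}%")
print(f"bandwidth:         {memory_bandwidth_gb_s(2_000_000_000, 0.5):.1f} GB/s")
print(f"line utilization:  {cache_line_utilization(4):.4f}")
print(f"prefetch accuracy: {prefetch_accuracy(800, 1000):.2f}")
```

`cache_hit_rate` takes loads and misses rather than hits because profilers commonly report those two counters, so hits have to be derived as the difference.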
Tools
- Linux perf: `perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program`
- Intel VTune: `vtune -collect memory-access ./program`
Best Practices
- Design for Sequential Access: Arrange data structures for linear traversal
- Minimize Stride: Keep related data close together
- Use Cache-Aware Algorithms: Block matrix multiply, tiled convolution
- Profile Real Workloads: Memory patterns vary by input
- Consider NUMA Effects: Access patterns affect NUMA systems differently
Conclusion
Memory access patterns can make or break performance. Sequential access benefits from spatial locality, full use of each cache line, and hardware prefetching. Strided access wastes bandwidth, evicts useful data from the cache, and can defeat the prefetcher when strides are large or irregular. Restructuring data layouts and loop order around these patterns can yield 10x or greater speedups without changing the underlying algorithm.
