Memory Access Patterns: Sequential vs Strided

Summary
Master sequential vs strided memory access patterns. Learn how cache efficiency and hardware prefetching affect application performance.

What Are Memory Access Patterns?

Memory access patterns are one of the most critical factors affecting application performance. The way your code accesses memory determines cache efficiency, memory bandwidth utilization, and whether hardware optimizations like prefetching can help.

Key Insight: The difference between optimal and suboptimal patterns can be 10x or more in performance!

Why Access Patterns Matter

The Memory Hierarchy Gap

Modern computers have a multi-level memory hierarchy:

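| Level       | Size       | Latency        | Bandwidth   |
| ----------- | ---------- | -------------- | ----------- |
| L1 Cache    | 32-64 KB   | 1-4 cycles     | 3+ TB/s     |
| L2 Cache    | 256-512 KB | 10-20 cycles   | 1+ TB/s     |
| L3 Cache    | 8-32 MB    | 30-70 cycles   | 500+ GB/s   |
| Main Memory | 8-64 GB    | 100-300 cycles | 50-100 GB/s |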

The Gap: An L1 cache hit (1-4 cycles) is roughly 100x faster than a trip to main memory (100-300 cycles)!
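To put a number on the gap, here is a rough back-of-the-envelope sketch. It assumes a simplified two-level model (L1 hit or main-memory miss, ignoring L2 and L3) and picks 4 and 200 cycles from the ranges in the table; the function name and default values are illustrative, not measurements.

```python
def average_latency_cycles(hit_rate, hit_cycles=4, miss_cycles=200):
    """Expected cycles per access in a simplified two-level model.

    hit_rate is a fraction between 0.0 and 1.0. The default latencies are
    assumed values taken from the ranges in the table, not measured ones.
    """
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


if __name__ == "__main__":
    for rate in (1.0, 0.875, 0.0):
        print(f"hit rate {rate:.1%}: {average_latency_cycles(rate)} cycles per access")
```

Even a modest miss rate dominates the average: under these assumptions, missing on one access in eight raises the expected cost from 4 cycles to 28.5.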

Cache Lines: The Unit of Transfer

  • Memory moves between levels in whole cache lines, typically 64 bytes
  • Loading one byte loads the entire 64-byte line
  • Spatial locality determines whether those 64 bytes are useful
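The points above can be sketched with a little address arithmetic. This is a minimal model, assuming a 64-byte line, a line-aligned base address of 0, and elements that never straddle a line; the function names are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # line size stated above; typical, but hardware-dependent


def line_index(address):
    """Return the index of the cache line that holds a byte address."""
    return address // CACHE_LINE_BYTES


def lines_touched(num_elements, element_bytes, stride_elements):
    """Count distinct cache lines touched when reading num_elements values.

    Element i is assumed to live at byte offset
    i * stride_elements * element_bytes from a line-aligned base of 0.
    """
    step = stride_elements * element_bytes
    return len({line_index(i * step) for i in range(num_elements)})


def line_utilization(num_elements, element_bytes, stride_elements):
    """Fraction of the fetched bytes that the loop actually reads."""
    useful = num_elements * element_bytes
    fetched = lines_touched(num_elements, element_bytes, stride_elements) * CACHE_LINE_BYTES
    return useful / fetched


if __name__ == "__main__":
    # 1024 eight-byte values, read back to back and then with a stride of 8.
    print(lines_touched(1024, 8, 1), line_utilization(1024, 8, 1))
    print(lines_touched(1024, 8, 8), line_utilization(1024, 8, 8))
```

Reading the values back to back touches 128 lines and uses every fetched byte. With a stride of 8 elements the same 1024 reads touch 1024 lines and use only one eighth of what was fetched.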

Sequential vs Strided Access

The same loop body can run 10x slower purely because of how it walks memory. Sequential access uses every byte the cache fetches; strided access throws most of it away.

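| Aspect           | Sequential (optimal)                                                    | Strided (suboptimal)                                |
| ---------------- | ----------------------------------------------------------------------- | --------------------------------------------------- |
| Pattern          | Consecutive memory locations                                            | Fixed-stride jumps through memory                   |
| Spatial locality | Uses all 64 bytes of each cache line                                    | Loads 64 bytes, uses only a few                     |
| Cache hit rate   | ~87.5% for 8-byte elements without prefetching (7 hits per 8 accesses)  | Collapses; often a new line per access              |
| Prefetcher       | Pattern is predicted; data loaded ahead                                 | Large strides defeat prediction                     |
| Bandwidth        | Every byte transferred is used                                          | Wastes up to 87.5% (stride of 8 with 8-byte elements) |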
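The hit-rate figures in the comparison can be reproduced with a toy cache model. The sketch below assumes 8-byte elements, a small fully associative LRU cache of 512 lines, and no prefetcher; real caches are set-associative and prefetch ahead, so treat the numbers as the demand-fetch baseline rather than what a real CPU would report.

```python
from collections import OrderedDict

LINE_BYTES = 64


def hit_rate(addresses, cache_lines=512):
    """Replay byte addresses through a tiny fully associative LRU cache.

    Returns hits / total accesses. No prefetcher is modelled.
    """
    cache = OrderedDict()  # line index -> None, ordered oldest to newest
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // LINE_BYTES
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = None             # miss: fetch the whole line
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total if total else 0.0


ELEM = 8      # bytes per element (assumed, e.g. a 64-bit float)
COUNT = 4096  # number of accesses in each run

sequential = [i * ELEM for i in range(COUNT)]
strided = [i * 8 * ELEM for i in range(COUNT)]  # stride of 8 elements = 64 bytes

if __name__ == "__main__":
    print(f"sequential: {hit_rate(sequential):.1%}")
    print(f"stride-8:   {hit_rate(strided):.1%}")
```

In this model the sequential walk misses once per line and hits on the other seven elements, giving 87.5%. The stride-8 walk lands on a new line every time, so it never hits.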

Common Patterns

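| Scenario                     | Sequential-friendly layout                          | Strided / poor layout                                        |
| ---------------------------- | --------------------------------------------------- | ------------------------------------------------------------ |
| Matrix traversal             | Row by row over a row-major array: consecutive in memory | Column by column over a row-major array: strided by row width |
| Struct access (single field) | Struct of Arrays (SoA): sequential per field        | Array of Structs (AoS): strided by struct size               |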
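Both rows of the table come down to the byte offset of each access. The following sketch computes those offsets; it assumes 8-byte elements, and the `x`, `y`, `z`, `mass` record is a hypothetical example rather than anything from a real codebase.

```python
ELEM_BYTES = 8  # assumed element size, e.g. a 64-bit float


def row_major_offset(row, col, num_cols):
    """Byte offset of element (row, col) in a row-major matrix."""
    return (row * num_cols + col) * ELEM_BYTES


def traversal_offsets(num_rows, num_cols, row_order):
    """Byte offsets in visit order for a row-major matrix.

    row_order=True walks each row left to right; False walks each column
    top to bottom.
    """
    if row_order:
        cells = [(r, c) for r in range(num_rows) for c in range(num_cols)]
    else:
        cells = [(r, c) for c in range(num_cols) for r in range(num_rows)]
    return [row_major_offset(r, c, num_cols) for r, c in cells]


FIELDS = ("x", "y", "z", "mass")  # hypothetical particle record, 8 bytes per field


def aos_offset(index, field):
    """Array of Structs: the fields of one record sit next to each other."""
    return (index * len(FIELDS) + FIELDS.index(field)) * ELEM_BYTES


def soa_offset(index, field, count):
    """Struct of Arrays: each field has its own contiguous array of count items."""
    return (FIELDS.index(field) * count + index) * ELEM_BYTES


if __name__ == "__main__":
    print(traversal_offsets(4, 1024, row_order=True)[:4])   # [0, 8, 16, 24]
    print(traversal_offsets(4, 1024, row_order=False)[:4])  # [0, 8192, 16384, 24576]
    print([aos_offset(i, "x") for i in range(4)])           # [0, 32, 64, 96]
    print([soa_offset(i, "x", 1000) for i in range(4)])     # [0, 8, 16, 24]
```

Walking a 1024-column matrix by column jumps 8192 bytes per access, so every read lands on a different cache line. Reading only `x` from the AoS layout jumps 32 bytes, which uses one quarter of each line; the SoA layout brings that back to a dense 8-byte step.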

Hardware Prefetching

Modern CPUs include sophisticated prefetchers:

What They Do:

  1. Detect access patterns (sequential, stride, stream)
  2. Load data into cache before it's needed
  3. Operate at multiple cache levels (L1, L2, L3)
  4. Adapt as they learn new patterns

| Prefetcher-friendly             | Prefetcher-unfriendly   |
| ------------------------------- | ----------------------- |
| Sequential access               | Random access           |
| Fixed stride (if not too large) | Large irregular strides |
| Stream processing               | Pointer chasing         |
| Linear traversal                | Hash-table lookups      |
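To see why the left column is friendly and the right one is not, here is a toy stride detector. It is only a sketch: real prefetchers are proprietary and far more elaborate, and both the "same delta twice in a row" rule and the 2048-byte `max_stride_bytes` cutoff are assumptions chosen for illustration, not documented hardware behavior.

```python
import random


def prefetch_coverage(addresses, max_stride_bytes=2048):
    """Fraction of accesses that a toy stride prefetcher predicted correctly.

    The model becomes confident once it sees the same nonzero address delta
    twice in a row, then predicts current address + delta. Deltas larger than
    max_stride_bytes (an arbitrary, assumed cutoff) are never predicted.
    """
    covered = 0
    prediction = None  # address the model expects next, if it is confident
    last_delta = None
    prev = None
    for addr in addresses:
        if addr == prediction:
            covered += 1
        if prev is not None:
            delta = addr - prev
            confident = (
                delta != 0
                and delta == last_delta
                and abs(delta) <= max_stride_bytes
            )
            prediction = addr + delta if confident else None
            last_delta = delta
        prev = addr
    return covered / len(addresses) if addresses else 0.0


sequential = [i * 8 for i in range(1000)]
small_stride = [i * 256 for i in range(1000)]
large_stride = [i * 4096 for i in range(1000)]
pointer_chase = [i * 64 for i in range(1000)]
random.Random(0).shuffle(pointer_chase)  # seeded so the run is repeatable

if __name__ == "__main__":
    for name, trace in [
        ("sequential", sequential),
        ("stride 256 B", small_stride),
        ("stride 4096 B", large_stride),
        ("pointer chasing", pointer_chase),
    ]:
        print(f"{name:16s} {prefetch_coverage(trace):.1%}")
```

In this model the sequential and small-stride traces are predicted for all but the first three accesses, while the large stride is rejected by the cutoff and the shuffled trace almost never repeats a delta.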

Optimization Strategies

| Area                  | Techniques                                                                                          |
| --------------------- | --------------------------------------------------------------------------------------------------- |
| Data structure design | Contiguous arrays; SoA for partial field access; align critical data to cache-line boundaries       |
| Algorithm design      | Cache-friendly traversal order; block/tile matrix algorithms; minimize working-set size             |
| Loop optimization     | Interchange loops for sequential access; tile/block for locality; manual prefetch for irregular patterns |
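As one example of tiling, here is a sketch of a blocked matrix transpose next to the naive version. It shows the loop structure only: Python lists store references rather than packed values, so the cache benefit would show up in a compiled language or a packed array, and the tile size of 8 is an assumed value that would need tuning.

```python
def transpose_naive(src, n):
    """Transpose an n x n row-major matrix stored as a flat list."""
    dst = [0] * (n * n)
    for r in range(n):
        for c in range(n):
            # Reads walk src sequentially; writes jump n elements each time.
            dst[c * n + r] = src[r * n + c]
    return dst


def transpose_tiled(src, n, tile=8):
    """Same result, but processed in tile x tile blocks.

    Within one block both the reads and the writes stay inside a small set
    of rows, so the lines they touch can be reused before being evicted.
    """
    dst = [0] * (n * n)
    for r0 in range(0, n, tile):
        for c0 in range(0, n, tile):
            for r in range(r0, min(r0 + tile, n)):
                for c in range(c0, min(c0 + tile, n)):
                    dst[c * n + r] = src[r * n + c]
    return dst


if __name__ == "__main__":
    n = 10
    matrix = list(range(n * n))
    assert transpose_tiled(matrix, n, tile=4) == transpose_naive(matrix, n)
    print("tiled and naive transposes agree")
```

The `min(...)` bounds handle matrices whose size is not a multiple of the tile, which is why the check above uses a 10 x 10 matrix with a tile of 4.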

Measuring Performance

Key Metrics

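| Metric                 | Definition                           |
| ---------------------- | ------------------------------------ |
| Cache hit rate         | Hits / Total accesses × 100          |
| Memory bandwidth       | Bytes transferred per second         |
| Cache-line utilization | Useful bytes / 64 bytes              |
| Prefetch accuracy      | Useful prefetches / total prefetches |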
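The definitions above translate directly into arithmetic on raw counters. The helper names and the sample numbers below are hypothetical; in practice the inputs would come from a profiler's load, miss, and prefetch counters.

```python
def cache_hit_rate(hits, total_accesses):
    """Hit rate as a percentage: hits / total accesses x 100."""
    return 100.0 * hits / total_accesses


def hit_rate_from_misses(loads, load_misses):
    """Same metric when the tool reports loads and misses instead of hits."""
    return cache_hit_rate(loads - load_misses, loads)


def memory_bandwidth_gb_s(bytes_transferred, seconds):
    """Bytes transferred per second, expressed in GB/s (10**9 bytes)."""
    return bytes_transferred / seconds / 1e9


def cache_line_utilization(useful_bytes, lines_fetched, line_bytes=64):
    """Useful bytes divided by the bytes actually fetched."""
    return useful_bytes / (lines_fetched * line_bytes)


def prefetch_accuracy(useful_prefetches, total_prefetches):
    """Share of issued prefetches that were used before eviction."""
    return useful_prefetches / total_prefetches


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    print(hit_rate_from_misses(loads=1_000_000, load_misses=125_000))
    print(memory_bandwidth_gb_s(bytes_transferred=8_000_000_000, seconds=0.5))
    print(cache_line_utilization(useful_bytes=8_192, lines_fetched=1_024))
    print(prefetch_accuracy(useful_prefetches=900, total_prefetches=1_000))
```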

Tools

  • Linux perf: perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program
  • Intel VTune: vtune -collect memory-access ./program

Best Practices

  1. Design for Sequential Access: Arrange data structures for linear traversal
  2. Minimize Stride: Keep related data close together
  3. Use Cache-Aware Algorithms: Block matrix multiply, tiled convolution
  4. Profile Real Workloads: Memory patterns vary by input
  5. Consider NUMA Effects: On multi-socket systems, memory attached to a remote node is slower to reach, so where data is placed matters as well as how it is traversed

Conclusion

Memory access patterns can make or break performance. Sequential access benefits from spatial locality, full use of every cache line, and hardware prefetching. Strided access wastes bandwidth, lowers cache hit rates, and with large strides defeats the prefetcher. Choosing data layouts and traversal orders with this in mind can yield 10x or greater speedups without changing the algorithm.
