Skip to main content

GPU Memory Hierarchy & Optimization

Summary
Master GPU memory hierarchy from registers to global memory, understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance

Interactive GPU Architecture Explorer

Explore modern GPU architecture with interactive 3D visualization, memory access patterns, and kernel configuration:

GPU Memory Hierarchy Overview

Modern GPUs feature a complex memory hierarchy designed to maximize throughput for parallel workloads. Understanding this hierarchy is crucial for achieving peak performance in GPU applications. One powerful technique for reducing memory traffic across these levels is kernel fusion, which combines multiple operations into a single GPU kernel to avoid redundant reads and writes.

Memory Types and Characteristics

1. Registers (Fastest, Private)

Location: On-chip, per-thread Size: 256 KB per SM (65,536 × 32-bit) Latency: 0 cycles (immediate) Bandwidth: ~8 TB/s Scope: Private to each thread

Key Points:

  • Fastest storage in GPU
  • Compiler-managed allocation
  • Spilling to local memory impacts performance
  • Limited to 255 registers per thread

2. Shared Memory (Fast, Shared within Block)

Location: On-chip, per SM Size: 48-228 KB per SM (configurable) Latency: ~20-30 cycles Bandwidth: ~4 TB/s Scope: Shared within thread block

Optimization Strategies:

  • Use for data reuse within blocks
  • Avoid bank conflicts (32 banks, 4-byte width)
  • Implement tiling algorithms
  • Coordinate thread access patterns

3. L1 Cache (Fast, Automatic)

Location: On-chip, per SM Size: 128 KB (combined with shared memory) Latency: ~30-40 cycles Bandwidth: ~4 TB/s Scope: Per SM, caches global/local memory

Configuration Options:

// Configure L1/Shared split cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared); // More shared memory cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1); // More L1 cache cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferEqual); // Balanced

4. L2 Cache (Medium, GPU-wide)

Location: On-chip, shared across GPU Size: 40-60 MB (A100: 40 MB, H100: 50 MB) Latency: ~200 cycles Bandwidth: ~2-3 TB/s Scope: All SMs, coherent

Features:

  • Caches all memory accesses
  • Persistent across kernel launches
  • Supports atomic operations
  • Hardware-managed coherency

5. Global Memory (Slow, Large)

Location: Off-chip DRAM/HBM Size: 24-80 GB (HBM2e/HBM3) Latency: ~400-600 cycles Bandwidth: 1-2 TB/s Scope: Entire GPU and host

Memory Coalescing Patterns

Coalesced Access (Optimal)

// Perfect coalescing - consecutive threads access consecutive addresses __global__ void coalesced_access(float* data) { int tid = blockIdx.x * blockDim.x + threadIdx.x; float value = data[tid]; // Thread 0→data[0], Thread 1→data[1], etc. }

Strided Access (Poor)

// Strided pattern - wastes bandwidth __global__ void strided_access(float* data, int stride) { int tid = blockIdx.x * blockDim.x + threadIdx.x; float value = data[tid * stride]; // Non-consecutive access }

Random Access (Worst)

// Random pattern - serialized transactions __global__ void random_access(float* data, int* indices) { int tid = blockIdx.x * blockDim.x + threadIdx.x; float value = data[indices[tid]]; // Unpredictable pattern }

Shared Memory Optimization

Bank Conflict Resolution

// Bank conflict example - all threads access same bank __shared__ float shared[32]; float value = shared[threadIdx.x % 4]; // 8-way bank conflict! // Conflict-free access __shared__ float shared[33]; // Padding prevents conflicts float value = shared[threadIdx.x]; // Each thread → different bank

Matrix Transpose with Shared Memory

#define TILE_DIM 32 #define BLOCK_ROWS 8 __global__ void transpose_coalesced(float *odata, float *idata, int width, int height) { __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 padding avoids bank conflicts int x = blockIdx.x * TILE_DIM + threadIdx.x; int y = blockIdx.y * TILE_DIM + threadIdx.y; // Coalesced read from global memory for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) { if (x < width && (y + j) < height) { tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x]; } } __syncthreads(); // Transpose indices for write x = blockIdx.y * TILE_DIM + threadIdx.x; y = blockIdx.x * TILE_DIM + threadIdx.y; // Coalesced write to global memory for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) { if (x < height && (y + j) < width) { odata[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j]; } } }

Memory Access Optimization Techniques

1. Texture Memory (Spatial Locality)

// Texture memory for 2D spatial locality texture<float, cudaTextureType2D> texRef; __global__ void texture_kernel(float* output, int width, int height) { int x = blockIdx.x * blockDim.x + threadIdx.x; int y = blockIdx.y * blockDim.y + threadIdx.y; // Hardware-optimized 2D cache output[y * width + x] = tex2D(texRef, x, y); }

2. Constant Memory (Broadcast)

__constant__ float const_data[1024]; // 64 KB constant memory __global__ void const_kernel(float* output) { int tid = threadIdx.x; // All threads read same value - broadcast optimization output[tid] = const_data[0] * tid; }

3. Unified Memory (Simplified Programming)

// Automatic migration between host and device float *data; cudaMallocManaged(&data, size); // Access from both CPU and GPU data[0] = 1.0f; // CPU write kernel<<<grid, block>>>(data); // GPU access cudaDeviceSynchronize(); float result = data[0]; // CPU read

Performance Analysis Tools

Memory Bandwidth Calculation

Effective Bandwidth = (Bytes Transferred / Time) × Occupancy Efficiency = (Effective Bandwidth / Theoretical Bandwidth) × 100%

Profiling Metrics

# NSight Compute memory metrics ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum, l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum, smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct, smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct ./application

Memory Optimization Checklist

Design Phase

  • Identify memory access patterns
  • Calculate memory bandwidth requirements
  • Plan data layout for coalescing
  • Determine shared memory usage

Implementation Phase

  • Implement coalesced access patterns
  • Use shared memory for data reuse
  • Avoid bank conflicts with padding
  • Minimize register pressure
  • Use appropriate memory types (texture, constant)

Optimization Phase

  • Profile memory transactions
  • Analyze bandwidth utilization
  • Identify and fix inefficient patterns
  • Tune L1/shared memory configuration
  • Consider memory prefetching

Advanced Memory Patterns

Warp-level Memory Operations

// Warp shuffle for register-level data exchange __global__ void warp_reduce(float* data) { float value = data[threadIdx.x]; // Warp-level reduction without shared memory for (int mask = 16; mask > 0; mask >>= 1) { value += __shfl_xor_sync(0xffffffff, value, mask); } if (threadIdx.x % 32 == 0) { data[threadIdx.x / 32] = value; } }

Memory-level Parallelism

// Increase memory-level parallelism with unrolling __global__ void mlp_kernel(float* __restrict__ out, const float* __restrict__ in, int n) { int tid = blockIdx.x * blockDim.x + threadIdx.x; int stride = gridDim.x * blockDim.x; // Process multiple elements per thread float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0; for (int i = tid; i < n/4; i += stride) { // Issue multiple loads simultaneously float4 val = reinterpret_cast<const float4*>(in)[i]; sum0 += val.x; sum1 += val.y; sum2 += val.z; sum3 += val.w; } out[tid] = sum0 + sum1 + sum2 + sum3; }

Best Practices Summary

Do's ✅

  • Coalesce global memory accesses
  • Use shared memory for data reuse
  • Pad shared memory arrays to avoid bank conflicts
  • Batch small transfers into larger ones
  • Use vector loads (float2, float4) when possible
  • Profile and measure actual bandwidth

Don'ts ❌

  • Random memory access patterns
  • Excessive register usage causing spills
  • Unaligned memory accesses
  • Frequent host-device transfers
  • Ignoring memory hierarchy
  • Assuming CPU optimization strategies work on GPU

Performance Impact

Bandwidth Utilization by Pattern

Access PatternEfficiencyTransactionsPerformance
Coalesced100%1Optimal
Sequential Misaligned80-100%1-2Good
Strided (stride=2)50%2Poor
Strided (stride=32)3%32Very Poor
Random3-10%32Worst

Conclusion

GPU memory hierarchy optimization is essential for achieving peak performance. Focus on:

  1. Coalesced access patterns for global memory
  2. Shared memory for data reuse and communication
  3. Avoiding bank conflicts through padding
  4. Proper occupancy to hide memory latency
  5. Profiling and measurement to validate optimizations

Master these concepts to unlock the full potential of GPU computing and achieve order-of-magnitude performance improvements in your applications.

GPU & High-Performance Computing
GPU Streaming Multiprocessor (SM)

Deep dive into the fundamental processing unit of modern GPUs - the Streaming Multiprocessor architecture, execution model, and memory hierarchy

Language & Framework Internals
Pinned Memory and DMA Transfers in PyTorch

Complete guide to PyTorch pin_memory — how DMA transfers work, when pinning helps vs hurts, NUMA effects, profiling with torch.profiler, num_workers interaction, and debugging slow data loading.

Systems & Architecture
SoA vs AoS: Data Layout Optimization

Master Structure of Arrays (SoA) vs Array of Structures (AoS) data layouts for optimal cache efficiency, SIMD vectorization, and GPU memory coalescing.

GPU & High-Performance Computing
CUDA Context vs Streams vs MPS: Process Isolation, Concurrency, and Multi-Tenancy

How CUDA contexts, streams, and MPS compare: a context is a per-process container of GPU state, a stream is an in-order queue inside a context, and MPS lets multiple processes share a single GPU concurrently. Three layers, three different problems.

GPU & High-Performance Computing
CUDA Multi-Process Service (MPS): GPU Sharing for Concurrent Workloads

Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.

GPU & High-Performance Computing
CUDA Streams: Asynchronous Execution and Concurrency

A CUDA stream is a queue of GPU operations that execute in order. Understanding streams is the difference between a GPU at 30% utilization and one running flat out — they are how kernels and memory copies overlap on real hardware.

If you found this explanation helpful, consider sharing it with others.

Mastodon