GPU Memory Hierarchy & Optimization

Interactive GPU Architecture Explorer

Explore modern GPU architecture with interactive 3D visualization, memory access patterns, and kernel configuration:

GPU Memory Hierarchy Overview

Modern GPUs feature a complex memory hierarchy designed to maximize throughput for parallel workloads. Understanding this hierarchy is crucial for achieving peak performance in GPU applications. One powerful technique for reducing memory traffic across these levels is kernel fusion, which combines multiple operations into a single GPU kernel to avoid redundant reads and writes.

Memory Types and Characteristics

1. Registers (Fastest, Private)

Location: On-chip, per-thread
Size: 256 KB per SM (65,536 × 32-bit)
Latency: 0 cycles (immediate)
Bandwidth: ~8 TB/s
Scope: Private to each thread

Key Points:

Fastest storage in GPU
Compiler-managed allocation
Spilling to local memory impacts performance
Limited to 255 registers per thread

2. Shared Memory (Fast, Shared within Block)

Location: On-chip, per SM
Size: 48-228 KB per SM (configurable)
Latency: ~20-30 cycles
Bandwidth: ~4 TB/s
Scope: Shared within thread block

Optimization Strategies:

Use for data reuse within blocks
Avoid bank conflicts (32 banks, 4-byte width)
Implement tiling algorithms
Coordinate thread access patterns

3. L1 Cache (Fast, Automatic)

Location: On-chip, per SM
Size: 128 KB (combined with shared memory)
Latency: ~30-40 cycles
Bandwidth: ~4 TB/s
Scope: Per SM, caches global/local memory

Configuration Options:

// Configure L1/Shared split
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);  // More shared memory
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);      // More L1 cache
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferEqual);   // Balanced

4. L2 Cache (Medium, GPU-wide)

Location: On-chip, shared across GPU
Size: 40-60 MB (A100: 40 MB, H100: 50 MB)
Latency: ~200 cycles
Bandwidth: ~2-3 TB/s
Scope: All SMs, coherent

Features:

Caches all memory accesses
Persistent across kernel launches
Supports atomic operations
Hardware-managed coherency

5. Global Memory (Slow, Large)

Location: Off-chip DRAM/HBM
Size: 24-80 GB (HBM2e/HBM3)
Latency: ~400-600 cycles
Bandwidth: 1-2 TB/s
Scope: Entire GPU and host

Memory Coalescing Patterns

Coalesced Access (Optimal)

// Perfect coalescing - consecutive threads access consecutive addresses
__global__ void coalesced_access(float* data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[tid];  // Thread 0→data[0], Thread 1→data[1], etc.
}

Strided Access (Poor)

// Strided pattern - wastes bandwidth
__global__ void strided_access(float* data, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[tid * stride];  // Non-consecutive access
}

Random Access (Worst)

// Random pattern - serialized transactions
__global__ void random_access(float* data, int* indices) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[indices[tid]];  // Unpredictable pattern
}

Shared Memory Optimization

Bank Conflict Resolution

// Bank conflict example - all threads access same bank
__shared__ float shared[32];
float value = shared[threadIdx.x % 4];  // 8-way bank conflict!

// Conflict-free access
__shared__ float shared[33];  // Padding prevents conflicts
float value = shared[threadIdx.x];  // Each thread → different bank

Matrix Transpose with Shared Memory

#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void transpose_coalesced(float *odata, float *idata, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read from global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < width && (y + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];
        }
    }

    __syncthreads();

    // Transpose indices for write
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced write to global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < height && (y + j) < width) {
            odata[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
        }
    }
}

Memory Access Optimization Techniques

1. Texture Memory (Spatial Locality)

// Texture memory for 2D spatial locality
texture<float, cudaTextureType2D> texRef;

__global__ void texture_kernel(float* output, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Hardware-optimized 2D cache
    output[y * width + x] = tex2D(texRef, x, y);
}

2. Constant Memory (Broadcast)

__constant__ float const_data[1024];  // 64 KB constant memory

__global__ void const_kernel(float* output) {
    int tid = threadIdx.x;
    // All threads read same value - broadcast optimization
    output[tid] = const_data[0] * tid;
}

3. Unified Memory (Simplified Programming)

// Automatic migration between host and device
float *data;
cudaMallocManaged(&data, size);

// Access from both CPU and GPU
data[0] = 1.0f;  // CPU write
kernel<<<grid, block>>>(data);  // GPU access
cudaDeviceSynchronize();
float result = data[0];  // CPU read

Performance Analysis Tools

Memory Bandwidth Calculation

Effective Bandwidth = (Bytes Transferred / Time) × Occupancy
Efficiency = (Effective Bandwidth / Theoretical Bandwidth) × 100%

Profiling Metrics

# NSight Compute memory metrics
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,
               l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum,
               smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,
               smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct
               ./application

Memory Optimization Checklist

Design Phase

Identify memory access patterns
Calculate memory bandwidth requirements
Plan data layout for coalescing
Determine shared memory usage

Implementation Phase

Implement coalesced access patterns
Use shared memory for data reuse
Avoid bank conflicts with padding
Minimize register pressure
Use appropriate memory types (texture, constant)

Optimization Phase

Profile memory transactions
Analyze bandwidth utilization
Identify and fix inefficient patterns
Tune L1/shared memory configuration
Consider memory prefetching

Advanced Memory Patterns

Warp-level Memory Operations

// Warp shuffle for register-level data exchange
__global__ void warp_reduce(float* data) {
    float value = data[threadIdx.x];

    // Warp-level reduction without shared memory
    for (int mask = 16; mask > 0; mask >>= 1) {
        value += __shfl_xor_sync(0xffffffff, value, mask);
    }

    if (threadIdx.x % 32 == 0) {
        data[threadIdx.x / 32] = value;
    }
}

Memory-level Parallelism

// Increase memory-level parallelism with unrolling
__global__ void mlp_kernel(float* __restrict__ out,
                           const float* __restrict__ in,
                           int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Process multiple elements per thread
    float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    for (int i = tid; i < n/4; i += stride) {
        // Issue multiple loads simultaneously
        float4 val = reinterpret_cast<const float4*>(in)[i];
        sum0 += val.x;
        sum1 += val.y;
        sum2 += val.z;
        sum3 += val.w;
    }

    out[tid] = sum0 + sum1 + sum2 + sum3;
}

Best Practices Summary

Do's ✅

Coalesce global memory accesses
Use shared memory for data reuse
Pad shared memory arrays to avoid bank conflicts
Batch small transfers into larger ones
Use vector loads (float2, float4) when possible
Profile and measure actual bandwidth

Don'ts ❌

Random memory access patterns
Excessive register usage causing spills
Unaligned memory accesses
Frequent host-device transfers
Ignoring memory hierarchy
Assuming CPU optimization strategies work on GPU

Performance Impact

Bandwidth Utilization by Pattern

Access Pattern	Efficiency	Transactions	Performance
Coalesced	100%	1	Optimal
Sequential Misaligned	80-100%	1-2	Good
Strided (stride=2)	50%	2	Poor
Strided (stride=32)	3%	32	Very Poor
Random	3-10%	32	Worst

Conclusion

GPU memory hierarchy optimization is essential for achieving peak performance. Focus on:

Coalesced access patterns for global memory
Shared memory for data reuse and communication
Avoiding bank conflicts through padding
Proper occupancy to hide memory latency
Profiling and measurement to validate optimizations

Master these concepts to unlock the full potential of GPU computing and achieve order-of-magnitude performance improvements in your applications.

GPU & High-Performance Computing

GPU Streaming Multiprocessor (SM)

Deep dive into the fundamental processing unit of modern GPUs - the Streaming Multiprocessor architecture, execution model, and memory hierarchy

Language & Framework Internals

Pinned Memory and DMA Transfers in PyTorch

Complete guide to PyTorch pin_memory — how DMA transfers work, when pinning helps vs hurts, NUMA effects, profiling with torch.profiler, num_workers interaction, and debugging slow data loading.

Systems & Architecture

SoA vs AoS: Data Layout Optimization

Master Structure of Arrays (SoA) vs Array of Structures (AoS) data layouts for optimal cache efficiency, SIMD vectorization, and GPU memory coalescing.

GPU & High-Performance Computing

CUDA Context vs Streams vs MPS: Process Isolation, Concurrency, and Multi-Tenancy

How CUDA contexts, streams, and MPS compare: a context is a per-process container of GPU state, a stream is an in-order queue inside a context, and MPS lets multiple processes share a single GPU concurrently. Three layers, three different problems.

GPU & High-Performance Computing

CUDA Multi-Process Service (MPS): GPU Sharing for Concurrent Workloads

Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.

GPU & High-Performance Computing

CUDA Streams: Asynchronous Execution and Concurrency

A CUDA stream is a queue of GPU operations that execute in order. Understanding streams is the difference between a GPU at 30% utilization and one running flat out — they are how kernels and memory copies overlap on real hardware.