GPU Streaming Multiprocessor (SM)

Deep dive into the fundamental processing unit of modern GPUs - the Streaming Multiprocessor architecture, execution model, and memory hierarchy

Why Streaming Multiprocessors Matter

If a GPU is a factory, the Streaming Multiprocessor (SM) is an individual workshop floor. A modern GPU like the A100 has 108 of these workshop floors, each capable of running hundreds of workers (threads) simultaneously. Understanding how a single SM operates is the key to writing GPU code that actually uses the hardware well, rather than leaving most of it idle.

The SM is where every CUDA thread ultimately executes. Every performance decision you make -- block size, shared memory usage, register pressure -- plays out inside the SM. Getting it right can mean the difference between 10% and 90% hardware utilization.

The Classroom Analogy

Think of an SM as a large classroom with shared resources:

  • The students are individual threads. There can be up to 1,536 of them in the room at once (on modern architectures), but they do not all work independently.
  • Lab tables of 32 are warps. Every group of 32 students must perform the same activity at the same time. If some students in a group need to do something different (a branch), the entire table waits while subgroups take turns. This is called warp divergence, and it is one of the most common performance pitfalls on GPUs.
  • The whiteboard is shared memory -- a small, fast scratchpad visible to everyone in the classroom. Students can write intermediate results there for their classmates to read, enabling cooperation on problems too large for any one student.
  • Personal notebooks are registers -- the fastest storage, private to each student. Each student gets a fixed number of notebook pages. If a student needs more than their allotment, the overflow spills to a slow storage closet (local memory), dragging down performance.
  • The teacher is the warp scheduler. Rather than waiting when one lab table is stuck (say, waiting for data from main memory), the teacher instantly switches attention to another table that is ready to work. This zero-cost context switching is how GPUs hide memory latency -- not by making memory faster, but by always having other work ready to fill the gap.

Core Components of an SM

CUDA Cores

CUDA cores are the arithmetic workhorses. Each one handles a single floating-point or integer operation per clock cycle. A modern SM (Ampere architecture) packs 128 of these cores, giving it raw throughput that dwarfs any CPU core. However, individual CUDA cores are simple -- they have no branch predictor, no out-of-order execution, and no speculative logic. Their power comes entirely from quantity and parallelism.

Tensor Cores

Tensor Cores are specialized matrix engines designed for deep learning. Instead of multiplying numbers one at a time, a Tensor Core performs an entire 4x4 matrix multiply-and-accumulate (D = A x B + C) in a single operation. This yields roughly 8x the throughput of CUDA cores for matrix work. They support mixed precision formats (FP16, BF16, TF32, INT8, INT4), which is why modern training and inference workloads run dramatically faster on Tensor Core-equipped GPUs.

RT Cores

Introduced with the Turing architecture, RT Cores accelerate ray tracing by handling the two most expensive operations in hardware: traversing Bounding Volume Hierarchies (BVH) to find which objects a ray might hit, and computing exact ray-triangle intersections. Offloading this to dedicated silicon frees the CUDA cores to do shading and other work, achieving roughly 10x the ray tracing performance of a pure software approach.

Warp Schedulers

Each SM contains multiple warp schedulers (four in recent architectures). A scheduler picks a warp that has its operands ready and issues one or two instructions from it every clock cycle. Because switching between warps costs nothing -- all the state lives in the register file -- the scheduler can seamlessly interleave warps to keep the execution units busy even when some warps are waiting on memory.

The Memory Hierarchy

Understanding SM memory is essential because memory access is almost always the bottleneck. The hierarchy is designed around one principle: keep frequently used data as close to the compute units as possible.

| Memory Level | Size (per SM) | Latency | Scope | Purpose |
|---|---|---|---|---|
| Registers | 256 KB (65,536 × 32-bit) | ~0 cycles | Private per thread | Local variables, intermediates |
| Shared Memory | Up to 128 KB (configurable) | ~20 cycles | Shared within block | Inter-thread communication, data reuse |
| L1 Cache | 128 KB (combined with shared) | ~30 cycles | Per SM | Caches global memory accesses |
| L2 Cache | 40-60 MB | ~200 cycles | Entire GPU | Shared across all SMs |
| Global Memory (HBM) | 24-80 GB | ~500 cycles | Entire GPU | Main data store, 1-2 TB/s bandwidth |

The critical insight is the 500-cycle gap between registers and global memory. Every access to global memory that misses the caches costs roughly 500 clock cycles -- enough time for hundreds of arithmetic operations. This is why shared memory exists: it lets threads in a block collaboratively load a chunk of global memory once, then reuse it many times from the fast scratchpad.

The L1 cache and shared memory share the same 128 KB of physical SRAM, and you can configure the split. Workloads that rely on explicit data sharing benefit from more shared memory, while workloads with irregular access patterns benefit from a larger L1 cache.
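As a sketch of how that split can be requested in CUDA (the kernel here is illustrative), the runtime's `cudaFuncSetAttribute` with `cudaFuncAttributePreferredSharedMemoryCarveout` hints what percentage of the combined SRAM a kernel would like as shared memory:

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel with a 16 KB static shared-memory tile.
__global__ void tileKernel(float *out) {
    __shared__ float tile[4096];              // 4096 floats = 16 KB
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

int main() {
    // Hint: prefer the maximum shared-memory carveout for this kernel.
    // The value is a percentage; the driver rounds to a supported split.
    cudaError_t err = cudaFuncSetAttribute(
        tileKernel, cudaFuncAttributePreferredSharedMemoryCarveout, 100);
    assert(err == cudaSuccess);

    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    tileKernel<<<1, 256>>>(d_out);
    float h_out[256];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    assert(h_out[100] == 100.0f);
    cudaFree(d_out);
    printf("carveout hint applied\n");
    return 0;
}
```

Note the attribute is only a hint: the driver may pick the nearest supported configuration for the architecture it is running on.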

Warps: The Fundamental Execution Unit

A warp is a group of 32 threads that execute in lockstep -- the same instruction, applied to different data (SIMT: Single Instruction, Multiple Threads). This is how GPUs achieve massive parallelism without the complexity of independent instruction streams for every thread.

The thread hierarchy flows like this: individual threads are grouped into warps of 32, warps are grouped into blocks of up to 1,024 threads, and blocks are grouped into a grid that represents the entire kernel launch. A block executes entirely on one SM, so all threads in a block can cooperate through shared memory and synchronization barriers. Threads in different blocks cannot directly communicate during execution.
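The hierarchy maps directly onto CUDA's built-in index variables. A minimal sketch (kernel name and sizes are illustrative):

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread derives a unique global index from its block and thread IDs.
__global__ void writeGlobalIndex(int *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // position in the whole grid
    if (idx < n)              // guard: the last block may overhang the data
        out[idx] = idx;
}

int main() {
    const int n = 1000;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    int threadsPerBlock = 256;                                 // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(1000/256) = 4
    writeGlobalIndex<<<blocks, threadsPerBlock>>>(d_out, n);

    int h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    assert(h_out[0] == 0 && h_out[999] == 999);
    cudaFree(d_out);
    return 0;
}
```

The ceiling division ensures every element is covered, and the `idx < n` guard makes the 24 surplus threads in the last block do nothing.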

Warp divergence happens when threads within a warp take different paths at a branch. Since all 32 threads share one instruction pointer, the hardware must serialize the divergent paths -- first executing the "if" branch for threads that took it (while the others sit idle), then executing the "else" branch. In the worst case, a warp with 32-way divergence runs at 1/32 of peak throughput. Writing branch-free or warp-uniform code is one of the most important GPU optimization techniques.
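The contrast can be sketched with two kernels that compute the same result but branch differently. In the first, even and odd lanes of each warp disagree; in the second, the condition is constant within every warp, so nothing is serialized:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Divergent: even and odd lanes of the SAME warp branch differently, so the
// hardware runs each path in turn while the other half of the warp idles.
__global__ void divergentBranch(float *out) {
    int i = threadIdx.x;
    if (i % 2 == 0) out[i] = 1.0f;
    else            out[i] = 2.0f;
}

// Warp-uniform: the condition agrees across each warp (lanes 0-31 take one
// path, lanes 32-63 the other), so neither path is serialized.
__global__ void uniformBranch(float *out) {
    int i = threadIdx.x;
    if ((i / 32) % 2 == 0) out[i] = 1.0f;
    else                   out[i] = 2.0f;
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 64 * sizeof(float));
    float h_out[64];

    divergentBranch<<<1, 64>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    assert(h_out[0] == 1.0f && h_out[1] == 2.0f);

    uniformBranch<<<1, 64>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    assert(h_out[31] == 1.0f && h_out[32] == 2.0f);

    cudaFree(d_out);
    return 0;
}
```

Both kernels are correct; only the second keeps every lane busy on every cycle. Reorganizing data so that branch conditions align with warp boundaries is often all the fix that is needed.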

Occupancy: Balancing Resources

Occupancy measures how many warps are active on an SM compared to the maximum (typically 48 warps, or 1,536 threads, on recent consumer architectures; data-center parts like the A100 and H100 allow 64 warps, or 2,048 threads). Higher occupancy gives the warp scheduler more options to hide memory latency by switching between warps.

Three resources limit occupancy:

  • Registers per thread: Each thread can use up to 255 registers. A kernel using 64 registers per thread across a block of 256 threads consumes 16,384 registers -- a substantial fraction of the 65,536 available. Fewer registers per thread means more blocks can fit on the SM.
  • Shared memory per block: If each block claims 48 KB of shared memory, only two blocks can fit in 96 KB of shared memory space. Reducing shared memory usage allows more concurrent blocks.
  • Threads per block: The block size itself caps how warps are distributed. Very small blocks waste scheduling slots; very large blocks may exceed per-SM limits.
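You do not have to work out these limits by hand: the CUDA runtime's occupancy API reports how many blocks of a given kernel fit on one SM. A sketch (the kernel and its 16 KB shared-memory footprint are illustrative):

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel with a deliberately large (16 KB) shared-memory footprint.
__global__ void heavyKernel(float *out) {
    __shared__ float buf[4096];               // 16 KB per block
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

int main() {
    int blocksPerSM = 0;
    // Ask the runtime how many 256-thread blocks of this kernel fit on one SM,
    // given its register and shared-memory usage (0 = no dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, heavyKernel, 256, 0);
    assert(blocksPerSM >= 1);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = blocksPerSM * 256 / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %d/%d warps (%.0f%%)\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}
```

Running this for a few candidate block sizes is a quick way to see which resource (registers, shared memory, or thread count) is the binding constraint on your hardware.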

The optimal occupancy depends on the workload. Memory-bound kernels benefit from high occupancy (50-75%) because more warps mean more opportunities to hide latency. Compute-bound kernels can perform well at lower occupancy (25-50%) if each thread does enough arithmetic to keep the execution units saturated.

Performance Optimization Principles

Maximize useful parallelism

Launch enough threads to keep all SMs busy -- on a GPU with over 100 SMs, that means tens of thousands of threads at a minimum. Block sizes of 128 to 512 threads are common sweet spots that balance occupancy with per-thread resource usage.

Optimize memory access patterns

Coalesced access is the single most important memory optimization. When 32 threads in a warp access 32 consecutive 4-byte addresses, the hardware combines them into a single efficient transaction. Scattered or misaligned accesses break coalescing and can reduce effective bandwidth by 10x or more.
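The difference comes down to the index expression. A sketch of the two access patterns (both kernels are functionally correct; only their memory traffic differs):

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so a warp's 32 loads fall on
// consecutive addresses and merge into a single wide transaction.
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i*stride; each lane of a warp touches a
// different memory segment, multiplying the number of transactions.
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}

int main() {
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    coalescedCopy<<<n / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(h_out[500] == 500.0f);

    stridedCopy<<<n / 256, 256>>>(d_in, d_out, n, 32);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(h_out[3] == 96.0f);   // element 3 * stride 32

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

When data naturally lives in a strided layout (e.g. an array of structs), the usual remedy is to transpose it to a struct-of-arrays layout so the hot field becomes contiguous.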

Use shared memory as a programmer-managed cache. Load a tile of data from global memory into shared memory cooperatively (each thread loads one element), synchronize, then have all threads compute from the shared copy. This pattern -- called tiling -- is the foundation of high-performance GPU kernels for matrix multiplication, convolutions, and stencil computations.
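A minimal sketch of the tiling pattern for matrix multiplication, assuming the matrix dimension is a multiple of the tile size (production kernels add bounds checks and further optimizations):

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define TILE 16

// Each block cooperatively stages one TILE x TILE tile of A and B through
// shared memory per iteration, so every global value is loaded once per
// block instead of once per thread. One thread computes one C element.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                      // wait until the tile is loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // finish reading before next load
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 32;
    float h_A[n * n] = {0}, h_B[n * n], h_C[n * n];
    for (int i = 0; i < n; ++i) h_A[i * n + i] = 1.0f;   // A = identity
    for (int i = 0; i < n * n; ++i) h_B[i] = (float)i;

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMalloc(&d_B, sizeof(h_B));
    cudaMalloc(&d_C, sizeof(h_C));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, sizeof(h_B), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    tiledMatMul<<<grid, block>>>(d_A, d_B, d_C, n);
    cudaMemcpy(h_C, d_C, sizeof(h_C), cudaMemcpyDeviceToHost);
    assert(h_C[5 * n + 7] == h_B[5 * n + 7]);  // identity * B == B
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```

Each element of A and B is read from global memory n/TILE times instead of n times, cutting global traffic by a factor of TILE.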

Minimize warp divergence

Structure conditionals so that all threads in a warp take the same path when possible. When branching is unavoidable, organize work so that divergence happens across warps rather than within them.

Balance the instruction mix

The most efficient kernels overlap memory operations with computation. While one warp waits for data to arrive from global memory, the scheduler runs arithmetic instructions from other warps. Profiling tools like NVIDIA Nsight Compute can show whether a kernel is memory-bound, compute-bound, or latency-bound, guiding optimization strategy.

Evolution Across GPU Generations

The SM has evolved significantly while maintaining its core SIMT execution model:

| Generation | Year | CUDA Cores/SM | Key Innovation |
|---|---|---|---|
| Kepler | 2012 | 192 | Dynamic parallelism (kernels launching kernels) |
| Maxwell | 2014 | 128 | Major power efficiency improvements |
| Pascal | 2016 | 64 | HBM2 memory, unified memory improvements |
| Volta | 2017 | 64 | First Tensor Cores, independent thread scheduling |
| Turing | 2018 | 64 | RT Cores for ray tracing, concurrent FP32+INT32 |
| Ampere | 2020 | 128 | 2x FP32 throughput, 3rd-gen Tensor Cores with sparsity |
| Ada Lovelace | 2022 | 128 | FP8 Tensor Cores, Shader Execution Reordering |
| Hopper | 2022 | 128 | Transformer Engine, Thread Block Clusters, 228 KB shared memory |

The trend is clear: each generation adds more specialized hardware (Tensor Cores, RT Cores, Transformer Engines) while increasing shared memory capacity and improving the programmability of the core SIMT model. The SM has grown from a simple SIMD processor into a heterogeneous compute engine optimized for the dominant workloads of its era.

Key Takeaways

  1. The SM is the GPU's fundamental building block. A GPU is an array of SMs, and each SM is a self-contained parallel processor with its own execution units, memory, and schedulers.

  2. Warps of 32 threads execute in lockstep. Divergence within a warp serializes execution. Keeping threads uniform is critical for performance.

  3. Memory latency is hidden by parallelism, not speed. The warp scheduler switches to ready warps while others wait on memory -- but this only works if there are enough active warps (sufficient occupancy).

  4. Shared memory is the programmer's secret weapon. It bridges the 500-cycle gap between registers and global memory, enabling cooperative data reuse within a thread block.

  5. Occupancy is a balancing act. Register usage, shared memory allocation, and block size all compete for the same finite SM resources. Profiling is essential to find the right balance.

  6. Coalesced memory access is non-negotiable. Scattered global memory reads can reduce effective bandwidth by an order of magnitude. Always design data structures and access patterns with coalescing in mind.
