Matrix multiplication is the single most important kernel in GPU computing.
Every linear layer, every convolution, every attention head in every transformer —
all of them reduce to a general matrix multiply (GEMM) at their core.
When you call torch.mm() or tf.matmul(), you’re invoking a kernel
that has been optimized for decades. NVIDIA’s cuBLAS library achieves over
90% of a GPU’s theoretical peak floating-point throughput on this one
operation. Understanding how it gets there is understanding how GPUs actually work.
The textbook algorithm is O(N³): computing the product of two N×N matrices requires 2N³ floating-point operations. GPUs should be perfect for this — thousands of cores running in parallel, hundreds of gigabytes per second of memory bandwidth. An A6000 GPU can deliver 23,270 GFLOP/s of single-precision throughput. But a naive CUDA implementation of matrix multiplication achieves only about 309 GFLOP/s — 1.3% of peak. The algorithm is correct. The arithmetic is identical. The entire gap is about how data moves through the memory hierarchy.
This article walks through six optimization stages that close that gap, from 1.3% to 93.7% of cuBLAS performance. Each stage targets a specific hardware bottleneck — global memory coalescing, shared memory reuse, register-level data reuse, vectorized loads — and each one comes with an interactive visualization so you can see exactly what changes and why it matters. By the end, you’ll have a concrete mental model for how high-performance GPU kernels are designed, and why the same principles apply to every operation in deep learning.
All performance numbers in this article are measured on an NVIDIA RTX A6000 GPU (Ampere architecture, GA102 chip) with 48 GB GDDR6 running CUDA 12.x. The matrices are 4096×4096 single-precision floats unless otherwise noted. The cuBLAS baseline for this configuration is 23,270 GFLOP/s.
The Naive Triple Loop
The textbook matrix multiplication algorithm is three nested loops. To compute C = A × B where A is M×K and B is K×N, each element of the output matrix C[i][j] is the dot product of row i of A and column j of B:
```c
for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < K; k++)
            C[i][j] += A[i][k] * B[k][j];
```
This requires exactly 2·M·N·K floating-point operations (one multiply and one add per iteration of the inner loop). For square matrices of size 4096, that’s 2 × 4096³ = 137.4 billion FLOPs. Each element of C requires K multiply-accumulate operations — a full dot product between a row of A and a column of B.
On a GPU, the naive approach maps naturally: assign each thread to compute one element of C. Thread (i, j) reads the entire row i of A and the entire column j of B, accumulates the dot product, and writes the result. With 4096 × 4096 = 16.7 million threads running across dozens of SMs, this seems like it should saturate the hardware. It doesn’t — not even close.
The CUDA kernel looks like this:
```cuda
__global__ void matmul_naive(float *A, float *B, float *C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```
Each thread loops through the entire K dimension, performing one multiply-add per iteration. The arithmetic is correct and the parallelism is massive. But the kernel achieves only 309 GFLOP/s — less than 2% of what this GPU can deliver.
The visualization above shows the access pattern for a single output element. To compute C[i][j], the thread must read K elements from row i of A and K elements from column j of B. When thousands of threads do this simultaneously, the total memory traffic is staggering: every thread independently reads the full K dimension from both A and B, with no data sharing between threads. The access pattern for B is especially wasteful — and the reason comes down to how matrices are stored in memory.
Why Naive is Cache-Hostile
A 2D matrix is stored in memory as a 1D array, laid out row by row. In row-major
order, element [i][j] of an N-column matrix lives at memory address
base + i*N + j. Consecutive elements within the same row are adjacent in
memory. Consecutive elements within the same column are N positions apart.
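To make the strides concrete, here is a small C++ sketch of the row-major index math. The helper names (`idx`, `row_step_bytes`, `col_step_bytes`) are illustrative, not from any library:

```cpp
#include <cstddef>
#include <cassert>

// Row-major flat index for element [i][j] of a matrix with `cols` columns.
inline size_t idx(size_t i, size_t j, size_t cols) { return i * cols + j; }

// Byte distance between consecutive elements along a row (adjacent)
// versus down a column (a full row apart), for a matrix of floats.
inline size_t row_step_bytes(size_t cols) {
    return sizeof(float) * (idx(0, 1, cols) - idx(0, 0, cols));
}
inline size_t col_step_bytes(size_t cols) {
    return sizeof(float) * (idx(1, 0, cols) - idx(0, 0, cols));
}
```

For N=4096, stepping along a row moves 4 bytes; stepping down a column jumps 16 KB — far past any cache line.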
This layout has a profound consequence for the naive algorithm. When a thread
reads row i of A, it accesses A[i][0], A[i][1], A[i][2], ... —
consecutive memory addresses that will be served from a single cache line fetch.
But when it reads column j of B, it accesses B[0][j], B[1][j], B[2][j], ...
— each element is N positions apart in memory. Every single access to B
misses the cache and fetches a new cache line, of which only one element
(4 bytes out of 64 or 128) is actually used. The remaining bytes in each cache
line are loaded and immediately discarded.
Consider a concrete example with N=4096. A cache line is 128 bytes (32 floats) on modern GPUs. When accessing column j of B:
- `B[0][j]` loads a cache line covering `B[0][j]` through `B[0][j+31]` — 32 elements from row 0. Only `B[0][j]` is needed; 31 elements are wasted.
- `B[1][j]` is 4096 positions away in memory — an entirely different cache line. Another 128-byte fetch for 4 useful bytes.
- Repeat for all 4096 rows: 4096 cache line fetches, each using 3.1% of the loaded data.
This is the fundamental problem: the naive algorithm has a compute-to-memory ratio of O(1). Each loaded value is used exactly once and then evicted. Every FLOP comes with roughly 8 bytes of slow-memory traffic in FP32, once operand loads and result stores are counted. That puts us firmly in the memory-bound regime, nowhere near the GPU’s compute capability.
To quantify: the A6000 has 768 GB/s of memory bandwidth. At 8 bytes of traffic per FLOP, the memory-bound ceiling is 768/8 = 96 GFLOP/s. The naive kernel exceeds that, at 309 GFLOP/s, only because the L2 cache absorbs some of the redundant reads. But we need to get to 23,270 GFLOP/s — that requires a 242× improvement in the compute-to-memory ratio.
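The bound above is just the roofline model. A minimal sketch (the helper `roofline_gflops` is hypothetical, not from any library):

```cpp
// Roofline bound: attainable throughput is the lesser of peak compute and
// what the memory system can feed, i.e. bandwidth (GB/s) * arithmetic
// intensity (FLOP/byte) = GFLOP/s.
double roofline_gflops(double peak_gflops, double bw_gbs, double flops_per_byte) {
    double mem_bound = bw_gbs * flops_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

With the article's numbers (768 GB/s, 1 FLOP per 8 bytes = 0.125 FLOP/byte), the memory-bound ceiling comes out to 96 GFLOP/s; only at high arithmetic intensity does the 23,270 GFLOP/s compute peak become the limit.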
The fix is tiling: divide the matrices into small blocks that fit in fast memory, load each block once, and reuse it for multiple output elements. If a tile is T×T, each loaded value gets reused T times instead of once. The compute-to-memory ratio rises from O(1) to O(T) — and that’s the difference between 1% and 90% of peak performance.
Tiled Matrix Multiplication: The Fix
The tiling algorithm divides A, B, and C into tiles of size T×T. Instead of computing each output element independently, we compute an entire T×T block of the output at once. The output tile C[ti][tj] is the sum of partial products across k-slices:
C[ti][tj] = sum over k of (A[ti][k] × B[k][tj])
For each k-slice, we load a T×T tile from A and a T×T tile from B into fast memory. Then we multiply these two small tiles together (a T×T matrix multiply using 2T³ FLOPs) and accumulate the result into the output tile. We repeat this for each k-slice until the entire dot product dimension has been covered.
The key insight: each tile of A and B contains T² elements and is loaded once from slow memory. But each element participates in T multiply-add operations during the tile multiply. So the data reuse factor is T — we get T times more compute per byte loaded.
Let’s make this concrete with T=32. For one output tile of C:
- Data loaded: 2 × 32² = 2,048 floats per k-step, times K/32 k-steps = 2,048 × 128 = 262,144 floats total for 4096×4096 matrices
- Compute: 2 × 32³ per k-step × 128 k-steps = 8.4 million FLOPs
- Compute-to-memory ratio: 8.4M / (262K × 4 bytes) = 8.0 FLOPs per byte — an 8× improvement over naive
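The loop structure behind these numbers can be sketched on the CPU side. This is an illustrative cache-blocked implementation (function names are hypothetical), checked against the textbook triple loop:

```cpp
#include <vector>
#include <algorithm>
#include <cassert>

// Reference: the textbook triple loop, row-major, C assumed zero-initialized.
void matmul_naive_cpu(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Tiled version: iterate over T×T blocks so each tile of A and B stays in
// fast memory while it is reused across T output rows/columns.
void matmul_tiled_cpu(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int n, int T) {
    for (int ti = 0; ti < n; ti += T)
        for (int tj = 0; tj < n; tj += T)
            for (int tk = 0; tk < n; tk += T)              // k-slice loop
                for (int i = ti; i < std::min(ti + T, n); i++)
                    for (int k = tk; k < std::min(tk + T, n); k++) {
                        float a = A[i * n + k];            // loaded once, reused T times
                        for (int j = tj; j < std::min(tj + T, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Both orderings accumulate over k in the same sequence, so the results match exactly, not just approximately.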
On a CPU, "fast memory" means L1 cache. With 32 KB of L1 cache, we can fit tiles up to about 64×64 in FP32 (two tiles of 64×64×4 bytes = 32 KB). On a GPU, the equivalent is shared memory — an explicitly managed on-chip scratchpad that sits right next to the compute units. The tiling principle is identical on both architectures. What changes is the memory hierarchy and the programming model for exploiting it.
The transition from CPU to GPU tiling introduces several new concerns: how threads cooperate to load tiles, how memory accesses are coalesced across warps, and how the barrier synchronization between load and compute phases is managed. These GPU-specific details are what make the next three sections critical.
Translation to GPU: Warps and Coalescing
Before diving into shared memory tiling on the GPU, we need to understand how GPU memory access works at the hardware level. The critical concept is memory coalescing.
A GPU organizes threads into warps of 32 threads that execute in lockstep. When threads in a warp issue memory requests, the hardware tries to combine (coalesce) their individual 4-byte requests into as few memory transactions as possible. If all 32 threads access 32 consecutive 4-byte addresses, the hardware can serve this as a single 128-byte transaction. If the 32 threads access scattered addresses, each thread’s request may become a separate transaction — up to 32 individual fetches for data that could have been served in one.
The coalescing rules are straightforward:
- Best case: thread i accesses address `base + i * 4` (32 consecutive floats). One 128-byte transaction serves all 32 threads.
- Offset case: thread i accesses address `base + i * 4 + offset`. If the addresses span two cache lines, two transactions are needed.
- Strided case: thread i accesses address `base + i * stride * 4`. If the stride is large (e.g., N=4096), each thread hits a different cache line: 32 transactions instead of 1 — a 32× bandwidth penalty.
- Random case: addresses are unrelated. Up to 32 separate transactions, each fetching 128 bytes for 4 useful bytes.
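These rules are easy to simulate: count how many distinct 128-byte segments a warp's 32 addresses touch. A sketch under the simplifying assumption that each distinct segment costs one transaction (helper names are illustrative):

```cpp
#include <set>
#include <vector>
#include <cstddef>
#include <cassert>

// One transaction per distinct 128-byte-aligned segment touched by the warp.
int warp_transactions(const std::vector<size_t>& byte_addrs) {
    std::set<size_t> segments;
    for (size_t a : byte_addrs) segments.insert(a / 128);
    return (int)segments.size();
}

// Byte address for thread i: base + i * stride_floats * sizeof(float).
std::vector<size_t> warp_addresses(size_t base, size_t stride_floats) {
    std::vector<size_t> addrs;
    for (size_t i = 0; i < 32; i++)
        addrs.push_back(base + i * stride_floats * 4);
    return addrs;
}
```

Contiguous aligned floats need one transaction; an offset base needs two; a stride of 4096 floats needs all 32.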
In the naive GPU matmul, each thread computes one element of C. Threads within a warp compute adjacent columns of the same row: thread 0 computes C[i][0], thread 1 computes C[i][1], and so on. When these threads load their values from A, they all read the same row — this is fine (it can be broadcast). But when they load from B, thread 0 reads B[k][0], thread 1 reads B[k][1], etc. In row-major storage, B[k][0] and B[k][1] are adjacent in memory — so loads from B are actually coalesced.
The problem is subtler than it appears. While B accesses coalesce, the total number of global memory transactions is still enormous because there is zero data reuse across threads. Every thread independently loads the entire k-dimension from both A and B. With M×N threads each loading K elements from A and K elements from B, the total global memory traffic is O(M·N·K) — proportional to the number of FLOPs rather than the matrix size. The memory bus becomes the bottleneck long before the compute units are saturated.
This is why the naive CUDA kernel achieves only 309 GFLOP/s on an A6000 — 1.3% of its 23,270 GFLOP/s peak. The arithmetic units sit idle, waiting for data to arrive from global memory. The solution is the same as on CPU: tile the computation so that data loaded from global memory is reused many times. On GPU, the fast reuse buffer is shared memory.
Shared Memory: The GPU Tile Cache
Each streaming multiprocessor (SM) on a GPU has a bank of on-chip memory called shared memory — typically 48 to 228 KB depending on the GPU generation. This memory sits physically next to the compute cores, with aggregate bandwidth exceeding 12,000 GB/s — over 16× the bandwidth of global memory. Unlike L1/L2 caches, shared memory is explicitly managed by the programmer. You decide exactly what goes in and when.
The shared memory tiling kernel follows a simple pattern:
1. Cooperative load: all threads in a thread block work together to load a tile of A and a tile of B from global memory into shared memory. Each thread loads one or a few elements.
2. Barrier synchronize: call `__syncthreads()` to ensure every thread’s load has completed before any thread starts reading from the shared buffer.
3. Compute from shared memory: each thread computes its portion of the output tile by reading from the fast shared memory arrays instead of slow global memory.
4. Advance to the next tile: move the tile window along the K dimension and repeat steps 1–3.
The kernel looks something like this:
```cuda
#define TILE 32

__global__ void matmul_smem(float *A, float *B, float *C, int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperative load: each thread loads one element
        As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Compute from shared memory
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = sum;
}
```
This pattern reduces global memory traffic by a factor of T (the tile size). Instead of each thread independently loading the full K dimension, the entire thread block cooperatively loads a T×T tile and every thread in the block reuses it. With T=32, the global memory traffic drops from O(M·N·K) to O(M·N·K / T) — a 32× reduction.
For more on how shared memory fits into the broader GPU memory hierarchy, see the GPU Memory Hierarchy concept page. For details on how SMs are organized, see Streaming Multiprocessor.
Shared memory tiling alone brings performance from around 1,987 to 2,980 GFLOP/s — a 1.5× improvement. That might seem modest given the 32× reduction in global memory traffic, but it’s because the bottleneck has shifted. Global memory bandwidth is no longer the primary constraint. Now the limiting factors are:
- Per-thread compute density: each thread computes just one output element, so the ratio of shared memory loads to FLOPs is still low. Each iteration of the inner loop performs 2 FLOPs but requires 2 shared memory reads.
- Shared memory bank conflicts: shared memory is divided into 32 banks. When multiple threads in a warp access the same bank simultaneously, their requests are serialized. In the naive tile layout, column-wise access patterns can trigger 32-way bank conflicts.
- Thread block occupancy: with 32×32 = 1,024 threads per block, only one or two blocks can reside on each SM simultaneously, limiting the scheduler’s ability to hide memory latency by switching between warps.
The next optimization addresses the biggest of these: per-thread compute density.
Register Tiling: More Work Per Thread
In the basic shared memory kernel, each thread computes exactly one element of the output matrix. It loads one value from A’s tile and one value from B’s tile per iteration of the inner loop, performs one multiply-add, and moves on. That’s a 1:1 ratio of memory loads to FLOPs — still heavily memory-bound, just from shared memory instead of global memory.
The solution is register tiling: have each thread compute a small block of output elements instead of just one. If a thread computes a TM×TN block (say, 8×8 = 64 elements), it loads TM values from A’s tile into registers and TN values from B’s tile into registers. Then it computes their outer product — TM×TN = 64 multiply-adds — from those TM+TN = 16 loaded values. Each loaded value is reused either TM or TN times.
The outer product approach works as follows. For each position k within the shared memory tile:
1. Load `regM[0..7] = As[threadRow*8 + 0..7][k]` — 8 values from A’s tile
2. Load `regN[0..7] = Bs[k][threadCol*8 + 0..7]` — 8 values from B’s tile
3. Compute the 8×8 outer product: `result[m][n] += regM[m] * regN[n]` for all 64 pairs
Step 3 performs 64 multiply-adds (128 FLOPs) from just 16 shared memory loads. That’s 8 FLOPs per load, compared to 2 FLOPs per load in the basic kernel.
The arithmetic intensity improvement is dramatic:
- 1×1 tiling (basic kernel): 2 shared memory loads per 2 FLOPs = 1.0 FLOP/load
- 8×8 tiling: 16 shared memory loads per 128 FLOPs = 8.0 FLOPs/load
- Reduction in shared memory traffic: 8× fewer loads per output element
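The per-thread microkernel behind these numbers is just a rank-1 update on a register-resident 8×8 accumulator. A CPU-side sketch of one k-step (the function name is illustrative; in the real kernel this body is unrolled by the compiler into register operations):

```cpp
#include <cassert>

// One k-step of the register-tiling inner loop: 16 loaded values (8 from A's
// tile column, 8 from B's tile row) produce a full 8x8 outer product —
// 64 multiply-adds accumulated entirely in registers.
void outer_product_step(const float regM[8], const float regN[8],
                        float acc[8][8]) {
    for (int m = 0; m < 8; m++)
        for (int n = 0; n < 8; n++)
            acc[m][n] += regM[m] * regN[n];
}
```

Each call touches shared memory 16 times but performs 128 FLOPs — the 8 FLOPs-per-load ratio cited above.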
But that’s only the shared memory side. The effect cascades to global memory too. With 8×8 register tiling, each thread block computes a much larger output tile with the same shared memory capacity. A 128×128 output tile is computed by 256 threads (16×16 blocks of 8×8 per thread), using the same 32×32 shared memory tiles as before. The global memory loads per output element drop by roughly 32× compared to the naive kernel.
Going from 1×1 to 8×8 per-thread output with optimized block dimensions (128×128 output tile, 8×8 per thread, 16×16 = 256 threads per block) pushes performance to 15,972 GFLOP/s — 68.7% of cuBLAS. This single optimization delivers the largest performance jump in the entire pipeline.
The register file on a GPU is the fastest memory available — effectively zero access latency, no bank conflicts, and massive aggregate bandwidth. An A6000 SM has 256 KB of register file, far larger than its shared memory. Each thread in our 8×8 tiling kernel uses 64 registers for the accumulator (one per output element), plus 16 for the loaded values, plus a few dozen for loop variables and addresses — roughly 100 registers per thread. With 256 threads per block, that’s about 25,600 registers per block — well within the SM’s capacity.
By staging data into registers via the outer product pattern, we exploit this fast storage to its fullest. The bottleneck shifts once more: now the limiting factor is how efficiently we load data from global memory into shared memory.
Vectorized Memory Access
Even with register tiling, the kernel still loads data from global memory one float at a time — 32-bit (4-byte) loads. But each thread can issue loads up to 128 bits wide in a single instruction. A scalar float load uses only 25% of that width, so the kernel issues four times as many load instructions as necessary.
The fix is vectorized loads: instead of loading one float (4 bytes) per
instruction, load a float4 (16 bytes) — four consecutive floats in a
single 128-bit transaction. This requires the data to be 16-byte aligned and
contiguous in memory.
The code change is straightforward:
```cuda
// Before: scalar load (32-bit transaction)
float val = A[row * K + col];

// After: vectorized load (128-bit transaction)
float4 vec = reinterpret_cast<float4*>(&A[row * K + col])[0];
// vec.x, vec.y, vec.z, vec.w contain 4 consecutive floats
```
For tile loads from A, this works naturally in row-major layout: consecutive elements within a row are contiguous. But for the shared memory layout, there’s a subtlety. During the compute phase, each thread needs to read down a column of A’s shared memory tile (to gather TM values for the outer product). Column access in a row-major shared memory array triggers bank conflicts and prevents vectorized reads.
The solution is to transpose A’s tile in shared memory during the load phase. Instead of storing A’s tile in its natural row-major order, we store it transposed so that what was a column becomes a row. Now, column access in the logical matrix becomes row access in physical shared memory — contiguous, bank-conflict-free, and vectorizable.
The transpose costs nothing extra: the data is being loaded from global memory anyway. We just write it to a different shared memory offset during the cooperative load phase. The global memory access pattern stays coalesced (reading rows of A), and the shared memory layout is rearranged for optimal compute-phase access.
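The transpose-on-load trick reduces to one index swap in the cooperative store. A CPU-side sketch with a shrunk tile size (the function name is illustrative; the real kernel does this per-thread with `float4` stores):

```cpp
#include <cassert>

const int T = 4; // tile size, shrunk for illustration (32 in the real kernel)

// Store A's tile transposed: logical tile element (m, k) lands at Ast[k][m].
// The global read still walks rows of A (coalesced); the compute phase can
// then read a logical column of the tile as a contiguous row of Ast.
void load_tile_transposed(const float* A, int ld, int tileRow, int tileCol,
                          float Ast[T][T]) {
    for (int m = 0; m < T; m++)
        for (int k = 0; k < T; k++)
            Ast[k][m] = A[(tileRow + m) * ld + (tileCol + k)];
}
```

The only change from a straight copy is writing `Ast[k][m]` instead of `Ast[m][k]` — same global traffic, better shared memory layout.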
The combination of float4 global memory loads and shared memory transpose
pushes performance to 18,237 GFLOP/s — 78.4% of cuBLAS. The final gap
is closed by warp-level tiling optimizations: arranging the thread-to-data
mapping within each warp so that threads within a warp access values that are
physically close in the register file, maximizing register-level locality.
With warp tiling, the kernel reaches 21,779 GFLOP/s — 93.7% of cuBLAS on an A6000 GPU. Warp tiling ensures that the 32 threads in a warp collectively work on a contiguous sub-tile of the output, so their shared memory access patterns are maximally coalesced and their register usage patterns allow the compiler to minimize register spills.
The Full Optimization Pipeline
From 309 GFLOP/s to 21,779 GFLOP/s — a 70× improvement on the exact same hardware, computing the exact same result. Each optimization targets a specific bottleneck in the memory hierarchy:
| Kernel | GFLOP/s | % of cuBLAS | Key Optimization |
|---|---|---|---|
| Naive | 309 | 1.3% | Baseline — one thread per output element |
| Global coalescing | 1,987 | 8.5% | Coalesced global memory reads |
| Shared memory tiling | 2,980 | 12.8% | Cooperative tile loads into SMEM |
| 1D block tiling | 8,474 | 36.4% | Multiple results per thread (1D) |
| 2D block tiling | 15,972 | 68.7% | 8×8 outer product per thread |
| Vectorized loads | 18,237 | 78.4% | float4 loads + SMEM transpose |
| Warp tiling | 21,779 | 93.7% | Warp-level register locality |
Notice the pattern: the biggest gains come from reducing data movement, not from increasing compute. The GPU had enough arithmetic throughput from the start. Every optimization is about feeding data to the compute units more efficiently — loading it fewer times, loading it in larger chunks, and keeping it in faster memory for longer.
The remaining 6.3% gap comes from cuBLAS’s collection of hundreds of specialized kernel variants. cuBLAS includes kernels hand-tuned for specific matrix sizes, data types, GPU architectures, and even specific aspect ratios (e.g., tall-skinny vs. square matrices). It uses autotuning to select the fastest variant at runtime. Reaching that last few percent requires the same kind of exhaustive, architecture-specific specialization — but the principles behind each optimization are exactly what we’ve covered here.
Beyond Tiling: Async, Tensor Cores, and Scheduling
The techniques above reach 93.7% of cuBLAS on Ampere-generation GPUs using standard CUDA cores. Modern GPU architectures — Hopper (H100) and Blackwell (B200) — add dedicated hardware accelerators that push GEMM performance another 10–40× beyond what CUDA cores alone can deliver.
These accelerators change the performance landscape entirely:
- Tensor Cores perform entire small matrix multiplies (e.g., 16×16×16 in FP16) in a single instruction, delivering 15× the throughput of CUDA cores for supported data types. On H100, tensor cores provide 989 TFLOP/s in FP16, compared to 67 TFLOP/s for CUDA cores.
- TMA (Tensor Memory Accelerator) offloads global-to-shared-memory copies to a dedicated DMA engine, freeing the CUDA cores to compute while data transfers happen asynchronously. This eliminates the "load then compute" serialization that limits our shared memory kernel.
- Warp Group MMA allows multiple warps (up to 4, forming a 128-thread warp group) to cooperatively feed data to tensor cores, keeping them saturated even as their throughput increases with each GPU generation.
- Persistent kernels keep thread blocks alive across multiple output tiles, amortizing launch overhead and enabling software pipelining of memory and compute stages. A persistent kernel processes its entire tile assignment without returning to the host.
The pipeline on Hopper uses a multi-stage software pipeline: while one warp group computes the current tile using tensor cores, another warp group prefetches the next tile from global memory via TMA. This overlap of compute and memory access hides latency entirely — the compute units never stall waiting for data.
For details on tensor core architecture, see the Tensor Cores concept page. For how high-bandwidth memory (HBM) feeds these accelerators, see HBM Memory.
State-of-the-art GEMM kernels on H100 GPUs reach 764+ TFLOP/s in FP16 — over 35× faster than our best CUDA-core kernel on A6000. This performance comes not from better algorithms but from hardware that is purpose-built for the tiled matrix multiply pattern we’ve been studying. The tiling hierarchy (global → shared → register → tensor core) remains the same; the hardware just executes each level faster.
What This Means for Deep Learning
This isn’t academic. Matrix multiplication is the computational bottleneck of every neural network, and the optimizations above directly determine how fast your models train and run inference.
Every major deep learning operation is a GEMM in disguise:
- Linear layers: `output = input @ weights` is a direct matrix multiply. A transformer with hidden size 4096 and batch size 32 performs a 32×4096 × 4096×4096 GEMM for every linear layer. A 70B-parameter model has approximately 80 such layers, each with multiple linear projections.
- Convolutions: the im2col algorithm reshapes input patches into a matrix, converting convolution into GEMM. A ResNet-50 forward pass performs over 3.8 billion multiply-adds, nearly all of which become GEMMs after the im2col transformation. Alternatively, Winograd and FFT-based methods also rely on matrix multiplies at their core.
- Attention: the self-attention mechanism requires two GEMMs per head — Q·Kᵀ to compute attention scores and Attention·V to compute the output. For a sequence length of 2048 with 32 heads, that’s 64 separate matrix multiplies per layer.
- Gradient computation: backpropagation adds two GEMMs for every forward-pass GEMM (one for the input gradient, one for the weight gradient), making training roughly 3× the compute of inference.
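The FLOP counts above all come from one formula. A sanity-check helper (`gemm_flops` is an illustrative name):

```cpp
#include <cassert>

// FLOPs for one GEMM: C (M×N) = A (M×K) × B (K×N) costs 2·M·N·K
// (one multiply and one add per inner-loop iteration).
double gemm_flops(double M, double N, double K) { return 2.0 * M * N * K; }
```

The article's benchmark GEMM (4096³) works out to about 137.4 GFLOP; the batch-32 linear-layer GEMM (32×4096 by 4096×4096) is only about 1.07 GFLOP, which is why small batches struggle to fill the GPU.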
When you call torch.mm() or F.linear(), PyTorch dispatches to cuBLAS (or
cuDNN for convolutions), which runs exactly the optimizations described in this
article. Understanding tiled matmul gives you practical intuition for:
- Why batch size affects GPU utilization: larger batches produce larger matrices, which tile more efficiently and keep more SMs occupied. A batch size of 1 may use only 10% of the GPU’s compute capability because the matrices are too narrow to fill all SMs with work.
- Why sequence length matters for attention: longer sequences produce larger Q·Kᵀ matrices. The tiling efficiency improves with matrix size, which is why attention gets relatively cheaper (per token) as sequences grow — until the O(N²) memory footprint becomes the bottleneck.
- Why mixed precision is 2–8× faster: FP16 and BF16 enable tensor cores, which execute matrix multiplies in dedicated hardware at 15× the throughput of CUDA cores. This is why nearly all modern training runs use mixed precision.
- Why model parallelism works: tensor parallelism splits weight matrices across GPUs. Each GPU computes a partial GEMM on a smaller matrix, then the results are combined via all-reduce. The communication cost is proportional to the output size, not the FLOP count.
For how GPUs communicate during distributed training, see NCCL Communication. For the parallelism strategies that split GEMMs across devices, see Distributed Parallelism.
Key Takeaways
- Memory access patterns determine GPU performance — the naive kernel achieves 1.3% of peak because of non-coalesced global memory access. The algorithm is fine; the data movement is catastrophic.
- Tiling is the fundamental optimization — load a block into fast memory, reuse it many times. On CPU it’s L1 cache. On GPU it’s shared memory. The principle is identical.
- Register tiling multiplies the gains — computing an 8×8 block per thread instead of 1 element reduces memory loads by 32× and reaches 68.7% of cuBLAS.
- Vectorization and warp tiling close the gap — float4 loads and warp-level register locality push to 93.7% of cuBLAS.
- Tensor cores change the game entirely — dedicated matrix hardware delivers 15× the throughput of CUDA cores, making bf16 the default for training.
Further Reading
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance — Simon Boehm’s detailed worklog reaching 93.7% of cuBLAS on A6000
- Inside NVIDIA GPUs: Anatomy of High-Performance Matmul Kernels — Aleksa Gordic’s deep dive into Hopper-era SOTA techniques reaching 764 TFLOP/s
- CUTLASS: CUDA Templates for Linear Algebra — NVIDIA’s open-source templated GEMM library
- Triton: An Intermediate Language for GPU Programming — OpenAI’s DSL that generates optimized GPU kernels from Python
- CUDA C++ Best Practices Guide — NVIDIA’s official optimization guide
