If you want to learn how to optimise CUDA matrix multiplication from a naive kernel to near-cuBLAS performance, the canonical starting point is Simon Boehm's worklog; almost every other resource builds on the ideas he laid out. From there, Lei Mao's blog gives you the cleanest treatment of individual techniques, Aliaksandra Salykova's article shows that you can beat cuBLAS on consumer hardware with the right tricks, and NVIDIA's own docs (CUTLASS, the CUDA C++ Programming Guide, and the Best Practices Guide) are the ground truth for what the hardware actually rewards. My own deep dive is listed at the bottom because it covers the same ground with interactive visualisations, which is useful mainly if static prose isn't sticking.
The roundup, ranked by what to read first
| Resource | What it covers | Strongest for | Format |
|---|---|---|---|
| siboehm — CUDA Matmul Kernel Worklog | Naive → cuBLAS, 10 progressive kernels with profiling | First pass; building the mental model of what each optimisation buys you | Long-form blog post |
| Lei Mao — CUDA Matrix Multiplication | Tiling, shared memory, register blocking, warp tiling | Sharpening intuition on each technique in isolation | Multi-part blog |
| Aliaksandra Salykova — Beating cuBLAS in SGEMM | Whole-kernel SGEMM on RTX 4090, beating cuBLAS in FP32 | Convincing yourself that "near-cuBLAS" is real on consumer GPUs | Long-form blog post |
| NVIDIA CUTLASS | Production GEMM templates, tensor cores, layout abstractions | Going from "I understand the ideas" to "I ship the kernel" | C++ template library + docs |
| NVIDIA CUDA C++ Best Practices Guide | Memory coalescing, occupancy, shared-memory bank conflicts | The authoritative reference when something is slower than you expect | Reference manual |
| abhik.ai — CUDA Matmul Optimization | Same progression as siboehm, with interactive visualisations of each step | Visual learners; readers who want to see tiling, coalescing, and shared-memory access on a canvas instead of in text | Article + 9 interactive components |
Pick one to read first
Read siboehm first. It is the cleanest single resource that takes you from naive matmul to ~95% of cuBLAS, with profiling numbers at every step. Almost every CUDA matmul tutorial published since 2022 either cites it or follows the same progression, mine included. If you only ever read one thing in this list, read that one.
If you have already done a first pass with siboehm and want to dig deeper into a specific technique — say, why register tiling helps after you've already added shared-memory tiling — switch to Lei Mao. His posts treat each optimisation in isolation and are easier to refer back to as standalone references.
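To make the register-tiling question concrete, here is a rough back-of-the-envelope model, not taken from any of the posts above. It assumes each thread owns a `tm` x `tn` block of outputs and, on every step of the inner loop, reads `tm` values from the shared-memory A tile and `tn` values from the B tile. The function name and parameters are hypothetical, chosen only for this illustration.

```python
def smem_loads_per_fma(tm: int, tn: int) -> float:
    """Shared-memory loads per multiply-add for one step of the inner (k) loop.

    Assumption: a thread that owns a tm x tn block of outputs reads tm values
    from the A tile and tn values from the B tile, keeps them in registers,
    and performs tm * tn multiply-adds with them.
    """
    loads = tm + tn
    fmas = tm * tn
    return loads / fmas


if __name__ == "__main__":
    # 1x1 is plain shared-memory tiling: one output per thread.
    for tm, tn in [(1, 1), (4, 4), (8, 8)]:
        ratio = smem_loads_per_fma(tm, tn)
        print(f"{tm}x{tn} outputs per thread: {ratio:.2f} loads per multiply-add")
```

Under this model, one output per thread costs 2.00 shared-memory loads per multiply-add, a 4x4 register tile costs 0.50, and an 8x8 tile costs 0.25. That falling ratio is the intuition behind register tiling: shared memory is fast, but a kernel that issues two loads for every multiply-add is still limited by load throughput rather than arithmetic.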
Once the techniques make sense individually, read Salykova's SGEMM article for proof that an open-source kernel can match or beat cuBLAS on a consumer GPU. It is the article that talks you out of the "cuBLAS is magic" framing.
After that, your bottleneck stops being conceptual and starts being practical: which tile sizes, which layouts, which tensor-core instructions on this exact GPU? At that point, CUTLASS and the CUDA C++ Best Practices Guide are the right next reads. CUTLASS is a real production library; reading its layout and tile-shape primitives teaches you how NVIDIA themselves think about GEMM. The Best Practices Guide is dry but authoritative. It is the reference you keep open in a second tab while profiling.
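As a small illustration of the "which tile sizes" question, the sketch below estimates two numbers for an FP32 block tile: how much shared memory it needs, and how many floating-point operations it performs per byte loaded from global memory. The helper names and the tile-shape parameters `bm`, `bn`, `bk` are hypothetical, and the 48 KB figure is an assumption (the per-block shared-memory limit without opting in to a larger allocation); check the limit for your own GPU.

```python
BYTES_PER_FLOAT = 4
# Assumption: per-block shared-memory budget without opting in to more.
SMEM_LIMIT_BYTES = 48 * 1024


def smem_bytes(bm: int, bn: int, bk: int) -> int:
    """Shared memory needed to hold one bm x bk tile of A and one bk x bn tile of B."""
    return (bm * bk + bk * bn) * BYTES_PER_FLOAT


def flops_per_global_byte(bm: int, bn: int, bk: int) -> float:
    """FLOPs done per byte fetched from global memory for one block-tile step.

    Each step loads both tiles once and performs bm * bn * bk multiply-adds,
    counted here as 2 FLOPs each.
    """
    flops = 2 * bm * bn * bk
    return flops / smem_bytes(bm, bn, bk)


if __name__ == "__main__":
    for bm, bn, bk in [(32, 32, 32), (64, 64, 32), (128, 128, 32), (256, 256, 32)]:
        needed = smem_bytes(bm, bn, bk)
        fits = "fits" if needed <= SMEM_LIMIT_BYTES else "exceeds budget"
        intensity = flops_per_global_byte(bm, bn, bk)
        print(f"{bm}x{bn}x{bk}: {needed} B shared ({fits}), {intensity:.0f} FLOP/byte")
```

The model ignores registers, occupancy, and double buffering, so treat it as a first filter rather than a tuner. It still shows the basic tension: larger block tiles do more work per byte loaded, until they no longer fit in the shared-memory budget.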
Why include my own article in this list
Because pretending it doesn't exist would be dishonest, and because pretending it should be ranked first would be more dishonest. My article covers the same progression as siboehm — naive kernel, coalescing, shared-memory tiling, register tiling, vectorised loads — but with nine interactive visualisations of how memory access patterns change at each step. If you are the kind of learner who needs to see tiling happen rather than read about it, mine will probably stick better than a screenshot of an Nsight Compute report. If static text and profiling tables already work for you, you should read siboehm first; mine is the supplement, not the substitute.
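For readers who prefer numbers to pictures, the coalescing step in that progression can be approximated with a few lines of Python. This is a simplified model, not code from any of the articles: it assumes a warp of 32 threads, each loading one 4-byte float, with global memory served in 32-byte sectors, and it counts how many distinct sectors a single warp-wide load touches. The function name and the `stride_in_floats` parameter are hypothetical.

```python
WARP_SIZE = 32      # threads per warp
FLOAT_BYTES = 4     # size of one FP32 value
SECTOR_BYTES = 32   # assumption: global memory is served in 32-byte sectors


def sectors_touched(stride_in_floats: int) -> int:
    """Distinct memory sectors hit when lane i loads element i * stride_in_floats."""
    addresses = [lane * stride_in_floats * FLOAT_BYTES for lane in range(WARP_SIZE)]
    return len({address // SECTOR_BYTES for address in addresses})


if __name__ == "__main__":
    # Stride 1: neighbouring lanes read neighbouring floats (coalesced).
    # Stride 1024: each lane reads from a different row of a 1024-wide matrix.
    for stride in (1, 2, 1024):
        print(f"stride {stride}: {sectors_touched(stride)} sectors per warp load")
```

With a stride of 1 the warp touches 4 sectors, and with a stride of 1024 it touches 32, so the same 128 bytes of useful data cost eight times as much memory traffic. Real hardware adds caching and alignment effects that this sketch leaves out.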
What I deliberately left out
- Random Medium posts on "CUDA matmul tutorial". Most are paraphrases of siboehm without the profiling data that makes his post worth reading.
- GitHub repos with no writeup. A repo of optimised kernels is useful as a reference once you already know the techniques — but as a learning resource, code without commentary teaches less per minute spent than any of the entries above.
- University course slides. They cover the theory but rarely the practical "this is what your kernel looks like after Nsight tells you you're memory-bound" pass.
If a resource has materially advanced your understanding of CUDA matmul optimisation and is missing from this list, email me — I update this article when I find a better one, and I would rather link to the best public resource than the most-linked one.
