CUDA Matrix Multiplication Optimization: From Naive to Near-cuBLAS
Step-by-step CUDA matrix multiplication optimization with 9 interactive visualizations. From naive kernels through shared memory tiling to near-cuBLAS speeds.
Explore TensorRT optimization: layer fusion, INT8 quantization, kernel auto-tuning, and deployment strategies with 8+ interactive visualizations.
Kernel fusion merges multiple neural network operations into a single GPU kernel to eliminate intermediate memory writes. This article explains how fusion works, why it helps deep learning workloads, and how TensorRT and torch.compile use it.
Deep dive into CPython internals: bytecode compilation, memory management, the GIL, object model, and garbage collection.