CUDA Matrix Multiplication: From Naive to Near-cuBLAS
Step-by-step CUDA matrix multiplication optimization with 9 interactive visualizations. From naive kernels through shared memory tiling to near-cuBLAS speeds.
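The shared-memory tiling stage the article describes can be sketched roughly as follows. This is a minimal illustration, not the article's actual kernel: the tile width `TILE`, the kernel name `matmul_tiled`, and the square N×N matrix assumption are all choices made here for brevity.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// C = A * B for square N x N row-major matrices.
// Each block computes one TILE x TILE tile of C, staging tiles of
// A and B through shared memory to cut global-memory traffic.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk matching tiles of A and B along the shared dimension.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Zero-pad loads that fall outside the matrix.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before any thread reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the next tile overwrites
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

A typical launch pairs a `TILE x TILE` thread block with a grid that covers C, e.g. `dim3 block(TILE, TILE); dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE); matmul_tiled<<<grid, block>>>(dA, dB, dC, N);`. Compared with a naive kernel, each element of A and B is read from global memory N/TILE times fewer, which is the core of the speedup before register blocking and other tricks close the remaining gap to cuBLAS.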