GPU Memory Hierarchy & Optimization
Master the GPU memory hierarchy from registers to global memory: understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance.
Explore machine learning concepts related to CUDA, with clear explanations and practical insights.
NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.
Complete guide to PyTorch pin_memory — how DMA transfers work, when pinning helps vs hurts, NUMA effects, profiling with torch.profiler, num_workers interaction, and debugging slow data loading.
Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.
Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.
Understand how containerized processes access GPU hardware through device files, bind mounts, and the NVIDIA container runtime. Learn the kernel driver vs user-space library distinction.
A CUDA context is a per-device container of GPU state; a CUDA stream is an in-order execution queue inside a context. They answer different questions and operate at different lifetimes — confusing them is one of the most common ways CUDA code ends up slow or unsafe.
A CUDA stream is a queue of GPU operations that execute in order. Understanding streams is the difference between a GPU at 30% utilization and one running flat out — they are how kernels and memory copies overlap on real hardware.
NVIDIA Tensor Cores explained: architecture-, precision-, and workload-dependent matrix acceleration for AI training and inference on CUDA GPUs.
A deep dive into the Streaming Multiprocessor, the fundamental processing unit of modern GPUs: its architecture, execution model, and memory hierarchy.