PyTorch DataLoader Pipeline
PyTorch DataLoader deep dive — Dataset, Sampler, Workers, Collate internals, num_workers throughput profiling, memory analysis, serialization costs, production patterns (LMDB, WebDataset), and bottleneck diagnosis.
Explore machine learning concepts related to gpu. Clear explanations and practical insights.
PyTorch DataLoader deep dive — Dataset, Sampler, Workers, Collate internals, num_workers throughput profiling, memory analysis, serialization costs, production patterns (LMDB, WebDataset), and bottleneck diagnosis.
How HBM works: 3D-stacked DRAM, TSVs, and silicon interposers explained with interactive visualizations — from the memory wall to HBM4 and the roofline model.
Master GPU memory hierarchy from registers to global memory, understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance
NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.
Complete guide to PyTorch pin_memory — how DMA transfers work, when pinning helps vs hurts, NUMA effects, profiling with torch.profiler, num_workers interaction, and debugging slow data loading.
CUDA page migration and fault handling between CPU and GPU memory. Learn TLB management, DMA transfers, and memory optimization.
Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.
Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.
Master Structure of Arrays (SoA) vs Array of Structures (AoS) data layouts for optimal cache efficiency, SIMD vectorization, and GPU memory coalescing.
GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.
Understand how containerized processes access GPU hardware through device files, bind mounts, and the NVIDIA container runtime. Learn the kernel driver vs user-space library distinction.
Learn nvidia-modeset for display configuration on Linux. Understand kernel mode-setting, DRM integration, and GPU drivers.
How CUDA contexts, streams, and MPS compare: a context is a per-process container of GPU state, a stream is an in-order queue inside a context, and MPS lets multiple processes share a single GPU concurrently. Three layers, three different problems.
A CUDA stream is a queue of GPU operations that execute in order. Understanding streams is the difference between a GPU at 30% utilization and one running flat out — they are how kernels and memory copies overlap on real hardware.
Automate NVIDIA GPU management in Kubernetes with the GPU Operator. Deploy drivers, device plugins, and monitoring as DaemonSets.
Explore the concept of CUDA contexts, their role in managing GPU resources, and how they enable parallel execution across multiple CPU threads.
Eliminating GPU initialization latency through nvidia-persistenced - a userspace daemon that maintains GPU driver state for optimal startup performance.
Interactive Flash Attention visualization - the IO-aware algorithm achieving memory-efficient exact attention through tiling and kernel fusion.
A deep dive into NCCL internals: communicators and channels, how it picks ring/tree/NVLS algorithms and LL/LL128/Simple protocols, reading NCCL_DEBUG logs, and tuning and debugging distributed training.
NVIDIA Tensor Cores explained: architecture-, precision-, and workload-dependent matrix acceleration for AI training and inference on CUDA GPUs.
Deep dive into the fundamental processing unit of modern GPUs - the Streaming Multiprocessor architecture, execution model, and memory hierarchy