Tagged with

gpu

Explore machine learning concepts related to GPUs, with clear explanations and practical insights.

Concepts Related to gpu

August 16, 2024

High Bandwidth Memory (HBM)

High Bandwidth Memory (HBM) architecture: 3D-stacked DRAM with TSV technology powering NVIDIA GPUs and AI accelerators with TB/s bandwidth.

hbm memory gpu bandwidth 3d-stacking tsv ai-hardware

11 min read · Concept

August 17, 2025

GPU Memory Hierarchy & Optimization

Master the GPU memory hierarchy from registers to global memory, including coalescing patterns, bank conflicts, and optimization strategies for maximum performance.

GPU CUDA memory-optimization performance parallel-computing HBM cache

8 min read · Concept

January 15, 2025

NVIDIA Unified Virtual Memory

NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.

cuda unified-memory gpu page-migration memory-management virtual-memory uvm nvidia

12 min read · Concept

August 16, 2024

Page Migration & Fault Handling

CUDA page migration and fault handling between CPU and GPU memory. Learn TLB management, DMA transfers, and memory optimization.

page-migration page-fault virtual-memory tlb gpu memory-management

9 min read · Concept

November 2, 2025

NVIDIA Device Files in /dev/

Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.

gpu hardware linux nvidia device-files cuda containers

11 min read · Concept

January 26, 2025

Distributed Parallelism in Deep Learning

GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.

GPU Distributed Training Data Parallel Tensor Parallel Pipeline Parallel ZeRO DeepSpeed PyTorch

10 min read · Concept

January 6, 2025

Understanding nvidia-modeset: Kernel Mode-Setting for NVIDIA GPUs

Learn nvidia-modeset for display configuration on Linux. Understand kernel mode-setting, DRM integration, and GPU drivers.

linux kernel nvidia gpu display drm kms

8 min read · Concept

November 2, 2025

CUDA Multi-Process Service (MPS)

Learn CUDA Multi-Process Service (MPS) for GPU sharing. Enable concurrent kernel execution from multiple processes and maximize GPU utilization.

hardware gpu cuda parallelism optimization

7 min read · Concept

April 15, 2025

Understanding NVIDIA Kubernetes GPU Operator

Automate NVIDIA GPU management in Kubernetes with the GPU Operator. Deploy drivers, device plugins, and monitoring as DaemonSets.

hardware gpu kubernetes infrastructure automation

8 min read · Concept

April 10, 2025

Understanding CUDA Contexts

Explore the concept of CUDA contexts, their role in managing GPU resources, and how they enable parallel execution across multiple CPU threads.

hardware gpu programming parallelism

3 min read · Concept

January 31, 2025

SoA vs AoS: Data Layout Optimization

Master Structure of Arrays (SoA) vs Array of Structures (AoS) data layouts for optimal cache efficiency, SIMD vectorization, and GPU memory coalescing.

performance memory optimization SIMD GPU cache

6 min read · Concept

January 30, 2025

Understanding NVIDIA Persistence Daemon

Reducing GPU initialization latency with nvidia-persistenced, a userspace daemon that keeps GPU driver state loaded so applications start faster.

gpu nvidia performance driver optimization

11 min read · Concept

January 21, 2025

Flash Attention: IO-Aware Exact Attention

Interactive Flash Attention visualization - the IO-aware algorithm achieving memory-efficient exact attention through tiling and kernel fusion.

llms optimization attention gpu

7 min read · Concept

January 15, 2025

NCCL: High-Performance Multi-GPU Communication

Master NVIDIA NCCL for multi-GPU deep learning. Learn AllReduce, ring algorithms, and GPUDirect communication for efficient distributed training on CUDA.

GPU NCCL Distributed Training Multi-GPU Communication Primitives AllReduce

8 min read · Concept

December 31, 2024

PyTorch DataLoader Pipeline

Understanding how PyTorch DataLoader moves data from disk through CPU to GPU, including Dataset, Sampler, Workers, and Collate components.

pytorch dataloader data-pipeline deep-learning gpu

4 min read · Concept

December 31, 2024

Pinned Memory and DMA Transfers

Understanding PyTorch pin_memory for faster CPU to GPU data transfers using DMA (Direct Memory Access) and page-locked memory.

pytorch gpu memory dma cuda performance

5 min read · Concept

December 15, 2024

Tensor Cores: Accelerating Deep Learning

NVIDIA Tensor Cores explained: mixed-precision matrix operations that can deliver up to 10x speedups for AI training and inference on CUDA GPUs.

GPU Tensor Cores Deep Learning Matrix Multiplication CUDA

6 min read · Concept

January 15, 2024

GPU Streaming Multiprocessor (SM)

A deep dive into the Streaming Multiprocessor, the fundamental processing unit of modern GPUs: its architecture, execution model, and memory hierarchy.

GPU CUDA parallel-computing SM hardware-architecture tensor-cores RT-cores

8 min read · Concept