Tagged with

cuda

Explore machine learning concepts related to CUDA, with clear explanations and practical insights.

Concepts Related to cuda

August 17, 2025

GPU Memory Hierarchy & Optimization

Master the GPU memory hierarchy from registers to global memory, and understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance.

GPU CUDA memory-optimization performance parallel-computing HBM cache

8 min read · Concept

January 15, 2025

NVIDIA Unified Virtual Memory

NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.

cuda unified-memory gpu page-migration memory-management virtual-memory uvm nvidia

12 min read · Concept

November 2, 2025

NVIDIA Device Files in /dev/

Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.

gpu hardware linux nvidia device-files cuda containers

11 min read · Concept

November 2, 2025

CUDA Multi-Process Service (MPS)

Learn how CUDA Multi-Process Service (MPS) enables GPU sharing: run kernels from multiple processes concurrently and maximize GPU utilization.

hardware gpu cuda parallelism optimization

7 min read · Concept

December 31, 2024

Pinned Memory and DMA Transfers

Understanding PyTorch pin_memory for faster CPU-to-GPU data transfers using DMA (Direct Memory Access) and page-locked memory.

pytorch gpu memory dma cuda performance

5 min read · Concept

December 15, 2024

Tensor Cores: Accelerating Deep Learning

NVIDIA Tensor Cores explained: mixed-precision matrix operations delivering speedups of up to 10x for AI training and inference on CUDA GPUs.

GPU Tensor Cores Deep Learning Matrix Multiplication CUDA

6 min read · Concept

January 15, 2024

GPU Streaming Multiprocessor (SM)

A deep dive into the fundamental processing unit of modern GPUs: the Streaming Multiprocessor's architecture, execution model, and memory hierarchy.

GPU CUDA parallel-computing SM hardware-architecture tensor-cores RT-cores

8 min read · Concept