GPU & High-Performance Computing

CUDA, tensor cores, multi-GPU communication, and cluster-scale workload orchestration.

26 concepts

All GPU & High-Performance Computing Concepts

March 15, 2026

Slurm Fundamentals: Job Scheduling on HPC Clusters

Complete guide to Slurm — architecture, core commands, job lifecycle, job scripts, array jobs, dependencies, monitoring with squeue/sacct, and troubleshooting failed jobs on HPC clusters.

hpc slurm job-scheduling cluster-computing

August 16, 2024

High Bandwidth Memory (HBM)

How HBM works: 3D-stacked DRAM, TSVs, and silicon interposers explained with interactive visualizations — from the memory wall to HBM4 and the roofline model.

hbm memory gpu bandwidth 3d-stacking tsv ai-hardware

August 17, 2025

GPU Memory Hierarchy & Optimization

Master the GPU memory hierarchy from registers to global memory: coalescing patterns, bank conflicts, and optimization strategies for maximum performance.

GPU CUDA memory-optimization performance parallel-computing HBM cache

January 29, 2025

Multi-GPU Communication: NVLink, PCIe, and NCCL

How GPUs talk: the bandwidth cliff from HBM to Ethernet, NVLink 5 and GB200 NVL72 topologies, ring AllReduce step by step, and choosing between NCCL, Gloo, and MPI.

multi-GPU NVLink PCIe NCCL distributed training GPU interconnect NVSwitch AllReduce InfiniBand

March 15, 2026

Slurm GPU Allocation for Distributed Training

Complete guide to GPU allocation on Slurm — --gres flags, CUDA_VISIBLE_DEVICES remapping, GPU topology and NVLink binding, MIG partitioning, production job scripts, and debugging common GPU errors.

hpc slurm gpu-computing distributed-training

March 15, 2026

Slurm Resource Management and Job Priority

How Slurm decides which jobs run first — priority factors, fair-share scheduling, backfill, and monitoring commands (squeue, sinfo, sacct).

hpc slurm job-scheduling resource-management

January 15, 2025

NVIDIA Unified Virtual Memory

NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.

cuda unified-memory gpu page-migration memory-management virtual-memory uvm nvidia

August 6, 2025

Flynn's Classification: Taxonomy of Computer Architectures

Flynn's Classification explained — SISD, SIMD, MISD, MIMD with interactive architecture explorer, SIMD evolution from MMX to AMX, branch divergence visualization, and workload-architecture throughput comparison.

performance hardware architecture parallelism

March 15, 2026

MPI Fundamentals: Message Passing for Distributed Computing

Complete MPI guide — point-to-point and collective communication with real C and mpi4py code, deadlock simulation, performance benchmarking, communicator splitting, and debugging on HPC clusters.

hpc mpi distributed-computing parallel-programming

March 15, 2026

OpenMP: Shared-Memory Parallel Programming

OpenMP parallel programming: fork-join model, scheduling, data races, false sharing, NUMA thread affinity, and GPU offloading.

hpc parallel-programming openmp multithreading performance c++

August 16, 2024

Page Migration & Fault Handling

CUDA page migration and fault handling between CPU and GPU memory. Learn TLB management, DMA transfers, and memory optimization.

page-migration page-fault virtual-memory tlb gpu memory-management

March 15, 2026

Slurm Accounting and Resource Tracking

How Slurm tracks resource consumption through account hierarchies, TRES billing, and resource limits — sacctmgr, sreport, and the association model explained.

hpc slurm accounting resource-management

November 2, 2025

CUDA Multi-Process Service (MPS): GPU Sharing for Concurrent Workloads

Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.

hardware gpu cuda parallelism optimization

March 15, 2026

HPC Performance Optimization: Scaling, Profiling, and Tuning

Mastering HPC performance — Amdahl's Law, Gustafson's Law, strong vs weak scaling, roofline model, communication-computation overlap, load balancing, and profiling with Nsight and VTune.

hpc performance optimization parallel-computing

March 15, 2026

Slurm Backfill Scheduling: How Small Jobs Fill the Gaps

How sched/backfill works — the algorithm that lets small jobs run in gaps while large jobs wait, why accurate time limits matter, and the key tuning parameters (bf_interval, bf_window, bf_max_job_test).

hpc slurm scheduling backfill

June 25, 2026

NVIDIA vs AMD for Deep Learning: CUDA vs ROCm and the Datacenter Accelerators

NVIDIA vs AMD for deep learning compared at both layers: the CUDA vs ROCm software moat, the microarchitecture (warp vs wavefront, SM vs CU, Tensor vs Matrix Cores), and the datacenter accelerators (H100/H200/B200 vs MI300X/MI325X).

hardware gpu cuda rocm deep-learning

November 2, 2025

NVIDIA Device Files in /dev/

Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.

gpu hardware linux nvidia device-files cuda containers

January 26, 2025

Distributed Parallelism in Deep Learning

GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.

GPU Distributed Training Data Parallel Tensor Parallel Pipeline Parallel ZeRO DeepSpeed PyTorch

May 6, 2026

CUDA Context vs Streams vs MPS: Process Isolation, Concurrency, and Multi-Tenancy

How CUDA contexts, streams, and MPS compare: a context is a per-process container of GPU state, a stream is an in-order queue inside a context, and MPS lets multiple processes share a single GPU concurrently. Three layers, three different problems.

hardware gpu programming parallelism cuda

May 6, 2026

CUDA Streams: Asynchronous Execution and Concurrency

A CUDA stream is a queue of GPU operations that execute in order. Understanding streams is the difference between a GPU at 30% utilization and one running flat out — they are how kernels and memory copies overlap on real hardware.

hardware gpu programming parallelism cuda

April 15, 2025

Understanding NVIDIA Kubernetes GPU Operator

Automate NVIDIA GPU management in Kubernetes with the GPU Operator. Deploy drivers, device plugins, and monitoring as DaemonSets.

hardware gpu kubernetes infrastructure automation

April 10, 2025

Understanding CUDA Contexts

Explore the concept of CUDA contexts, their role in managing GPU resources, and how they enable parallel execution across multiple CPU threads.

hardware gpu programming parallelism

January 30, 2025

Understanding NVIDIA Persistence Daemon

Eliminating GPU initialization latency with nvidia-persistenced, a userspace daemon that keeps GPU driver state initialized while no clients are connected, so applications start without a re-initialization delay.

gpu nvidia performance driver optimization

January 15, 2025

NCCL: How NVIDIA Collective Communication Works

A deep dive into NCCL internals: communicators and channels, how it picks ring/tree/NVLS algorithms and LL/LL128/Simple protocols, reading NCCL_DEBUG logs, and tuning and debugging distributed training.

GPU NCCL Distributed Training Multi-GPU Communication Primitives AllReduce

December 15, 2024

Tensor Cores Explained: Mixed Precision & Matrix Acceleration

NVIDIA Tensor Cores explained: architecture-, precision-, and workload-dependent matrix acceleration for AI training and inference on CUDA GPUs.

GPU Tensor Cores Deep Learning Matrix Multiplication CUDA

January 15, 2024

GPU Streaming Multiprocessor (SM)

Deep dive into the Streaming Multiprocessor, the fundamental processing unit of modern GPUs: its architecture, execution model, and memory hierarchy.

GPU CUDA parallel-computing SM hardware-architecture tensor-cores RT-cores
