Slurm Fundamentals: Job Scheduling on HPC Clusters
Complete guide to Slurm — architecture, core commands, job lifecycle, job scripts, array jobs, dependencies, monitoring with squeue/sacct, and troubleshooting failed jobs on HPC clusters.
CUDA, tensor cores, multi-GPU communication, and cluster-scale workload orchestration.
High Bandwidth Memory (HBM) architecture: 3D-stacked DRAM with TSV technology powering NVIDIA GPUs and AI accelerators with TB/s bandwidth.
Master the GPU memory hierarchy from registers to global memory: coalescing patterns, bank conflicts, and optimization strategies for maximum performance.
Compare NVLink vs PCIe bandwidth for multi-GPU training. Learn GPU topologies, NVSwitch, and choose between NCCL, Gloo, and MPI for distributed deep learning.
Complete guide to GPU allocation on Slurm — --gres flags, CUDA_VISIBLE_DEVICES remapping, GPU topology and NVLink binding, MIG partitioning, production job scripts, and debugging common GPU errors.
How Slurm decides which jobs run first — priority factors, fair-share scheduling, backfill, and monitoring commands (squeue, sinfo, sacct).
NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.
Flynn's Classification explained — SISD, SIMD, MISD, MIMD with interactive architecture explorer, SIMD evolution from MMX to AMX, branch divergence visualization, and workload-architecture throughput comparison.
Complete MPI guide — point-to-point and collective communication with real C and mpi4py code, deadlock simulation, performance benchmarking, communicator splitting, and debugging on HPC clusters.
OpenMP parallel programming: fork-join model, scheduling, data races, false sharing, NUMA thread affinity, and GPU offloading.
CUDA page migration and fault handling between CPU and GPU memory. Learn TLB management, DMA transfers, and memory optimization.
How Slurm tracks resource consumption through account hierarchies, TRES billing, and resource limits — sacctmgr, sreport, and the association model explained.
Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.
Mastering HPC performance — Amdahl's Law, Gustafson's Law, strong vs weak scaling, roofline model, communication-computation overlap, load balancing, and profiling with Nsight and VTune.
How sched/backfill works — the algorithm that lets small jobs run in gaps while large jobs wait, why accurate time limits matter, and the key tuning parameters (bf_interval, bf_window, bf_max_job_test).
Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.
GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.
A CUDA context is a per-device container of GPU state; a CUDA stream is an in-order execution queue inside a context. They answer different questions and operate at different lifetimes — confusing them is one of the most common ways CUDA code ends up slow or unsafe.
A CUDA stream is a queue of GPU operations that execute in order. Understanding streams is the difference between a GPU at 30% utilization and one running flat out — they are how kernels and memory copies overlap on real hardware.
Automate NVIDIA GPU management in Kubernetes with the GPU Operator. Deploy drivers, device plugins, and monitoring as DaemonSets.
Explore the concept of CUDA contexts, their role in managing GPU resources, and how they enable parallel execution across multiple CPU threads.
Eliminating GPU initialization latency with nvidia-persistenced, a userspace daemon that keeps GPU driver state loaded so that applications start faster.
Master NVIDIA NCCL for multi-GPU deep learning. Learn AllReduce, ring algorithms, and GPU-Direct communication for efficient distributed training on CUDA.
NVIDIA Tensor Cores explained: architecture-, precision-, and workload-dependent matrix acceleration for AI training and inference on CUDA GPUs.
Deep dive into the Streaming Multiprocessor, the fundamental processing unit of modern GPUs: architecture, execution model, and memory hierarchy.