Multi-GPU Communication: NVLink, PCIe, and NCCL
How GPUs talk: the bandwidth cliff from HBM to Ethernet, NVLink 5 and GB200 NVL72 topologies, ring AllReduce step by step, and choosing between NCCL, Gloo, and MPI.
11 min read · Concept
A deep dive into NCCL internals: communicators and channels, how it picks ring/tree/NVLS algorithms and LL/LL128/Simple protocols, reading NCCL_DEBUG logs, and tuning and debugging distributed training.
Compare PyTorch DataParallel vs DistributedDataParallel for multi-GPU training. Learn GIL limitations, NCCL AllReduce, and DDP best practices.