Multi-GPU Communication: NVLink, PCIe, and NCCL
How GPUs talk: the bandwidth cliff from HBM to Ethernet, NVLink 5 and GB200 NVL72 topologies, ring AllReduce step by step, and choosing between NCCL, Gloo, and MPI.
11 min read · Concept
A deep dive into NCCL internals: communicators and channels, how it picks ring/tree/NVLS algorithms and LL/LL128/Simple protocols, reading NCCL_DEBUG logs, and tuning and debugging distributed training.