Multi-GPU Communication: NVLink, PCIe, and NCCL

Summary
How GPUs talk: the bandwidth cliff from HBM to Ethernet, NVLink 5 and GB200 NVL72 topologies, ring AllReduce step by step, and choosing between NCCL, Gloo, and MPI.

A single GPU never waits for a network. The moment you split a model or a batch across two of them, your training speed stops being a compute number and becomes a plumbing number: how fast can device A's gradients reach device B? At peak link rates, moving 1 GB of gradients takes about 0.6 ms over NVLink 5 (1.8 TB/s), 16 ms over a PCIe Gen5 ×16 link (64 GB/s), and 80 ms over a 100-gigabit Ethernet link (12.5 GB/s). Same model, same math, and a spread of more than 130× decided entirely by wiring.

This page is about that wiring: the physical links, the shapes they're arranged in, the collective operations that run over them, and the libraries that pick the algorithms.

The Bandwidth Cliff

Every byte a GPU exchanges lives somewhere on a steep slope. On-package HBM moves terabytes per second; every step away from the die (to a sibling GPU, out of the chassis, across the network) costs roughly an order of magnitude. Using the figures on this page, that is 1.8 TB/s per GPU over NVLink 5, 64 GB/s over a PCIe Gen5 ×16 link, and 12.5 GB/s over 100-gigabit Ethernet.

The cliff explains nearly every design decision in distributed training. Tensor parallelism stays inside an NVLink domain because it communicates constantly; data parallelism tolerates crossing nodes because it synchronizes once per step; hierarchical AllReduce exists so traffic falls down the cliff exactly once instead of N times. If you remember one idea from this page, make it this one. (For the top of the cliff, see the HBM page.)
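To make the slope concrete, here is a minimal sketch that turns the link rates quoted on this page into transfer times for a fixed payload. The function and table names are illustrative, and the model ignores latency and protocol overhead, so the results are best read as lower bounds rather than measurements.

```python
# Peak-rate transfer time for a payload at each tier of the cliff.
# Rates are the figures quoted on this page, in GB/s.
LINK_RATES_GB_PER_S = {
    "NVLink 5 (per-GPU aggregate)": 1800.0,
    "InfiniBand XDR port (800 Gb/s)": 100.0,
    "PCIe Gen5 x16": 64.0,
    "100 Gb/s Ethernet": 12.5,
}


def transfer_ms(payload_gb: float, rate_gb_per_s: float) -> float:
    """Milliseconds to move payload_gb at rate_gb_per_s, ignoring latency."""
    return payload_gb / rate_gb_per_s * 1000.0


def cliff_table(payload_gb: float = 1.0) -> dict:
    """Transfer time in ms for every link in LINK_RATES_GB_PER_S."""
    return {
        name: transfer_ms(payload_gb, rate)
        for name, rate in LINK_RATES_GB_PER_S.items()
    }


if __name__ == "__main__":
    for name, ms in cliff_table().items():
        print(f"{name:32s} {ms:8.2f} ms")
```

Changing `payload_gb` to your model's gradient size gives a quick first estimate of how much each tier would cost per synchronization.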

Interconnect Technologies

PCIe: the universal floor

Every GPU speaks PCIe — it's how the CPU loads data, and on systems without NVLink it's also how GPUs reach each other.

| Generation | Bandwidth (×16, each direction) | Mainstream since |
| --- | --- | --- |
| PCIe Gen3 | 16 GB/s | 2012 |
| PCIe Gen4 | 32 GB/s | 2019 |
| PCIe Gen5 | 64 GB/s | 2023 |
| PCIe Gen6 | 128 GB/s | 2026 (servers) |

Universal, vendor-neutral — and shared. GPU-to-GPU traffic over PCIe contends with storage and network devices on the same fabric, which is why even Gen6 is the cliff's middle ledge, not its top.

NVLink: the point-to-point GPU link

NVLink is NVIDIA's point-to-point GPU link. The table lists both the per-link rate and the number that actually matters: the per-GPU aggregate with all links in use.

| Generation | GPU | Links × per-link | Per-GPU aggregate |
| --- | --- | --- | --- |
| NVLink 1.0 | P100 (2016) | 4 × 40 GB/s | 160 GB/s |
| NVLink 2.0 | V100 (2017) | 6 × 50 GB/s | 300 GB/s |
| NVLink 3.0 | A100 (2020) | 12 × 50 GB/s | 600 GB/s |
| NVLink 4.0 | H100 (2022) | 18 × 50 GB/s | 900 GB/s |
| NVLink 5.0 | B200 (2024–26) | 18 × 100 GB/s | 1.8 TB/s |

All figures bidirectional. Latency sits around a microsecond, and transfers bypass the CPU entirely. The catch: NVLink is NVIDIA-only, and since Ada (RTX 40-series) it's gone from consumer cards entirely — multi-GPU on gaming hardware means PCIe, full stop.
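Because the NVLink figures are bidirectional while the PCIe table is per direction, a like-for-like comparison needs one extra step. The sketch below recomputes the aggregates from links × per-link rate and halves them, assuming the bidirectional figure splits evenly between the two directions. All names are illustrative.

```python
# (links, per-link bidirectional GB/s) as listed in the NVLink table.
NVLINK_GENERATIONS = {
    "NVLink 1.0": (4, 40),
    "NVLink 2.0": (6, 50),
    "NVLink 3.0": (12, 50),
    "NVLink 4.0": (18, 50),
    "NVLink 5.0": (18, 100),
}

# PCIe Gen5 x16 rate in each direction, from the PCIe table.
PCIE_GEN5_X16_PER_DIRECTION = 64


def aggregate_gb_per_s(generation: str) -> int:
    """Per-GPU bidirectional aggregate: links times per-link rate."""
    links, per_link = NVLINK_GENERATIONS[generation]
    return links * per_link


def per_direction_gb_per_s(generation: str) -> float:
    """One-way rate, assuming an even split of the bidirectional figure."""
    return aggregate_gb_per_s(generation) / 2


def advantage_over_pcie_gen5(generation: str) -> float:
    """How many times faster one direction of NVLink is than PCIe Gen5 x16."""
    return per_direction_gb_per_s(generation) / PCIE_GEN5_X16_PER_DIRECTION


if __name__ == "__main__":
    for gen in NVLINK_GENERATIONS:
        print(gen, aggregate_gb_per_s(gen), f"{advantage_over_pcie_gen5(gen):.1f}x")
```

On that basis NVLink 5 is about 14× a PCIe Gen5 ×16 slot per direction, rather than the 28× the headline numbers suggest.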

NVSwitch: from node to rack

Point-to-point links don't scale to "everyone talks to everyone." NVSwitch is dedicated switch silicon that gives full bisection bandwidth — every GPU pair communicates at the full per-GPU rate simultaneously.

  • HGX H100 node: 8 GPUs through 4 NVSwitch chips — 900 GB/s any-to-any.
  • GB200 NVL72 rack: the 2026 flagship dissolves the node boundary, wiring 72 Blackwell GPUs across 18 compute trays through 9 NVLink-switch trays into one NVLink domain with 130 TB/s of aggregate bandwidth. Each NVLink switch tray carries 144 ports at 14.4 TB/s, split across its two switch chips. The "node" is now the rack.
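The rack-level numbers can be cross-checked with a few multiplications. This sketch assumes the 144-port, 14.4 TB/s figure describes one switch tray (my reading of how the figure is counted, not something the text above settles), and the constant names are illustrative.

```python
# GB200 NVL72 figures quoted above, in GB/s where a rate is involved.
GPUS = 72
LINKS_PER_GPU = 18          # NVLink 5 links per GPU
PER_GPU_GB_PER_S = 1800     # NVLink 5 per-GPU aggregate
SWITCH_TRAYS = 9
PORTS_PER_TRAY = 144        # assumption: the 144-port figure is per tray
TRAY_GB_PER_S = 14400       # assumption: 14.4 TB/s is per tray


def nvl72_totals() -> dict:
    """Count links and bandwidth from the GPU side and from the switch side."""
    return {
        "gpu_links": GPUS * LINKS_PER_GPU,
        "switch_ports": SWITCH_TRAYS * PORTS_PER_TRAY,
        "gpu_side_tb_per_s": GPUS * PER_GPU_GB_PER_S / 1000,
        "switch_side_tb_per_s": SWITCH_TRAYS * TRAY_GB_PER_S / 1000,
    }


if __name__ == "__main__":
    for key, value in nvl72_totals().items():
        print(f"{key:22s} {value}")
```

Both sides land on 1,296 links and 129.6 TB/s, which rounds to the 130 TB/s headline figure.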

InfiniBand, RoCE, and EFA: crossing machines

Past the rack, traffic rides the network:

| Technology | Per-port rate | In GB/s | Notes |
| --- | --- | --- | --- |
| InfiniBand HDR | 200 Gb/s | 25 | Common in A100-era clusters |
| InfiniBand NDR | 400 Gb/s | 50 | H100-era standard |
| InfiniBand XDR | 800 Gb/s | 100 | Quantum-X800, Blackwell-era |
| RoCE v2 | 100–800 Gb/s | varies | RDMA over (lossless) Ethernet |
| AWS EFA (v3) | up to 3,200 Gb/s | 400 | Aggregate per P5en instance |

GPUDirect RDMA lets the NIC read and write GPU memory directly, with no CPU bounce buffer, and modern training nodes pair one NIC per GPU to widen the pipe. Switch-side, Quantum-X800 supports SHARP in-network reduction: the switch itself sums tensors mid-flight, and NCCL 2.27 uses it on both InfiniBand and NVLink fabrics.

GPU Topologies

How your GPUs are wired sets your scaling ceiling before you write a line of code.

Four regimes, in ascending order of money:

  1. Consumer PCIe box — two cards sharing the PCIe fabric. Fine for experimentation; AllReduce tops out at the slot bandwidth, and there is no NVLink escape hatch on 40/50-series cards.
  2. HGX 8-GPU node — the workhorse. NVSwitch makes topology invisible inside the node: no lucky or unlucky GPU pairs.
  3. GB200 NVL72 rack — 72 GPUs, one NVLink domain. Models that needed multi-node tensor parallelism now fit inside a single fabric running at NVLink speed.
  4. Multi-node cluster — NVLink inside each node or rack, InfiniBand/EFA between them, and an ~18× headline bandwidth gap at the boundary (1.8 TB/s bidirectional in-rack vs 100 GB/s each way per XDR port, so closer to 9× like for like) that every framework works around with hierarchy.

Collective Operations

Distributed training speaks a small vocabulary of collectives.

What each is for, with the PyTorch one-liner:

AllReduce — everyone contributes, everyone gets the sum. The heartbeat of data parallelism; DDP calls it on your gradients after every backward pass.

torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)

Broadcast — one rank sends, all receive. Weight initialization at startup.

torch.distributed.broadcast(tensor, src=0)

AllGather — everyone shares their shard, everyone ends with all shards. Tensor-parallel activations; ZeRO parameter gathering.

torch.distributed.all_gather(tensor_list, tensor)

ReduceScatter — sum like AllReduce, but each rank keeps only its slice. ZeRO gradient sharding; exactly half of a ring AllReduce.

torch.distributed.reduce_scatter(output, input_list)

Point-to-point — direct send/recv between two ranks. Pipeline parallelism passing activations stage to stage.

torch.distributed.send(tensor, dst=next_rank)
torch.distributed.recv(tensor, src=prev_rank)
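The semantics of the four collectives are easier to pin down with a reference model that has no GPUs in it at all. The sketch below treats each rank's buffer as a plain Python list and returns what every rank would hold afterwards. The function names mirror the PyTorch calls but are stand-ins written for illustration, not the library's implementation, and `reduce_scatter` assumes the buffer length divides evenly by the number of ranks.

```python
from typing import List

Vector = List[int]


def all_reduce(buffers: List[Vector]) -> List[Vector]:
    """Every rank ends with the element-wise sum of all buffers."""
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


def broadcast(buffers: List[Vector], src: int) -> List[Vector]:
    """Every rank ends with a copy of rank src's buffer."""
    return [list(buffers[src]) for _ in buffers]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends with all shards concatenated in rank order."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]


def reduce_scatter(buffers: List[Vector]) -> List[Vector]:
    """Sum like all_reduce, but rank r keeps only slice r of the result."""
    n = len(buffers)
    total = [sum(values) for values in zip(*buffers)]
    chunk = len(total) // n  # assumes len(total) is divisible by n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


if __name__ == "__main__":
    ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
    print(reduce_scatter(ranks))
    print(all_gather(reduce_scatter(ranks)) == all_reduce(ranks))
```

The last line checks the identity the text relies on: a ReduceScatter followed by an AllGather produces the same result as an AllReduce.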

In ring AllReduce, each GPU sends 2(N−1)/N times the message size: just under 2× the data, however many GPUs join the ring. That bound on per-GPU traffic is why ring AllReduce is bandwidth-optimal for large messages, while its 2(N−1) sequential steps make latency-bound small messages prefer tree algorithms. NCCL picks per message size; you rarely should.
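A step-by-step simulation shows where the 2(N−1)/N factor comes from. This is a minimal single-process sketch of the textbook ring algorithm (a reduce-scatter phase followed by an all-gather phase), not NCCL's implementation. It assumes every buffer has the same length and that the length divides evenly into N chunks.

```python
from typing import List, Tuple


def ring_all_reduce(buffers: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Simulate ring AllReduce; return final buffers and elements sent per rank."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = size // n
    data = [list(b) for b in buffers]
    sent_per_rank = 0

    def span(c: int) -> range:
        """Element indices covered by chunk c."""
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: at each step rank r sends chunk (r - step) to
    # its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        outgoing = [[data[r][i] for i in span((r - step) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for offset, i in enumerate(span((r - step) % n)):
                data[dst][i] += outgoing[r][offset]
        sent_per_rank += chunk

    # Phase 2, all-gather: rank r now owns the finished chunk (r + 1); the
    # finished chunks circulate and overwrite the stale copies.
    for step in range(n - 1):
        outgoing = [[data[r][i] for i in span((r + 1 - step) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for offset, i in enumerate(span((r + 1 - step) % n)):
                data[dst][i] = outgoing[r][offset]
        sent_per_rank += chunk

    return data, sent_per_rank


if __name__ == "__main__":
    ranks = [[r * 10 + i for i in range(8)] for r in range(4)]
    result, sent = ring_all_reduce(ranks)
    print(result[0], sent)
```

With 4 ranks and 8 elements each, every rank sends 12 elements in total, which is 2(4−1)/4 × 8. The first phase on its own is the ReduceScatter described above.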

Communication Libraries

| Library | Vendor | Hardware | Topology-aware | Best for |
| --- | --- | --- | --- | --- |
| NCCL | NVIDIA | CUDA GPUs | Yes | Anything NVIDIA — the default |
| RCCL | AMD | ROCm GPUs | Yes | MI300-class clusters |
| oneCCL | Intel | Intel GPUs/CPUs | Yes | Intel Data Center GPUs |
| Gloo | Meta | CPU + GPU | No | CPU fallback, development |
| MPI | community | everything | varies | HPC sites with MPI infrastructure |
| MSCCL | Microsoft | CUDA GPUs | programmable | Custom algorithms, research |

NCCL detects your topology (NVLink, PCIe, InfiniBand), then picks ring, tree, or SHARP-offloaded algorithms per message. As of 2.27 it also runs SHARP reductions on NVLink fabrics and offers symmetric memory across domains up to a full NVL72.

torch.distributed.init_process_group(backend='nccl')

RCCL is NCCL's API twin for AMD — PyTorch even selects it through the 'nccl' backend name on ROCm. oneCCL covers Intel (backend='ccl' after importing oneccl_bindings_for_pytorch). Gloo (backend='gloo') works everywhere but optimizes nothing — keep it for CPUs and debugging. MPI integrates with schedulers and existing HPC tooling. MSCCL lets you write custom collectives when the built-in algorithms don't fit your topology.
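The vendor-to-backend mapping described above fits in a small lookup. `pick_backend` is a hypothetical helper, not a PyTorch function; the backend strings are the ones named in the text, and falling back to Gloo for anything unrecognized is my assumption for the "mixed or CPU" case.

```python
# Backend names as passed to torch.distributed.init_process_group.
BACKEND_BY_VENDOR = {
    "nvidia": "nccl",
    "amd": "nccl",    # RCCL is selected through the 'nccl' name on ROCm
    "intel": "ccl",   # needs oneccl_bindings_for_pytorch imported first
    "cpu": "gloo",
}


def pick_backend(vendor: str) -> str:
    """Return the backend string for a vendor, defaulting to Gloo."""
    return BACKEND_BY_VENDOR.get(vendor.strip().lower(), "gloo")


if __name__ == "__main__":
    for vendor in ("NVIDIA", "AMD", "Intel", "cpu", "something-else"):
        print(f"{vendor:15s} -> {pick_backend(vendor)}")
```

In a training script the result would be passed straight through, as in `torch.distributed.init_process_group(backend=pick_backend("amd"))`.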

Choosing Your Setup

  1. Vendor picks the library: NVIDIA → NCCL, AMD → RCCL, Intel → oneCCL, mixed or CPU → Gloo/MPI.
  2. Scale picks the fabric: 2 GPUs → PCIe is survivable; 4–8 → you want an NVSwitch node; dozens → NVL72 or multi-node InfiniBand; hundreds+ → multi-node is mandatory, design for the hierarchy.
  3. Consumer hardware reality check: no NVLink since the 3090 — a dual-4090/5090 box is PCIe-only, and that's a constraint to design around, not a bug to fix.

Performance: Living With the Cliff

Bandwidth vs latency. Large transfers (>1 MB) are bandwidth-bound — interconnect choice dominates. Small transfers (<100 KB) are latency-bound — even PCIe looks fine, and the fix is batching, not faster links. DDP's gradient bucketing (bucket_cap_mb, default 25) exists precisely to turn many small AllReduces into few large ones.
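The split between latency-bound and bandwidth-bound transfers can be sketched with the standard alpha-beta cost model: each message pays a fixed latency alpha plus its size divided by bandwidth. The ring formula below follows the textbook algorithm. The tree formula is a deliberately simple, non-pipelined reduce-then-broadcast and is cruder than NCCL's real tree algorithm. The 5 µs latency and 100 GB/s bandwidth in the demo are hypothetical values, not measurements.

```python
import math


def ring_time_s(n: int, size_bytes: float, alpha_s: float, bw_bytes_per_s: float) -> float:
    """Ring AllReduce: 2(n-1) steps, each moving size/n bytes."""
    steps = 2 * (n - 1)
    return steps * alpha_s + (steps / n) * size_bytes / bw_bytes_per_s


def tree_time_s(n: int, size_bytes: float, alpha_s: float, bw_bytes_per_s: float) -> float:
    """Non-pipelined binary tree: reduce up, then broadcast down (n >= 2)."""
    hops = 2 * math.ceil(math.log2(n))
    return hops * (alpha_s + size_bytes / bw_bytes_per_s)


if __name__ == "__main__":
    n, alpha, bandwidth = 64, 5e-6, 100e9  # hypothetical cluster parameters
    for label, size in (("8 KB", 8 * 1024), ("1 GB", 10**9)):
        ring = ring_time_s(n, size, alpha, bandwidth)
        tree = tree_time_s(n, size, alpha, bandwidth)
        winner = "ring" if ring < tree else "tree"
        print(f"{label:5s} ring={ring * 1e3:9.3f} ms  tree={tree * 1e3:9.3f} ms  -> {winner}")
```

Under these assumptions the tree wins the 8 KB case because it pays the fixed latency 12 times instead of 126, and the ring wins the 1 GB case because it moves about 2× the payload instead of 12×.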

Overlap communication with compute. While layer N's gradients sync, layer N−1's are still being computed. Bigger buckets use bandwidth better; smaller buckets start syncing sooner. It's a tunable, not a constant.
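The bucket-size trade-off is easier to see with a toy model of bucketing. The sketch below groups per-parameter gradient sizes greedily, walking the parameters in reverse because backward produces the last layer's gradients first. It is a simplified illustration under that assumption, not DDP's actual bucketing logic, and `bucketize` is a hypothetical name.

```python
from typing import List


def bucketize(grad_sizes_mb: List[float], bucket_cap_mb: float = 25.0) -> List[List[int]]:
    """Group parameter indices into buckets, last parameter first.

    A bucket is closed as soon as adding the next gradient would exceed
    bucket_cap_mb; a single oversized gradient gets a bucket of its own.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    current_mb = 0.0
    for index in reversed(range(len(grad_sizes_mb))):
        size = grad_sizes_mb[index]
        if current and current_mb + size > bucket_cap_mb:
            buckets.append(current)
            current, current_mb = [], 0.0
        current.append(index)
        current_mb += size
    if current:
        buckets.append(current)
    return buckets


if __name__ == "__main__":
    sizes = [10, 10, 10, 10, 30, 5]  # MB per parameter, first layer first
    print(bucketize(sizes, bucket_cap_mb=25))
    print(bucketize(sizes, bucket_cap_mb=100))
```

With a 25 MB cap the six gradients form four buckets, so the first AllReduce can start after only the last parameter's gradient is ready. With a 100 MB cap they form one bucket, which uses the link better but cannot start until backward has finished.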

Respect the hierarchy. Reduce within the NVLink domain first, cross the network once with the result. DeepSpeed and Megatron do this automatically; manual setups use separate intra- and inter-node process groups.
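For the manual route, the first job is working out which ranks belong in which group. The sketch below assumes ranks are numbered node by node, with `gpus_per_node` consecutive ranks per machine, and builds one intra-node group per node plus one inter-node group per local GPU index. The helper names are hypothetical, and the three-stage sum is a scalar stand-in for the real tensor collectives.

```python
from typing import List, Tuple


def build_groups(world_size: int, gpus_per_node: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (intra_node_groups, inter_node_groups) as lists of ranks."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node
    intra = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]
    inter = [
        list(range(local, world_size, gpus_per_node))
        for local in range(gpus_per_node)
    ]
    return intra, inter


def hierarchical_sum(values: List[int], gpus_per_node: int) -> List[int]:
    """Scalar model of hierarchical AllReduce over one value per rank."""
    intra, _ = build_groups(len(values), gpus_per_node)
    # Stage 1: reduce inside each node (NVLink traffic only).
    node_sums = [sum(values[rank] for rank in ranks) for ranks in intra]
    # Stage 2: one reduction across nodes (the only network crossing).
    total = sum(node_sums)
    # Stage 3: every rank in every node receives the result.
    return [total] * len(values)


if __name__ == "__main__":
    intra, inter = build_groups(world_size=8, gpus_per_node=4)
    print("intra:", intra)
    print("inter:", inter)
    print(hierarchical_sum(list(range(8)), gpus_per_node=4))
```

Each rank list can then be handed to `torch.distributed.new_group(ranks=...)`; note that every process has to make those calls for every group, in the same order, even for groups it does not belong to.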

| Practice | Why | How |
| --- | --- | --- |
| Verify topology detection | Misdetection silently costs 10× | NCCL_DEBUG=INFO python train.py |
| Inspect your actual wiring | Marketing slides ≠ your motherboard | nvidia-smi topo -m |
| Profile the overlap | Find sync stalls visually | nsys profile --trace=cuda,nvtx python train.py |
| Tune bucketing | Balance bandwidth vs overlap | DDP(model, bucket_cap_mb=25) |
| Pin library versions | Mixed NCCL versions hang silently | containerized environments (NGC) |
| Gradient compression | Cuts cross-node AllReduce traffic when bandwidth-limited | ddp_comm_hooks.powerSGD_hook |

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| PCIe bottleneck | Scaling dies beyond 2 consumer GPUs | Check nvidia-smi topo -m; rent NVSwitch nodes |
| No GPUDirect | Gradients bounce through CPU memory | Confirm NET/IB (not NET/Socket) in NCCL_DEBUG output |
| Topology-blind scheduling | Cluster scatters your 8 GPUs across racks | Locality constraints: --gres=gpu:8 --exclusive |
| Tiny messages | High AllReduce latency, idle links | Increase bucket_cap_mb; fuse small tensors |
