Multi-GPU Communication: NVLink, PCIe, and NCCL

A single GPU never waits for a network. The moment you split a model or a batch across two of them, your training speed stops being a compute number and becomes a plumbing number: how fast can device A's gradients reach device B? Moving 1 GB of gradients takes about 0.6 ms over NVLink 5 (1.8 TB/s), 16 ms over a PCIe Gen5 ×16 link (64 GB/s), and 80 ms over a 100-gigabit Ethernet link (12.5 GB/s). Same model, same math, and a spread of roughly 140× decided entirely by wiring.

This page is about that wiring: the physical links, the shapes they're arranged in, the collective operations that run over them, and the libraries that pick the algorithms.

The Bandwidth Cliff

Every byte a GPU exchanges lives somewhere on a steep slope. On-package HBM moves terabytes per second; every step away from the die (to a sibling GPU, out of the chassis, across the network) costs roughly an order of magnitude.

The cliff explains nearly every design decision in distributed training. Tensor parallelism stays inside an NVLink domain because it communicates constantly; data parallelism tolerates crossing nodes because it synchronizes once per step; hierarchical AllReduce exists so traffic falls down the cliff exactly once instead of N times. If you remember one picture from this page, make it this one. (For the top of the cliff, see the HBM page.)
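To put numbers on the cliff, here is a minimal sketch that divides a 1 GB payload by each link's bandwidth. The bandwidths are the ones quoted on this page; the helper name `transfer_ms` is made up for illustration, and the model ignores latency and protocol overhead, so treat the results as best-case figures. Computed from unrounded values, the NVLink-to-Ethernet ratio comes out near 144×.

```python
# Ideal time to move a 1 GB payload over each link (no latency, no overhead).
PAYLOAD_GB = 1.0

# Bandwidths in GB/s, taken from the figures quoted on this page.
LINKS_GB_PER_S = {
    "NVLink 5 (per-GPU aggregate)": 1800.0,
    "PCIe Gen5 x16": 64.0,
    "100 Gb Ethernet": 100.0 / 8,  # 100 Gb/s is 12.5 GB/s
}


def transfer_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Best-case transfer time in milliseconds: size divided by bandwidth."""
    return payload_gb / bandwidth_gb_per_s * 1000.0


times = {name: transfer_ms(PAYLOAD_GB, bw) for name, bw in LINKS_GB_PER_S.items()}
for name, ms in times.items():
    print(f"{name:30s} {ms:7.2f} ms")

# Ratio between the slowest and the fastest link in the table.
spread = times["100 Gb Ethernet"] / times["NVLink 5 (per-GPU aggregate)"]
print(f"spread: {spread:.0f}x")
```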

Interconnect Technologies

PCIe: the universal floor

Every GPU speaks PCIe — it's how the CPU loads data, and on systems without NVLink it's also how GPUs reach each other.

Generation	Bandwidth (×16, each direction)	Mainstream since
PCIe Gen3	16 GB/s	2012
PCIe Gen4	32 GB/s	2019
PCIe Gen5	64 GB/s	2023
PCIe Gen6	128 GB/s	2026 (servers)

Universal, vendor-neutral — and shared. GPU-to-GPU traffic over PCIe contends with storage and network devices on the same fabric, which is why even Gen6 is the cliff's middle ledge, not its top.

NVLink: the scale-up fabric

NVLink is NVIDIA's point-to-point GPU link. The table lists both the per-link rate and the number that actually matters — the per-GPU aggregate with all links in use:

Generation	GPU	Links × per-link	Per-GPU aggregate
NVLink 1.0	P100 (2016)	4 × 40 GB/s	160 GB/s
NVLink 2.0	V100 (2017)	6 × 50 GB/s	300 GB/s
NVLink 3.0	A100 (2020)	12 × 50 GB/s	600 GB/s
NVLink 4.0	H100 (2022)	18 × 50 GB/s	900 GB/s
NVLink 5.0	B200 (2024–26)	18 × 100 GB/s	1.8 TB/s

All figures bidirectional. Latency sits around a microsecond, and transfers bypass the CPU entirely. The catch: NVLink is NVIDIA-only, and since Ada (RTX 40-series) it's gone from consumer cards entirely — multi-GPU on gaming hardware means PCIe, full stop.
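The aggregate column is just links multiplied by per-link rate, which is easy to check. This sketch reproduces the table's arithmetic; the list and function names are illustrative, and the figures are the bidirectional ones from the table above.

```python
# (generation, GPU, link count, per-link GB/s) as listed in the NVLink table.
NVLINK_GENERATIONS = [
    ("NVLink 1.0", "P100", 4, 40),
    ("NVLink 2.0", "V100", 6, 50),
    ("NVLink 3.0", "A100", 12, 50),
    ("NVLink 4.0", "H100", 18, 50),
    ("NVLink 5.0", "B200", 18, 100),
]


def aggregate_gb_per_s(links: int, per_link_gb_per_s: int) -> int:
    """Per-GPU aggregate bandwidth with every link in use."""
    return links * per_link_gb_per_s


aggregates = {
    gen: aggregate_gb_per_s(links, rate)
    for gen, _gpu, links, rate in NVLINK_GENERATIONS
}

for gen, gpu, links, rate in NVLINK_GENERATIONS:
    print(f"{gen} ({gpu}): {links} x {rate} GB/s = {aggregates[gen]} GB/s")

# Growth from the first to the latest generation in the table.
growth = aggregates["NVLink 5.0"] / aggregates["NVLink 1.0"]
print(f"NVLink 1.0 -> 5.0: {growth:.2f}x")
```

Note where the growth comes from: generations 2 through 4 added links at a constant 50 GB/s each, while NVLink 5 kept 18 links and doubled the per-link rate.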

NVSwitch: from node to rack

Point-to-point links don't scale to "everyone talks to everyone." NVSwitch is dedicated switch silicon that gives full bisection bandwidth — every GPU pair communicates at the full per-GPU rate simultaneously.

HGX H100 node: 8 GPUs through 4 NVSwitch chips — 900 GB/s any-to-any.
GB200 NVL72 rack: the 2026 flagship dissolves the node boundary. 72 Blackwell GPUs across 18 compute trays are wired through 9 NVLink switch trays into one NVLink domain with 130 TB/s of aggregate bandwidth (72 × 1.8 TB/s). Each switch tray carries 144 NVLink 5 ports for 14.4 TB/s, so the 9 trays terminate all 72 × 18 = 1,296 GPU links. The "node" is now the rack.
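The rack-level numbers follow from the per-GPU ones, and a few lines of arithmetic show how. This sketch uses only figures from this page (72 GPUs, 18 links per GPU at 100 GB/s, 9 switch trays) and assumes the GPU links are spread evenly across the switch trays; the variable names are illustrative.

```python
GPUS = 72
LINKS_PER_GPU = 18          # NVLink 5 row of the table
PER_LINK_GB_PER_S = 100     # bidirectional, per link
SWITCH_TRAYS = 9

# Per-GPU and whole-domain aggregate bandwidth, in TB/s.
per_gpu_tb_per_s = LINKS_PER_GPU * PER_LINK_GB_PER_S / 1000
domain_tb_per_s = GPUS * per_gpu_tb_per_s

# Every GPU link has to land on a switch port somewhere.
gpu_link_endpoints = GPUS * LINKS_PER_GPU

# Assumption: links are divided evenly across the switch trays.
ports_per_tray = gpu_link_endpoints // SWITCH_TRAYS
tray_tb_per_s = ports_per_tray * PER_LINK_GB_PER_S / 1000

print(f"per GPU:        {per_gpu_tb_per_s:.1f} TB/s")
print(f"whole domain:   {domain_tb_per_s:.1f} TB/s")
print(f"GPU links:      {gpu_link_endpoints}")
print(f"ports per tray: {ports_per_tray}")
print(f"per tray:       {tray_tb_per_s:.1f} TB/s")
```

The quoted 130 TB/s is the domain figure rounded up, and the 144-port, 14.4 TB/s figure matches what each of the 9 switch trays has to terminate.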

InfiniBand, RoCE, and EFA: crossing machines

Past the rack, traffic rides the network:

Technology	Per-port rate	In GB/s	Notes
InfiniBand HDR	200 Gb/s	25	Common in A100-era clusters
InfiniBand NDR	400 Gb/s	50	H100-era standard
InfiniBand XDR	800 Gb/s	100	Quantum-X800, Blackwell-era
RoCE v2	100–800 Gb/s	varies	RDMA over (lossless) Ethernet
AWS EFA (v3)	up to 3,200 Gb/s	400	Aggregate per P5en instance

GPUDirect RDMA lets the NIC read and write GPU memory directly — no CPU bounce buffer — and modern training nodes pair one NIC per GPU to widen the pipe. Switch-side, Quantum-X800 adds SHARP in-network reduction: the switch itself sums tensors mid-flight, and NCCL 2.27 uses it on both InfiniBand and NVLink fabrics.

GPU Topologies

How your GPUs are wired sets your scaling ceiling before you write a line of code.

Four regimes, in ascending order of money:

Consumer PCIe box — two cards sharing the PCIe fabric. Fine for experimentation; AllReduce tops out at the slot bandwidth, and there is no NVLink escape hatch on 40/50-series cards.
HGX 8-GPU node — the workhorse. NVSwitch makes topology invisible inside the node: no lucky or unlucky GPU pairs.
GB200 NVL72 rack — 72 GPUs, one NVLink domain. Models that needed multi-node tensor parallelism now fit inside a single fabric running at NVLink speed.
Multi-node cluster — NVLink inside each node or rack, InfiniBand/EFA between them, and an ~18× bandwidth gap at the boundary (1.8 TB/s in-rack vs 100 GB/s per XDR port) that every framework works around with hierarchy.
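The boundary gap is not new to Blackwell. This sketch computes it for three eras by converting the network port rate from Gb/s to GB/s and dividing the per-GPU NVLink aggregate by it. The GPU and InfiniBand pairings follow the "era" notes in the network table; real clusters mix generations, so read the pairings as an assumption.

```python
# (label, per-GPU NVLink aggregate in GB/s, InfiniBand port rate in Gb/s)
ERAS = [
    ("A100 + InfiniBand HDR", 600, 200),
    ("H100 + InfiniBand NDR", 900, 400),
    ("B200 + InfiniBand XDR", 1800, 800),
]


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


def boundary_gap(nvlink_gb_per_s: float, port_gbit_per_s: float) -> float:
    """How many times faster the NVLink domain is than one network port."""
    return nvlink_gb_per_s / gbit_to_gbyte(port_gbit_per_s)


gaps = {label: boundary_gap(nvlink, port) for label, nvlink, port in ERAS}
for label, gap in gaps.items():
    print(f"{label}: {gap:.0f}x")
```

Both sides of the boundary have doubled together, so the gap has held at roughly 18 to 24× across generations. That is why the hierarchy workaround has outlived every hardware refresh.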

Collective Operations

Distributed training speaks a small vocabulary of collectives.

What each is for, with the PyTorch one-liner:

AllReduce — everyone contributes, everyone gets the sum. The heartbeat of data parallelism; DDP calls it on your gradients after every backward pass.

torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)

Broadcast — one rank sends, all receive. Weight initialization at startup.

torch.distributed.broadcast(tensor, src=0)

AllGather — everyone shares their shard, everyone ends with all shards. Tensor-parallel activations; ZeRO parameter gathering.

torch.distributed.all_gather(tensor_list, tensor)

ReduceScatter — sum like AllReduce, but each rank keeps only its slice. ZeRO gradient sharding; exactly half of a ring AllReduce.

torch.distributed.reduce_scatter(output, input_list)

Point-to-point — direct send/recv between two ranks. Pipeline parallelism passing activations stage to stage.

torch.distributed.send(tensor, dst=next_rank)
torch.distributed.recv(tensor, src=prev_rank)
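If the definitions above blur together, it can help to see what each collective computes, stripped of any transport. The sketch below is a single-process reference model in plain Python, not PyTorch: each rank's buffer is a list, and the functions return what every rank would hold afterwards. All function names here are illustrative stand-ins, not the `torch.distributed` API.

```python
from typing import List

Vector = List[float]


def all_reduce(bufs: List[Vector]) -> List[Vector]:
    """Every rank ends with the element-wise sum of all buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]


def broadcast(bufs: List[Vector], src: int) -> List[Vector]:
    """Every rank ends with a copy of the source rank's buffer."""
    return [list(bufs[src]) for _ in bufs]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends with all shards concatenated in rank order."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]


def reduce_scatter(bufs: List[Vector]) -> List[Vector]:
    """Sum like all_reduce, but rank r keeps only slice r of the result."""
    n = len(bufs)
    total = [sum(vals) for vals in zip(*bufs)]
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


# Four ranks, four elements each.
ranks = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]

print(all_reduce(ranks)[0])
print(reduce_scatter(ranks))
# ReduceScatter followed by AllGather gives the same result as AllReduce.
print(all_gather(reduce_scatter(ranks)) == all_reduce(ranks))
```

The last line is the identity that ring AllReduce is built on: reduce-scatter first, then all-gather the reduced slices.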

Ring AllReduce sends 2(N−1)/N times the buffer size from each GPU, a factor that stays below 2 no matter how large N grows. That is why ring AllReduce is bandwidth-optimal for large messages, while latency-bound small messages prefer tree algorithms, whose step count grows with log N rather than N. NCCL picks the algorithm per message size; you rarely need to override it.
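The 2(N−1)/N figure is easy to verify by simulating the ring and counting what each rank sends. This is a single-process sketch in plain Python, assuming the buffer length divides evenly by the number of ranks; `ring_all_reduce` is an illustrative name, and real implementations such as NCCL pipeline and overlap these steps rather than running them in lockstep.

```python
from typing import Callable, List, Tuple


def ring_all_reduce(bufs: List[List[float]]) -> Tuple[List[List[float]], List[int]]:
    """Simulate ring AllReduce; return final buffers and elements sent per rank."""
    n = len(bufs)
    size = len(bufs[0])
    if size % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = size // n
    data = [list(b) for b in bufs]
    sent = [0] * n

    def exchange(chunk_for_rank: Callable[[int], int], reduce: bool) -> None:
        # Snapshot every outgoing chunk first so all ranks send simultaneously.
        outgoing = []
        for r in range(n):
            lo = chunk_for_rank(r) * chunk
            outgoing.append((lo, data[r][lo:lo + chunk]))
            sent[r] += chunk
        # Deliver each chunk to the next rank around the ring.
        for r, (lo, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            for offset, value in enumerate(payload):
                if reduce:
                    data[dst][lo + offset] += value
                else:
                    data[dst][lo + offset] = value

    # Phase 1, reduce-scatter: after N-1 steps rank r owns the full sum of one chunk.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r - s) % n, reduce=True)
    # Phase 2, all-gather: circulate the finished chunks until everyone has all of them.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r + 1 - s) % n, reduce=False)
    return data, sent


N = 4
buffers = [[rank * 10 + i for i in range(8)] for rank in range(N)]
result, sent = ring_all_reduce(buffers)

expected = [sum(col) for col in zip(*buffers)]
print(all(row == expected for row in result))
print(sent[0] / len(buffers[0]), 2 * (N - 1) / N)
```

Each rank sends one chunk per step for 2(N−1) steps, and each chunk is 1/N of the buffer, which is where the formula comes from.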

Communication Libraries

Library	Vendor	Hardware	Topology-aware	Best for
NCCL	NVIDIA	CUDA GPUs	Yes	Anything NVIDIA — the default
RCCL	AMD	ROCm GPUs	Yes	MI300-class clusters
oneCCL	Intel	Intel GPUs/CPUs	Yes	Intel Data Center GPUs
Gloo	Meta	CPU + GPU	No	CPU fallback, development
MPI	community	everything	varies	HPC sites with MPI infrastructure
MSCCL	Microsoft	CUDA GPUs	programmable	Custom algorithms, research

NCCL detects your topology (NVLink, PCIe, InfiniBand), then picks ring, tree, or SHARP-offloaded algorithms per message. As of 2.27 it also runs SHARP reductions on NVLink fabrics and offers symmetric memory across domains up to a full NVL72.

torch.distributed.init_process_group(backend='nccl')

RCCL is NCCL's API twin for AMD — PyTorch even selects it through the 'nccl' backend name on ROCm. oneCCL covers Intel (backend='ccl' after importing oneccl_bindings_for_pytorch). Gloo (backend='gloo') works everywhere but optimizes nothing — keep it for CPUs and debugging. MPI integrates with schedulers and existing HPC tooling. MSCCL lets you write custom collectives when the built-in algorithms don't fit your topology.

Choosing Your Setup

Vendor picks the library: NVIDIA → NCCL, AMD → RCCL, Intel → oneCCL, mixed or CPU → Gloo/MPI.
Scale picks the fabric: 2 GPUs → PCIe is survivable; 4–8 → you want an NVSwitch node; dozens → NVL72 or multi-node InfiniBand; hundreds+ → multi-node is mandatory, design for the hierarchy.
Consumer hardware reality check: no NVLink since the 3090 — a dual-4090/5090 box is PCIe-only, and that's a constraint to design around, not a bug to fix.
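The rules above are simple enough to write down as a lookup. The sketch below maps a vendor to the PyTorch backend string named earlier on this page and a GPU count to a fabric suggestion. Both function names are hypothetical helpers, and the exact GPU-count cutoffs (2, 8, 72) are one reading of the ranges above, not hard limits.

```python
def pick_backend(vendor: str) -> str:
    """Return the torch.distributed backend name for a hardware vendor."""
    backends = {
        "nvidia": "nccl",
        "amd": "nccl",    # RCCL is selected through the 'nccl' name on ROCm
        "intel": "ccl",   # needs oneccl_bindings_for_pytorch imported first
        "cpu": "gloo",
        "mixed": "gloo",
    }
    try:
        return backends[vendor.lower()]
    except KeyError:
        raise ValueError(f"unknown vendor: {vendor!r}") from None


def suggest_fabric(num_gpus: int) -> str:
    """Suggest an interconnect tier for a GPU count (cutoffs are assumptions)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        return "single GPU, no interconnect needed"
    if num_gpus <= 2:
        return "PCIe"
    if num_gpus <= 8:
        return "NVSwitch node"
    if num_gpus <= 72:
        return "NVL72 rack or multi-node InfiniBand"
    return "multi-node cluster, designed around the hierarchy"


for vendor in ("NVIDIA", "AMD", "Intel", "CPU"):
    print(vendor, "->", pick_backend(vendor))
for count in (2, 8, 64, 512):
    print(count, "GPUs ->", suggest_fabric(count))
```

The returned backend string is what you would pass as `backend=` to `torch.distributed.init_process_group`.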

Performance: Living With the Cliff

Bandwidth vs latency. Large transfers (>1 MB) are bandwidth-bound — interconnect choice dominates. Small transfers (<100 KB) are latency-bound — even PCIe looks fine, and the fix is batching, not faster links. DDP's gradient bucketing (bucket_cap_mb, default 25) exists precisely to turn many small AllReduces into few large ones.
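A simple latency-plus-bandwidth cost model shows why bucketing pays off. The sketch below compares sending 100 MB as a thousand 100 KB messages against four 25 MB buckets over a 64 GB/s link. The 10 µs per-message latency is an illustrative assumption, not a measured value, and the model ignores overlap and the collective algorithm's own traffic factor.

```python
LATENCY_S = 10e-6               # ASSUMED fixed cost per message, for illustration only
BANDWIDTH_BYTES_PER_S = 64e9    # PCIe Gen5 x16, from the table on this page


def transfer_time_s(message_bytes: float) -> float:
    """Cost model: fixed latency plus size divided by bandwidth."""
    return LATENCY_S + message_bytes / BANDWIDTH_BYTES_PER_S


def total_time_s(num_messages: int, message_bytes: float) -> float:
    """Time to send num_messages back to back, each paying the latency."""
    return num_messages * transfer_time_s(message_bytes)


def latency_share(message_bytes: float) -> float:
    """Fraction of one message's time spent on fixed latency."""
    return LATENCY_S / transfer_time_s(message_bytes)


# The same 100 MB of gradients, sliced two ways.
unbucketed = total_time_s(1000, 100e3)   # 1000 messages of 100 KB
bucketed = total_time_s(4, 25e6)         # 4 buckets of 25 MB

print(f"1000 x 100 KB: {unbucketed * 1e3:.2f} ms "
      f"(latency share {latency_share(100e3):.0%})")
print(f"   4 x 25 MB:  {bucketed * 1e3:.2f} ms "
      f"(latency share {latency_share(25e6):.0%})")
```

Under these assumptions the small messages spend most of their time on latency, so a faster link would barely help them, while the bucketed transfer is almost entirely bandwidth-bound.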

Overlap communication with compute. While layer N's gradients sync, layer N−1's are still being computed. Bigger buckets use bandwidth better; smaller buckets start syncing sooner. It's a tunable, not a constant.
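To see the bucket-size trade-off concretely, here is a simplified greedy packer. It walks the layers in reverse, since the backward pass produces the last layer's gradients first, and closes a bucket once the next gradient would push it past the cap. This is a sketch of the idea behind `bucket_cap_mb`, not PyTorch's actual bucketing code; `pack_buckets` and the layer sizes are made up for illustration.

```python
from typing import List


def pack_buckets(grad_sizes_mb: List[float], cap_mb: float = 25.0) -> List[List[int]]:
    """Group layer indices into buckets, last layer first, up to cap_mb each."""
    buckets: List[List[int]] = []
    current: List[int] = []
    current_mb = 0.0
    for idx in reversed(range(len(grad_sizes_mb))):
        size = grad_sizes_mb[idx]
        # Close the bucket if adding this gradient would exceed the cap.
        if current and current_mb + size > cap_mb:
            buckets.append(current)
            current, current_mb = [], 0.0
        current.append(idx)
        current_mb += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical model: six layers with 10 MB of gradients each.
layer_grads_mb = [10.0] * 6

for cap in (10.0, 25.0, 60.0):
    buckets = pack_buckets(layer_grads_mb, cap_mb=cap)
    print(f"cap {cap:>4} MB -> {len(buckets)} buckets: {buckets}")
```

A small cap yields many buckets, so the first sync can start after a single layer but every bucket pays its own latency. A large cap yields one bucket that uses the link well but cannot start until the whole backward pass is done.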

Respect the hierarchy. Reduce within the NVLink domain first, cross the network once with the result. DeepSpeed and Megatron do this automatically; manual setups use separate intra- and inter-node process groups.
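The three-step shape of a hierarchical AllReduce can be sketched in a few lines. This is a single-process model in plain Python where `cluster[node][gpu]` is one GPU's gradient buffer; the function name is illustrative, and real frameworks typically use reduce-scatter and all-gather inside the node rather than a plain sum and copy.

```python
from typing import List, Tuple

Vector = List[float]


def hierarchical_all_reduce(cluster: List[List[Vector]]) -> Tuple[List[List[Vector]], int]:
    """Reduce inside each node, across nodes once, then copy back out.

    Returns the final buffers and the number of buffers that crossed the network.
    """
    # Step 1: sum within each node (stays inside the NVLink domain).
    node_sums = [[sum(vals) for vals in zip(*node)] for node in cluster]

    # Step 2: sum the per-node results (the only step that crosses the network).
    global_sum = [sum(vals) for vals in zip(*node_sums)]
    buffers_over_network = len(node_sums)

    # Step 3: hand the global result back to every GPU in every node.
    result = [[list(global_sum) for _ in node] for node in cluster]
    return result, buffers_over_network


# Two nodes with two GPUs each.
cluster = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
]
result, crossed = hierarchical_all_reduce(cluster)
total_gpus = sum(len(node) for node in cluster)

print(result[0][0])
print(f"{crossed} buffers crossed the network for {total_gpus} GPUs")
```

The network sees one contribution per node instead of one per GPU, and the saving grows with the number of GPUs behind each network boundary.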

Practice	Why	How
Verify topology detection	Misdetection silently costs 10×	`NCCL_DEBUG=INFO python train.py`
Inspect your actual wiring	Marketing slides ≠ your motherboard	`nvidia-smi topo -m`
Profile the overlap	Find sync stalls visually	`nsys profile --trace=cuda,nvtx python train.py`
Tune bucketing	Balance bandwidth vs overlap	`DDP(model, bucket_cap_mb=25)`
Pin library versions	Mixed NCCL versions hang silently	containerized environments (NGC)
Gradient compression	Cuts cross-node AllReduce traffic when bandwidth-limited	`ddp_comm_hooks.powerSGD_hook`

Pitfall	Symptom	Fix
PCIe bottleneck	Scaling dies beyond 2 consumer GPUs	Check `nvidia-smi topo -m`; rent NVSwitch nodes
No GPUDirect	Gradients bounce through CPU memory	Confirm `NET/IB` with `GDRDMA` (not `NET/Socket`) in `NCCL_DEBUG=INFO` output
Topology-blind scheduling	Cluster scatters your 8 GPUs across racks	Slurm locality constraints: `--nodes=1 --gres=gpu:8 --exclusive`
Tiny messages	High AllReduce latency, idle links	Increase `bucket_cap_mb`; fuse small tensors

Key Takeaways

Essential Multi-GPU Concepts

• The cliff: every hop from the die costs ~10× bandwidth

• NVLink 5: 1.8 TB/s per GPU; the NVL72 rack is one 72-GPU domain

• Consumer reality: no NVLink since the 3090 — PCIe only

• Ring AllReduce: each GPU sends 2(N−1)/N × the buffer size, which is bandwidth-optimal

• Hierarchy: reduce inside the NVLink domain, cross the network once

• NCCL: topology-aware by default — verify with NCCL_DEBUG=INFO

• Small messages: latency-bound — batch them, don't buy links

• In-network compute: SHARP sums tensors inside the switch itself

Multi-GPU performance is a plumbing problem wearing a math costume. The compute is identical on every GPU — what separates a 95%-efficient cluster from a 40%-efficient one is how rarely its bytes fall down the bandwidth cliff, and how gracefully they land when they must.