A single GPU never waits for a network. The moment you split a model or a batch across two of them, your training speed stops being a compute number and becomes a plumbing number: how fast can device A's gradients reach device B? At peak link rate, moving 1 GB of gradients takes about 0.6 ms over NVLink 5 (1.8 TB/s), 16 ms over a PCIe Gen5 ×16 link (64 GB/s), and 80 ms over a 100-gigabit Ethernet link (12.5 GB/s). Same model, same math, a 130× spread decided entirely by wiring.
This page is about that wiring: the physical links, the shapes they're arranged in, the collective operations that run over them, and the libraries that pick the algorithms.
The Bandwidth Cliff
Every byte a GPU exchanges lives somewhere on a steep slope. On-package HBM moves terabytes per second; every step away from the die (to a sibling GPU, out of the chassis, across the network) costs roughly an order of magnitude in bandwidth.
The cliff explains nearly every design decision in distributed training. Tensor parallelism stays inside an NVLink domain because it communicates constantly; data parallelism tolerates crossing nodes because it synchronizes once per step; hierarchical AllReduce exists so traffic falls down the cliff exactly once instead of N times. If you remember one picture from this page, make it this one. (For the top of the cliff, see the HBM page.)
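To put numbers on the cliff, here is a minimal sketch of the arithmetic behind the 1 GB figures quoted at the top of the page. It assumes an idealized transfer (size divided by peak bandwidth, no latency or protocol overhead); the dictionary and function names are illustrative.

```python
# Link bandwidths in GB/s, taken from the figures quoted on this page.
LINKS_GB_PER_S = {
    "NVLink 5 (per-GPU aggregate)": 1800.0,
    "PCIe Gen5 x16": 64.0,
    "100 Gb/s Ethernet": 100.0 / 8,  # 100 gigabits/s = 12.5 gigabytes/s
}


def transfer_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time in milliseconds: size / bandwidth only."""
    return size_gb / bandwidth_gb_per_s * 1000.0


times = {name: transfer_ms(1.0, bw) for name, bw in LINKS_GB_PER_S.items()}
for name, ms in times.items():
    print(f"{name:30s} {ms:7.2f} ms")
```

Real collectives land above these floors, since they add latency and move more than one copy of the data, but the ratios between the ledges hold up.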
Interconnect Technologies
PCIe: the universal floor
Every GPU speaks PCIe — it's how the CPU loads data, and on systems without NVLink it's also how GPUs reach each other.
| Generation | Bandwidth (×16, each direction) | Mainstream since |
|---|---|---|
| PCIe Gen3 | 16 GB/s | 2012 |
| PCIe Gen4 | 32 GB/s | 2019 |
| PCIe Gen5 | 64 GB/s | 2023 |
| PCIe Gen6 | 128 GB/s | 2026 (servers) |
Universal, vendor-neutral — and shared. GPU-to-GPU traffic over PCIe contends with storage and network devices on the same fabric, which is why even Gen6 is the cliff's middle ledge, not its top.
NVLink: the scale-up fabric
NVLink is NVIDIA's point-to-point GPU link. The table lists both the per-link rate and the number that actually matters — the per-GPU aggregate with all links in use:
| Generation | GPU | Links × per-link | Per-GPU aggregate |
|---|---|---|---|
| NVLink 1.0 | P100 (2016) | 4 × 40 GB/s | 160 GB/s |
| NVLink 2.0 | V100 (2017) | 6 × 50 GB/s | 300 GB/s |
| NVLink 3.0 | A100 (2020) | 12 × 50 GB/s | 600 GB/s |
| NVLink 4.0 | H100 (2022) | 18 × 50 GB/s | 900 GB/s |
| NVLink 5.0 | B200 (2024–26) | 18 × 100 GB/s | 1.8 TB/s |
All figures bidirectional. Latency sits around a microsecond, and transfers bypass the CPU entirely. The catch: NVLink is NVIDIA-only, and since Ada (RTX 40-series) it's gone from consumer cards entirely — multi-GPU on gaming hardware means PCIe, full stop.
NVSwitch: from node to rack
Point-to-point links don't scale to "everyone talks to everyone." NVSwitch is dedicated switch silicon that gives full bisection bandwidth — every GPU pair communicates at the full per-GPU rate simultaneously.
- HGX H100 node: 8 GPUs through 4 NVSwitch chips — 900 GB/s any-to-any.
- GB200 NVL72 rack: the 2026 flagship dissolves the node boundary. 72 Blackwell GPUs across 18 compute trays are wired through 9 NVLink switch trays into one NVLink domain with 130 TB/s of aggregate bandwidth. Each switch tray carries 144 NVLink 5 ports for 14.4 TB/s, and nine of them add up to the rack total. The "node" is now the rack.
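The 130 TB/s headline is just the per-GPU NVLink 5 rate multiplied out. A quick check, using only numbers from the NVLink table and the rack description above (the constant names are illustrative):

```python
GPUS_PER_RACK = 72
NVLINK5_PER_GPU_TB_S = 1.8  # per-GPU aggregate from the NVLink table
LINKS_PER_GPU = 18          # NVLink 5 links per Blackwell GPU

aggregate_tb_s = GPUS_PER_RACK * NVLINK5_PER_GPU_TB_S
total_gpu_links = GPUS_PER_RACK * LINKS_PER_GPU

print(f"{aggregate_tb_s:.1f} TB/s across {total_gpu_links} GPU-side links")
```

The result, 129.6 TB/s, is what the marketing figure rounds up to 130.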
InfiniBand, RoCE, and EFA: crossing machines
Past the rack, traffic rides the network:
| Technology | Per-port rate | In GB/s | Notes |
|---|---|---|---|
| InfiniBand HDR | 200 Gb/s | 25 | Common in A100-era clusters |
| InfiniBand NDR | 400 Gb/s | 50 | H100-era standard |
| InfiniBand XDR | 800 Gb/s | 100 | Quantum-X800, Blackwell-era |
| RoCE v2 | 100–800 Gb/s | varies | RDMA over (lossless) Ethernet |
| AWS EFA (v3) | up to 3,200 Gb/s | 400 | Aggregate per P5en instance |
GPUDirect RDMA lets the NIC read and write GPU memory directly, with no CPU bounce buffer, and modern training nodes pair one NIC per GPU to widen the pipe. Switch-side, Quantum-X800 supports SHARP in-network reduction: the switch itself sums tensors mid-flight, and NCCL 2.27 uses it on both InfiniBand and NVLink fabrics.
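The "In GB/s" column and the one-NIC-per-GPU point are both simple arithmetic. The sketch below converts the InfiniBand port rates from the table and multiplies by an assumed 8 GPUs per node (an HGX-style layout; your node may differ). Encoding overhead is ignored.

```python
def gbit_to_gbyte(rate_gbit_s: float) -> float:
    """Convert gigabits/s to gigabytes/s (8 bits per byte, no encoding overhead)."""
    return rate_gbit_s / 8.0


PORT_RATES_GBIT = {"HDR": 200, "NDR": 400, "XDR": 800}
GPUS_PER_NODE = 8  # assumption: one NIC per GPU in an 8-GPU node

for name, gbit in PORT_RATES_GBIT.items():
    per_port = gbit_to_gbyte(gbit)
    per_node = per_port * GPUS_PER_NODE
    print(f"InfiniBand {name}: {per_port:.0f} GB/s per port, {per_node:.0f} GB/s per node")
```

Eight NDR ports give 400 GB/s per node, which is the same aggregate the table lists for a P5en instance over EFA.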
GPU Topologies
How your GPUs are wired sets your scaling ceiling before you write a line of code.
Four regimes, in ascending order of money:
- Consumer PCIe box — two cards sharing the PCIe fabric. Fine for experimentation; AllReduce tops out at the slot bandwidth, and there is no NVLink escape hatch on 40/50-series cards.
- HGX 8-GPU node — the workhorse. NVSwitch makes topology invisible inside the node: no lucky or unlucky GPU pairs.
- GB200 NVL72 rack — 72 GPUs, one NVLink domain. Models that needed multi-node tensor parallelism now fit inside a single fabric running at NVLink speed.
- Multi-node cluster — NVLink inside each node or rack, InfiniBand/EFA between them, and an ~18× bandwidth gap at the boundary (1.8 TB/s in-rack vs 100 GB/s per XDR port) that every framework works around with hierarchy.
Collective Operations
Distributed training speaks a small vocabulary of collectives.
What each is for, with the PyTorch one-liner:
AllReduce — everyone contributes, everyone gets the sum. The heartbeat of data parallelism; DDP calls it on your gradients after every backward pass.
torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)
Broadcast — one rank sends, all receive. Weight initialization at startup.
torch.distributed.broadcast(tensor, src=0)
AllGather — everyone shares their shard, everyone ends with all shards. Tensor-parallel activations; ZeRO parameter gathering.
torch.distributed.all_gather(tensor_list, tensor)
ReduceScatter — sum like AllReduce, but each rank keeps only its slice. ZeRO gradient sharding; exactly half of a ring AllReduce.
torch.distributed.reduce_scatter(output, input_list)
Point-to-point — direct send/recv between two ranks. Pipeline parallelism passing activations stage to stage.
torch.distributed.send(tensor, dst=next_rank)
torch.distributed.recv(tensor, src=prev_rank)
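If the one-liners feel abstract, it can help to see what each collective does to the data. The following is a pure-Python model of the semantics only, with each rank's tensor represented as a plain list. It is not how NCCL or PyTorch implement anything, and the function names simply mirror the collectives described above.

```python
from typing import List

Rank = List[float]  # one rank's tensor, modeled as a flat list


def all_reduce(ranks: List[Rank]) -> List[Rank]:
    """Every rank ends with the elementwise sum of all inputs."""
    total = [sum(vals) for vals in zip(*ranks)]
    return [list(total) for _ in ranks]


def broadcast(ranks: List[Rank], src: int) -> List[Rank]:
    """Every rank ends with a copy of the source rank's tensor."""
    return [list(ranks[src]) for _ in ranks]


def all_gather(shards: List[Rank]) -> List[Rank]:
    """Every rank ends with all shards concatenated in rank order."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(ranks: List[Rank]) -> List[Rank]:
    """Sum like all_reduce, but rank i keeps only slice i of the result."""
    n = len(ranks)
    total = [sum(vals) for vals in zip(*ranks)]
    chunk = len(total) // n  # assumes the length divides evenly by the rank count
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]


grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]  # two ranks
print(all_reduce(grads))
print(reduce_scatter(grads))
print(all_gather(reduce_scatter(grads)))
```

The last line shows the identity that ZeRO-style sharding relies on: a ReduceScatter followed by an AllGather produces the same result as one AllReduce.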
A ring AllReduce has each GPU send 2(N−1)/N times the message size, a figure that approaches 2× as N grows instead of scaling with the number of GPUs. That is why ring AllReduce is bandwidth-optimal for large messages, while latency-bound small messages prefer tree algorithms, which finish in O(log N) steps rather than the ring's 2(N−1). NCCL picks per message size; you rarely should.
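The 2(N−1)/N figure is easy to verify by simulating the ring. The sketch below is a toy model, assuming one number per chunk and N chunks per rank; it runs the reduce-scatter phase and then the all-gather phase, counting how many chunks each rank sends. It illustrates the textbook ring algorithm, not NCCL's actual implementation.

```python
def ring_all_reduce(data):
    """Simulate ring AllReduce; data[r][c] is chunk c on rank r."""
    n = len(data)
    buf = [list(row) for row in data]
    sent = [0] * n  # chunks sent by each rank

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so the sends are simultaneous.
        msgs = [((r - step) % n, buf[r][(r - step) % n]) for r in range(n)]
        for r, (c, value) in enumerate(msgs):
            buf[(r + 1) % n][c] += value
            sent[r] += 1

    # Phase 2, all-gather: pass the finished chunks around until every rank has all of them.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, buf[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (c, value) in enumerate(msgs):
            buf[(r + 1) % n][c] = value
            sent[r] += 1

    return buf, sent


n = 4
data = [[float(10 * r + c) for c in range(n)] for r in range(n)]
result, sent = ring_all_reduce(data)
expected = [sum(data[r][c] for r in range(n)) for c in range(n)]

print(result[0], sent)
```

Each rank sends 2(N−1) chunks, and each chunk is 1/N of the message, so the per-GPU traffic comes out to 2(N−1)/N of the message size: 1.5× for four GPUs.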
Communication Libraries
| Library | Vendor | Hardware | Topology-aware | Best for |
|---|---|---|---|---|
| NCCL | NVIDIA | CUDA GPUs | Yes | Anything NVIDIA — the default |
| RCCL | AMD | ROCm GPUs | Yes | MI300-class clusters |
| oneCCL | Intel | Intel GPUs/CPUs | Yes | Intel Data Center GPUs |
| Gloo | Meta | CPU + GPU | No | CPU fallback, development |
| MPI | community | everything | varies | HPC sites with MPI infrastructure |
| MSCCL | Microsoft | CUDA GPUs | programmable | Custom algorithms, research |
NCCL detects your topology (NVLink, PCIe, InfiniBand), then picks ring, tree, or SHARP-offloaded algorithms per message. As of 2.27 it also runs SHARP reductions on NVLink fabrics and offers symmetric memory across domains up to a full NVL72.
torch.distributed.init_process_group(backend='nccl')
RCCL is NCCL's API twin for AMD — PyTorch even selects it through the 'nccl' backend name on ROCm. oneCCL covers Intel (backend='ccl' after importing oneccl_bindings_for_pytorch). Gloo (backend='gloo') works everywhere but optimizes nothing — keep it for CPUs and debugging. MPI integrates with schedulers and existing HPC tooling. MSCCL lets you write custom collectives when the built-in algorithms don't fit your topology.
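The backend names in that paragraph fit in a small lookup. `pick_backend` below is a hypothetical helper, not part of PyTorch; it only encodes the vendor-to-backend mapping described above and falls back to Gloo for anything it does not recognize.

```python
def pick_backend(vendor: str) -> str:
    """Map a hardware vendor to a torch.distributed backend string.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    table = {
        "nvidia": "nccl",  # NCCL
        "amd": "nccl",     # RCCL is selected through the 'nccl' name on ROCm
        "intel": "ccl",    # needs oneccl_bindings_for_pytorch imported first
        "cpu": "gloo",     # portable fallback
    }
    return table.get(vendor.lower(), "gloo")


print(pick_backend("AMD"))
```

The returned string is what you would pass as `backend=` to `torch.distributed.init_process_group`.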
Choosing Your Setup
- Vendor picks the library: NVIDIA → NCCL, AMD → RCCL, Intel → oneCCL, mixed or CPU → Gloo/MPI.
- Scale picks the fabric: 2 GPUs → PCIe is survivable; 4–8 → you want an NVSwitch node; dozens → NVL72 or multi-node InfiniBand; hundreds+ → multi-node is mandatory, design for the hierarchy.
- Consumer hardware reality check: no NVLink since the 3090 — a dual-4090/5090 box is PCIe-only, and that's a constraint to design around, not a bug to fix.
Performance: Living With the Cliff
Bandwidth vs latency. Large transfers (>1 MB) are bandwidth-bound — interconnect choice dominates. Small transfers (<100 KB) are latency-bound — even PCIe looks fine, and the fix is batching, not faster links. DDP's gradient bucketing (bucket_cap_mb, default 25) exists precisely to turn many small AllReduces into few large ones.
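A common way to reason about this split is the alpha-beta cost model: time equals a fixed latency plus size divided by bandwidth. The sketch below uses illustrative numbers only (1 µs latency, matching the NVLink figure quoted earlier, and 900 GB/s); real collective latency is higher once kernel launches and synchronization are counted.

```python
def transfer_time_us(size_bytes: float, latency_us: float, bandwidth_gb_s: float) -> float:
    """Alpha-beta cost model: fixed latency plus size divided by bandwidth."""
    return latency_us + size_bytes / (bandwidth_gb_s * 1e9) * 1e6


def crossover_bytes(latency_us: float, bandwidth_gb_s: float) -> float:
    """Message size at which the latency term equals the bandwidth term."""
    return latency_us * 1e-6 * bandwidth_gb_s * 1e9


# Illustrative numbers only, not measurements.
LAT_US, BW = 1.0, 900.0
for size in (10_000, 1_000_000, 100_000_000):
    t = transfer_time_us(size, LAT_US, BW)
    share = LAT_US / t
    print(f"{size:>11,d} B: {t:9.2f} us, latency share {share:5.1%}")
print(f"crossover at about {crossover_bytes(LAT_US, BW) / 1e3:.0f} KB")
```

Below the crossover size the link spends most of its time waiting rather than moving bytes, which is the case bucketing is designed to avoid.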
Overlap communication with compute. While layer N's gradients sync, layer N−1's are still being computed. Bigger buckets use bandwidth better; smaller buckets start syncing sooner. It's a tunable, not a constant.
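To see the tradeoff, consider a simplified greedy bucketing pass. This is a sketch of the idea only, not DDP's actual bucket assignment: gradients are packed in reverse layer order (the order the backward pass produces them), and a bucket closes once it reaches the cap. The layer sizes are hypothetical.

```python
def bucketize(grad_sizes_mb, bucket_cap_mb=25.0):
    """Greedily pack gradient sizes into buckets of roughly bucket_cap_mb.

    Sizes are consumed in reverse because backward produces the last
    layer's gradients first. A bucket closes once it reaches the cap.
    """
    buckets, current, current_mb = [], [], 0.0
    for size in reversed(grad_sizes_mb):
        current.append(size)
        current_mb += size
        if current_mb >= bucket_cap_mb:
            buckets.append(current)
            current, current_mb = [], 0.0
    if current:  # leftover gradients form a final, smaller bucket
        buckets.append(current)
    return buckets


# Hypothetical model: per-layer gradient sizes in MB, first layer to last.
layer_grads_mb = [4.0, 16.0, 16.0, 16.0, 16.0, 16.0, 2.0]
for cap in (5.0, 25.0, 100.0):
    print(f"cap {cap:5.1f} MB -> {len(bucketize(layer_grads_mb, cap))} buckets")
```

A small cap yields many buckets that start syncing early but each underuse the link; a very large cap yields one bucket that cannot start until the whole backward pass is done.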
Respect the hierarchy. Reduce within the NVLink domain first, cross the network once with the result. DeepSpeed and Megatron do this automatically; manual setups use separate intra- and inter-node process groups.
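The three-step pattern can be modeled in a few lines. This is a toy simulation on Python lists, assuming a simple reduce, exchange, broadcast scheme; it is not how DeepSpeed or Megatron implement it, and the network count is a simplification (one contribution per node rather than one per GPU).

```python
def hierarchical_all_reduce(nodes):
    """nodes[i][j] is the gradient vector on GPU j of node i.

    Returns the reduced layout plus the number of vectors contributed over the network.
    """
    # Step 1: reduce inside each node (NVLink domain); no network traffic.
    node_sums = [[sum(vals) for vals in zip(*gpus)] for gpus in nodes]

    # Step 2: combine across nodes; each node contributes one vector, not one per GPU.
    global_sum = [sum(vals) for vals in zip(*node_sums)]
    network_vectors = len(nodes)

    # Step 3: broadcast the result inside each node, again over NVLink.
    result = [[list(global_sum) for _ in gpus] for gpus in nodes]
    return result, network_vectors


nodes = [
    [[1.0, 2.0], [3.0, 4.0]],      # node 0, two GPUs
    [[10.0, 20.0], [30.0, 40.0]],  # node 1, two GPUs
]
result, crossed = hierarchical_all_reduce(nodes)
print(result[0][0], crossed)
```

With four GPUs on two nodes, only two vectors cross the slow link instead of four, and the saving grows with the number of GPUs per node.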
| Practice | Why | How |
|---|---|---|
| Verify topology detection | Misdetection silently costs 10× | NCCL_DEBUG=INFO python train.py |
| Inspect your actual wiring | Marketing slides ≠ your motherboard | nvidia-smi topo -m |
| Profile the overlap | Find sync stalls visually | nsys profile --trace=cuda,nvtx python train.py |
| Tune bucketing | Balance bandwidth vs overlap | DDP(model, bucket_cap_mb=25) |
| Pin library versions | Mixed NCCL versions hang silently | containerized environments (NGC) |
| Gradient compression | Cuts cross-node AllReduce traffic when bandwidth-limited | ddp_comm_hooks.powerSGD_hook |
| Pitfall | Symptom | Fix |
|---|---|---|
| PCIe bottleneck | Scaling dies beyond 2 consumer GPUs | Check nvidia-smi topo -m; rent NVSwitch nodes |
| No GPUDirect | Gradients bounce through CPU memory | Confirm NET/IB (not NET/Socket) in NCCL_DEBUG output |
| Topology-blind scheduling | Cluster scatters your 8 GPUs across nodes or racks | Slurm locality constraints: --nodes=1 --gres=gpu:8 --exclusive |
| Tiny messages | High AllReduce latency, idle links | Increase bucket_cap_mb; fuse small tensors |
Further Reading
- NCCL Documentation - algorithms, environment variables, and tuning straight from the source
- NVIDIA GB200 NVL72 - the rack-as-one-GPU architecture in detail
- PyTorch Distributed notes - how DDP buckets, overlaps, and calls the collectives on this page
