MPI Fundamentals: Message Passing for Distributed Computing

Complete MPI guide — point-to-point and collective communication with real C and mpi4py code, deadlock patterns, performance guidelines, communicator splitting, and debugging on HPC clusters.

What Is MPI?

The Message Passing Interface (MPI) is the de facto standard for programming distributed-memory parallel systems. Unlike shared-memory models such as OpenMP — where threads read and write the same address space — MPI processes each have their own private memory. The only way to share data is to explicitly send and receive messages over the network.

This design maps directly to HPC cluster hardware: each compute node has its own RAM, and nodes communicate through a high-speed interconnect (InfiniBand, Slingshot, or Ethernet). MPI gives you a portable API to express that communication without worrying about the underlying transport.

MPI is a specification, not an implementation. The standard defines the API; libraries like OpenMPI, MPICH, and Intel MPI provide the actual code. Your program links against whichever implementation is installed on the cluster, and the same source code works everywhere.

The MPI Execution Model

MPI programs follow the SPMD pattern — Single Program, Multiple Data. You compile one binary, and the MPI launcher (mpirun or mpiexec) starts N copies of it across the cluster. Each copy is called a rank, numbered 0 through N−1.

mpirun -np 4 ./my_program # Launches 4 ranks: rank 0, rank 1, rank 2, rank 3

Every rank executes the same code, but each can query its own rank number and the total world size to decide what to do. Rank 0 might load data and distribute it; other ranks might process their portion and send results back.

The default communicator, MPI_COMM_WORLD, includes all ranks. Think of it as the group chat that every process belongs to at startup. You can create smaller communicators later for sub-group communication.

SPMD Does Not Mean SIMD

Even though every rank runs the same program, each rank can follow a completely different code path based on its rank number. SPMD is about deployment (one binary, many copies), not about lock-step execution.

Compiling and Running

# Compile C code
mpicc -o my_program my_program.c -O2

# Compile C++ code
mpicxx -o my_program my_program.cpp -O2 -std=c++17

# Run with 4 ranks
mpirun -np 4 ./my_program

# Run across multiple nodes (Slurm)
srun --ntasks=8 --nodes=2 ./my_program

# Python with mpi4py (no compilation needed)
mpirun -np 4 python my_program.py

mpirun vs srun

On Slurm clusters, use srun instead of mpirun. Slurm’s srun is MPI-aware and automatically sets up the process layout based on your #SBATCH directives. Using mpirun inside a Slurm job can cause process placement conflicts.

Point-to-Point Communication

The simplest MPI operations involve two ranks: one sends, the other receives.

Blocking: MPI_Send and MPI_Recv

MPI_Send blocks until the message buffer is safe to reuse (the data has been copied to a system buffer or delivered). MPI_Recv blocks until the message has arrived and been written into the receive buffer. Both calls include a tag (an integer label) so the receiver can distinguish between different messages from the same sender.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, data = 42, buf;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status status;

    if (rank == 0) {
        MPI_Send(&data, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
        printf("Rank 0 sent %d to Rank 1\n", data);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d from Rank 0\n", buf);
    }

    MPI_Finalize();
    return 0;
}
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = 42
    comm.send(data, dest=1, tag=42)
    print(f"Rank 0 sent {data}")
elif rank == 1:
    data = comm.recv(source=0, tag=42)
    print(f"Rank 1 received {data}")

Blocking calls are simple but can waste time: the sender sits idle while the network transfers data, and the receiver sits idle until the message arrives.

Non-Blocking: MPI_Isend and MPI_Irecv

Non-blocking variants return immediately, giving you a request handle. You continue doing useful work — local computation, preparing the next message — and later call MPI_Wait (or MPI_Test) to ensure the operation has completed before touching the buffer again.

MPI_Request req;
MPI_Isend(&data, 1, MPI_INT, 1, 42, MPI_COMM_WORLD, &req);

do_local_computation();             // runs while message is in flight

MPI_Wait(&req, MPI_STATUS_IGNORE);  // now safe to modify data
req = comm.isend(data, dest=1, tag=42)
do_local_computation()
req.wait()

Non-blocking communication is essential for latency hiding: while the network is busy moving bytes, the CPU stays productive. Most high-performance MPI codes use non-blocking calls wherever possible.

Collective Operations

Collectives involve all ranks in a communicator and express common communication patterns in a single call. The MPI runtime can optimize the underlying algorithm (tree-based broadcast, recursive halving for reduce) far better than you could with manual send/recv loops.

  • MPI_Bcast — one rank sends its data to every other rank
  • MPI_Scatter — one rank splits an array and distributes pieces to all ranks
  • MPI_Gather — all ranks send their pieces to one rank, which assembles them
  • MPI_Reduce — combines values from all ranks using an operation (sum, max, min) and stores the result on one rank
  • MPI_Allreduce — like Reduce, but every rank gets the final result
  • MPI_Allgather — like Gather, but every rank gets the full assembled array

The key rule: every rank in the communicator must call the same collective. If rank 2 skips the MPI_Bcast call that ranks 0, 1, and 3 are executing, the program hangs forever.
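To make the data movement concrete, here is a pure-Python sketch of what each collective leaves on each rank. This is a pedagogical simulation, not MPI code: each list is indexed by rank number, and the function names are ours.

```python
# Pure-Python simulation (not MPI): each list is indexed by rank number.

def bcast(values, root):
    # every rank ends up holding the root's value
    return [values[root] for _ in values]

def scatter(array, size):
    # the root's array is split into equal chunks, one per rank
    chunk = len(array) // size
    return [array[r * chunk:(r + 1) * chunk] for r in range(size)]

def reduce_sum(values, root):
    # the combined result lands only on the root; other ranks get nothing
    total = sum(values)
    return [total if r == root else None for r in range(len(values))]

def allreduce_sum(values):
    # like reduce, but every rank gets the final result
    total = sum(values)
    return [total] * len(values)

local = [0, 1, 2, 3]                      # value held by each of 4 ranks
print(bcast(local, root=0))               # [0, 0, 0, 0]
print(scatter([10, 20, 30, 40], size=4))  # [[10], [20], [30], [40]]
print(reduce_sum(local, root=0))          # [6, None, None, None]
print(allreduce_sum(local))               # [6, 6, 6, 6]
```

Reading the output column by column shows the contract each collective makes: Bcast replicates, Scatter partitions, Reduce concentrates, and Allreduce both combines and replicates.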

Communicators and Groups

MPI_COMM_WORLD is convenient, but real applications often need to isolate communication. For example, in a hybrid data-parallel + pipeline-parallel setup, you might want ranks 0–3 to form one pipeline group and ranks 4–7 to form another.

MPI_Comm_split creates new communicators by assigning each rank a color (which group it belongs to) and a key (its ordering within that group). Ranks with the same color end up in the same sub-communicator. Collectives on a sub-communicator only involve its members, so the two pipeline groups can broadcast independently without interfering.

int color = rank / 4;  // ranks 0-3 get color 0, ranks 4-7 get color 1
int key = rank % 4;    // local ordering within group

MPI_Comm sub_comm;
MPI_Comm_split(MPI_COMM_WORLD, color, key, &sub_comm);

int sub_rank, sub_size;
MPI_Comm_rank(sub_comm, &sub_rank);
MPI_Comm_size(sub_comm, &sub_size);

MPI_Bcast(&data, 1, MPI_INT, 0, sub_comm);  // only within group

Sub-communicators are also important for library design: an MPI library can create its own communicator so its internal messages never collide with application-level communication.
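The color/key arithmetic above can be checked without running MPI at all. This pure-Python sketch (the function name is ours) computes which ranks land in which sub-communicator for an 8-rank world split into groups of 4:

```python
# Pure-Python sketch of MPI_Comm_split's color/key mapping:
# ranks sharing a color land in the same sub-communicator, ordered by key.

def split(world_size, group_size):
    groups = {}
    for rank in range(world_size):
        color, key = rank // group_size, rank % group_size
        groups.setdefault(color, []).append((key, rank))
    # sort each group's members by key to get their sub-communicator ordering
    return {c: [r for _, r in sorted(members)] for c, members in groups.items()}

print(split(8, 4))  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

Within each group, the key determines the new local rank, so world rank 4 becomes rank 0 of the second sub-communicator.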

Deadlock Patterns

Deadlock is the most common MPI bug. It happens when two or more ranks are waiting for each other to complete an operation that neither can finish.

Classic MPI Deadlock

If Rank 0 calls MPI_Send to Rank 1 while Rank 1 simultaneously calls MPI_Send to Rank 0, and the messages are too large for the implementation to buffer eagerly, both ranks block waiting for the other to post a matching Recv. Neither can proceed — the program hangs silently with no error message. (Small messages may slip through via internal buffering, which makes this bug appear only at certain message sizes.)

The classic pattern looks like this:

if (rank == 0) {
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // BLOCKS
    MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
    MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  // BLOCKS
    MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}

Three ways to fix it:

  1. Reorder operations — have one rank send first while the other receives first, then swap roles
  2. Use non-blocking calls — MPI_Isend followed by MPI_Recv followed by MPI_Wait avoids the circular dependency
  3. Use MPI_Sendrecv — a combined call that sends and receives simultaneously, designed exactly for this exchange pattern

MPI in Deep Learning

If you’ve used distributed training in PyTorch or TensorFlow, you’ve used MPI concepts even if you never called MPI directly. The gradient synchronization step in data-parallel training — where each GPU computes gradients on its shard and then all GPUs average them — is an AllReduce operation.

NVIDIA’s NCCL (NVIDIA Collective Communications Library) is essentially MPI-style collectives optimized for GPU-to-GPU communication over NVLink and InfiniBand. Horovod, one of the earliest distributed training frameworks, was built directly on top of MPI. PyTorch’s torch.distributed supports MPI, NCCL, and Gloo as backends — all implementing the same collective operation semantics that MPI standardized decades ago.

Understanding MPI fundamentals makes distributed training patterns far more intuitive: when you see dist.all_reduce(tensor) in PyTorch, you know exactly what’s happening at the communication level.
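Gradient synchronization is exactly an AllReduce(SUM) followed by a division by the world size. This pure-Python sketch (not NCCL or PyTorch; the names are ours) shows that every "GPU" ends up with the same averaged gradient:

```python
# Pure-Python sketch: data-parallel gradient averaging as an AllReduce.
# Each "GPU" holds the gradient computed on its own data shard.

def allreduce_sum(per_rank_grads):
    # element-wise sum across ranks; every rank receives the full result
    n = len(per_rank_grads[0])
    summed = [sum(g[i] for g in per_rank_grads) for i in range(n)]
    return [summed[:] for _ in per_rank_grads]

world_size = 4
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # one list per GPU

synced = allreduce_sum(grads)
averaged = [[x / world_size for x in g] for g in synced]
print(averaged[0])  # [4.0, 5.0], identical on every rank
```

Real implementations use ring or tree algorithms to do the same reduction with far less traffic, but the semantics are what matter here: after the call, all ranks hold the same combined value.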

Performance: Latency vs Bandwidth

MPI communication time has two components: latency (fixed per-message overhead) and bandwidth (time proportional to message size). Understanding their interplay is critical for optimizing distributed code.
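A common first-order way to reason about this is the alpha-beta model: the time to move a message of s bytes is roughly T(s) = alpha + s/beta, where alpha is the per-message latency and beta the bandwidth. The numbers below are illustrative assumptions, not measurements:

```python
# Alpha-beta cost model for message transfer time (illustrative numbers).
alpha = 2e-6   # assumed per-message latency: 2 microseconds
beta = 10e9    # assumed bandwidth: 10 GB/s

def transfer_time(size_bytes):
    # fixed latency plus size-proportional bandwidth term
    return alpha + size_bytes / beta

# Ten 100-byte messages pay the latency ten times; one 1000-byte message
# pays it once. Same payload, very different cost.
ten_small = 10 * transfer_time(100)
one_large = transfer_time(1000)
print(ten_small > one_large)  # True
```

This is exactly why the first guideline below says to batch small messages: at small sizes the alpha term dominates, so fewer, larger messages win.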

Practical Guidelines

Message Size    Dominated By    Optimization
< 1 KB          Latency         Batch small messages into fewer large ones
1 KB – 64 KB    Both            Use non-blocking calls to overlap with computation
> 64 KB         Bandwidth       Choose the right network (InfiniBand >> Ethernet)

Communication-Computation Overlap

The key to high-performance MPI code is hiding communication behind computation:

// Post non-blocking receives for halo exchange
MPI_Irecv(ghost_left,  n, MPI_DOUBLE, left,  0, comm, &req[0]);
MPI_Irecv(ghost_right, n, MPI_DOUBLE, right, 1, comm, &req[1]);

// Send boundary data
MPI_Isend(boundary_left,  n, MPI_DOUBLE, left,  1, comm, &req[2]);
MPI_Isend(boundary_right, n, MPI_DOUBLE, right, 0, comm, &req[3]);

// Compute interior (doesn't need ghost data)
compute_interior(grid);

// Wait for ghost data, then compute boundaries
MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
compute_boundaries(grid, ghost_left, ghost_right);

Hybrid MPI + OpenMP

Modern HPC codes commonly use MPI for inter-node communication and OpenMP for intra-node threading. This matches the hardware: MPI messages cross the network between nodes, while OpenMP threads share memory within a node.

#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        compute_chunk(data, tid, omp_get_num_threads());
    }

    // Main thread does MPI communication
    MPI_Allreduce(MPI_IN_PLACE, results, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Thread Safety Levels

Level                   Meaning                                   Use Case
MPI_THREAD_SINGLE       Only one thread exists                    Pure MPI, no OpenMP
MPI_THREAD_FUNNELED     Only the main thread calls MPI            Most hybrid codes
MPI_THREAD_SERIALIZED   Any thread can call MPI, one at a time    Rare
MPI_THREAD_MULTIPLE     Any thread can call MPI concurrently      Advanced, has a performance cost

Slurm Configuration for Hybrid Jobs

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2   # 2 MPI ranks per node
#SBATCH --cpus-per-task=16    # 16 OpenMP threads per rank

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_program

Thread Oversubscription

If you run 4 MPI ranks per node with 16 OpenMP threads each on a 32-core node, you get 64 threads competing for 32 cores. Always ensure ntasks-per-node × cpus-per-task ≤ total cores.
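That arithmetic is worth encoding as a pre-flight check. A minimal sketch (the function name is ours):

```python
# Sanity check for hybrid job layouts: total threads per node must not
# exceed the cores available on each node.

def oversubscribed(ntasks_per_node, cpus_per_task, cores_per_node):
    return ntasks_per_node * cpus_per_task > cores_per_node

print(oversubscribed(2, 16, 32))  # False: 32 threads on 32 cores, fine
print(oversubscribed(4, 16, 32))  # True: 64 threads fighting for 32 cores
```

Running this against your planned #SBATCH values before submitting catches the mistake at zero cost instead of via a mysteriously slow job.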

Common Pitfalls

1. Tag Mismatches

Every MPI_Send includes a tag, and the matching MPI_Recv must specify the same tag (or use MPI_ANY_TAG). A mismatched tag means the receive never finds its message, and the program hangs. This is especially tricky when multiple message types flow between the same pair of ranks.

2. Buffer Reuse Before Wait

With non-blocking calls, the send buffer must not be modified until MPI_Wait confirms the operation is complete. Writing to the buffer while the MPI library is still reading from it causes data corruption — a bug that may only appear intermittently under load.

3. Not All Ranks Calling a Collective

Collectives are synchronization points. If rank 3 takes a different code path and skips an MPI_Bcast that the other ranks call, the program deadlocks. Every collective must be reached by every rank in the communicator, in the same order.

4. Forgetting MPI_Finalize

MPI_Finalize tells the runtime to clean up resources and flush pending messages. Skipping it can cause message loss, zombie processes, or job scheduler errors. Always call it before exiting, even in error-handling paths.

Debugging MPI Programs

Detecting Deadlocks

# Run with timeout to catch hangs
timeout 60 mpirun -np 4 ./my_program

# OpenMPI: enable debug output
mpirun -np 4 --mca mpi_show_mca_params all ./my_program

# Intel MPI: enable statistics
export I_MPI_DEBUG=5
mpirun -np 4 ./my_program

Checking for Errors

MPI functions return error codes that most programs ignore. In debug builds, check them:

int err = MPI_Send(&data, 1, MPI_INT, dest, tag, comm);
if (err != MPI_SUCCESS) {
    char errstr[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(err, errstr, &len);
    fprintf(stderr, "Rank %d: MPI_Send failed: %s\n", rank, errstr);
    MPI_Abort(comm, err);
}

Common MPI Errors

Symptom                    Cause                               Fix
Program hangs silently     Deadlock — mismatched Send/Recv     Reorder sends/receives or use MPI_Sendrecv
MPI_ERR_RANK               Sending to rank ≥ world size        Check MPI_Comm_size before sending
MPI_ERR_TRUNCATE           Receive buffer too small            Match send count with receive count
MPI_ERR_TAG                Tag out of range                    Tags must be 0 to MPI_TAG_UB
Segfault in MPI_Recv       Buffer not allocated                Allocate the receive buffer before calling Recv
Wrong results, no error    Buffer reuse before Wait            Don't modify an Isend buffer until MPI_Wait completes

Key Takeaways

  1. Message passing, not shared memory — MPI processes have private address spaces and communicate by explicitly sending and receiving messages over the network.

  2. Collectives beat manual loops — Broadcast, Scatter, Gather, and AllReduce express common patterns in one call and let the runtime optimize the algorithm.

  3. Non-blocking for performance — Isend/Irecv let you overlap communication with computation, hiding network latency behind useful work.

  4. Deadlock is the classic MPI bug — avoid it by ordering sends and receives carefully, using non-blocking calls, or using MPI_Sendrecv.
