MPI Fundamentals: Message Passing for Distributed Computing

Complete MPI guide — point-to-point and collective communication with real C and mpi4py code, deadlock patterns, performance guidelines, communicator splitting, and debugging on HPC clusters.

What Is MPI?

The Message Passing Interface (MPI) is the de facto standard for programming distributed-memory parallel systems. Unlike shared-memory models such as OpenMP — where threads read and write the same address space — MPI processes each have their own private memory. The only way to share data is to explicitly send and receive messages over the network.

This design maps directly to HPC cluster hardware: each compute node has its own RAM, and nodes communicate through a high-speed interconnect (InfiniBand, Slingshot, or Ethernet). MPI gives you a portable API to express that communication without worrying about the underlying transport.

MPI is a specification, not an implementation. The standard defines the API; libraries like OpenMPI, MPICH, and Intel MPI provide the actual code. Your program links against whichever implementation is installed on the cluster, and the same source code works everywhere.

The MPI Execution Model

MPI programs follow the SPMD pattern — Single Program, Multiple Data. You compile one binary, and the MPI launcher (mpirun or mpiexec) starts N copies of it across the cluster. Each copy is called a rank, numbered 0 through N−1.

mpirun -np 4 ./my_program # Launches 4 ranks: rank 0, rank 1, rank 2, rank 3

Every rank executes the same code, but each can query its own rank number and the total world size to decide what to do. Rank 0 might load data and distribute it; other ranks might process their portion and send results back.

The default communicator, MPI_COMM_WORLD, includes all ranks. Think of it as the group chat that every process belongs to at startup. You can create smaller communicators later for sub-group communication.

SPMD Does Not Mean SIMD

Even though every rank runs the same program, each rank can follow a completely different code path based on its rank number. SPMD is about deployment (one binary, many copies), not about lock-step execution.

Compiling and Running

# Compile C code
mpicc -o my_program my_program.c -O2

# Compile C++ code
mpicxx -o my_program my_program.cpp -O2 -std=c++17

# Run with 4 ranks
mpirun -np 4 ./my_program

# Run across multiple nodes (Slurm)
srun --ntasks=8 --nodes=2 ./my_program

# Python with mpi4py (no compilation needed)
mpirun -np 4 python my_program.py

mpirun vs srun

On Slurm clusters, use srun instead of mpirun. Slurm’s srun is MPI-aware and automatically sets up the process layout based on your #SBATCH directives. Using mpirun inside a Slurm job can cause process placement conflicts.

Point-to-Point Communication

The simplest MPI operations involve two ranks: one sends, the other receives.

Blocking: MPI_Send and MPI_Recv

MPI_Send blocks until the message buffer is safe to reuse (the data has been copied to a system buffer or delivered). MPI_Recv blocks until the message has arrived and been written into the receive buffer. Both calls include a tag (an integer label) so the receiver can distinguish between different messages from the same sender.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, data = 42, buf;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status status;

    if (rank == 0) {
        MPI_Send(&data, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
        printf("Rank 0 sent %d to Rank 1\n", data);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d from Rank 0\n", buf);
    }

    MPI_Finalize();
    return 0;
}
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = 42
    comm.send(data, dest=1, tag=42)
    print(f"Rank 0 sent {data}")
elif rank == 1:
    data = comm.recv(source=0, tag=42)
    print(f"Rank 1 received {data}")

Blocking calls are simple but can waste time: the sender sits idle while the network transfers data, and the receiver sits idle until the message arrives.

Non-Blocking: MPI_Isend and MPI_Irecv

Non-blocking variants return immediately, giving you a request handle. You continue doing useful work — local computation, preparing the next message — and later call MPI_Wait (or MPI_Test) to ensure the operation has completed before touching the buffer again.

MPI_Request req;
MPI_Isend(&data, 1, MPI_INT, 1, 42, MPI_COMM_WORLD, &req);

do_local_computation();             // runs while message is in flight

MPI_Wait(&req, MPI_STATUS_IGNORE);  // now safe to modify data
req = comm.isend(data, dest=1, tag=42)
do_local_computation()
req.wait()

Non-blocking communication is essential for latency hiding: while the network is busy moving bytes, the CPU stays productive. Most high-performance MPI codes use non-blocking calls wherever possible.

Collective Operations

Collectives involve all ranks in a communicator and express common communication patterns in a single call. The MPI runtime can optimize the underlying algorithm (tree-based broadcast, recursive halving for reduce) far better than you could with manual send/recv loops.

  • MPI_Bcast — one rank sends its data to every other rank
  • MPI_Scatter — one rank splits an array and distributes pieces to all ranks
  • MPI_Gather — all ranks send their pieces to one rank, which assembles them
  • MPI_Reduce — combines values from all ranks using an operation (sum, max, min) and stores the result on one rank
  • MPI_Allreduce — like Reduce, but every rank gets the final result
  • MPI_Allgather — like Gather, but every rank gets the full assembled array

The key rule: every rank in the communicator must call the same collective. If rank 2 skips the MPI_Bcast call that ranks 0, 1, and 3 are executing, the program hangs forever.
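To make the data movement concrete, here is a pure-Python sketch of what each collective leaves on each rank. This is a pedagogical simulation, not MPI code: each list is indexed by rank number, and the function names are ours.

```python
# Pure-Python simulation (not MPI): each list is indexed by rank number.

def bcast(values, root):
    # every rank ends up holding the root's value
    return [values[root] for _ in values]

def scatter(array, size):
    # the root's array is split into equal chunks, one per rank
    chunk = len(array) // size
    return [array[r * chunk:(r + 1) * chunk] for r in range(size)]

def reduce_sum(values, root):
    # the combined result lands only on the root; other ranks get nothing
    total = sum(values)
    return [total if r == root else None for r in range(len(values))]

def allreduce_sum(values):
    # like reduce, but every rank gets the final result
    total = sum(values)
    return [total] * len(values)

local = [0, 1, 2, 3]                      # value held by each of 4 ranks
print(bcast(local, root=0))               # [0, 0, 0, 0]
print(scatter([10, 20, 30, 40], size=4))  # [[10], [20], [30], [40]]
print(reduce_sum(local, root=0))          # [6, None, None, None]
print(allreduce_sum(local))               # [6, 6, 6, 6]
```

Reading the output column by column shows the contract each collective makes: Bcast replicates, Scatter partitions, Reduce concentrates, and Allreduce both combines and replicates.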

Communicators and Groups

MPI_COMM_WORLD is convenient, but real applications often need to isolate communication. For example, in a hybrid data-parallel + pipeline-parallel setup, you might want ranks 0–3 to form one pipeline group and ranks 4–7 to form another.

MPI_Comm_split creates new communicators by assigning each rank a color (which group it belongs to) and a key (its ordering within that group). Ranks with the same color end up in the same sub-communicator. Collectives on a sub-communicator only involve its members, so the two pipeline groups can broadcast independently without interfering.

int color = rank / 4;  // ranks 0-3 get color 0, ranks 4-7 get color 1
int key = rank % 4;    // local ordering within group

MPI_Comm sub_comm;
MPI_Comm_split(MPI_COMM_WORLD, color, key, &sub_comm);

int sub_rank, sub_size;
MPI_Comm_rank(sub_comm, &sub_rank);
MPI_Comm_size(sub_comm, &sub_size);

MPI_Bcast(&data, 1, MPI_INT, 0, sub_comm);  // only within group

Sub-communicators are also important for library design: an MPI library can create its own communicator so its internal messages never collide with application-level communication.
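The color/key arithmetic above can be checked without running MPI at all. This pure-Python sketch (the function name is ours) computes which ranks land in which sub-communicator for an 8-rank world split into groups of 4:

```python
# Pure-Python sketch of MPI_Comm_split's color/key mapping:
# ranks sharing a color land in the same sub-communicator, ordered by key.

def split(world_size, group_size):
    groups = {}
    for rank in range(world_size):
        color, key = rank // group_size, rank % group_size
        groups.setdefault(color, []).append((key, rank))
    # sort each group's members by key to get their sub-communicator ordering
    return {c: [r for _, r in sorted(members)] for c, members in groups.items()}

print(split(8, 4))  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

Within each group, the key determines the new local rank, so world rank 4 becomes rank 0 of the second sub-communicator.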

Deadlock Patterns

Deadlock is the most common MPI bug. It happens when two or more ranks are waiting for each other to complete an operation that neither can finish.

Classic MPI Deadlock

If Rank 0 calls MPI_Send to Rank 1 while Rank 1 simultaneously calls MPI_Send to Rank 0, and the messages are too large for the implementation to buffer eagerly, both ranks block waiting for the other to post a matching Recv. Neither can proceed — the program hangs silently with no error message. (Small messages may slip through via internal buffering, which makes this bug appear only at certain message sizes.)

The classic pattern looks like this:

if (rank == 0) {
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // BLOCKS
    MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
} else if (rank == 1) {
    MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  // BLOCKS
    MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}

Three ways to fix it:

  1. Reorder operations — have one rank send first while the other receives first, then swap roles
  2. Use non-blocking calls — MPI_Isend followed by MPI_Recv followed by MPI_Wait avoids the circular dependency
  3. Use MPI_Sendrecv — a combined call that sends and receives simultaneously, designed exactly for this exchange pattern

MPI in Deep Learning

If you’ve used distributed training in PyTorch or TensorFlow, you’ve used MPI concepts even if you never called MPI directly. The gradient synchronization step in data-parallel training — where each GPU computes gradients on its shard and then all GPUs average them — is an AllReduce operation.

NVIDIA’s NCCL (NVIDIA Collective Communications Library) is essentially MPI-style collectives optimized for GPU-to-GPU communication over NVLink and InfiniBand. Horovod, one of the earliest distributed training frameworks, was built directly on top of MPI. PyTorch’s torch.distributed supports MPI, NCCL, and Gloo as backends — all implementing the same collective operation semantics that MPI standardized decades ago.

Understanding MPI fundamentals makes distributed training patterns far more intuitive: when you see dist.all_reduce(tensor) in PyTorch, you know exactly what’s happening at the communication level.
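Gradient synchronization is exactly an AllReduce(SUM) followed by a division by the world size. This pure-Python sketch (not NCCL or PyTorch; the names are ours) shows that every "GPU" ends up with the same averaged gradient:

```python
# Pure-Python sketch: data-parallel gradient averaging as an AllReduce.
# Each "GPU" holds the gradient computed on its own data shard.

def allreduce_sum(per_rank_grads):
    # element-wise sum across ranks; every rank receives the full result
    n = len(per_rank_grads[0])
    summed = [sum(g[i] for g in per_rank_grads) for i in range(n)]
    return [summed[:] for _ in per_rank_grads]

world_size = 4
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # one list per GPU

synced = allreduce_sum(grads)
averaged = [[x / world_size for x in g] for g in synced]
print(averaged[0])  # [4.0, 5.0], identical on every rank
```

Real implementations use ring or tree algorithms to do the same reduction with far less traffic, but the semantics are what matter here: after the call, all ranks hold the same combined value.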

Performance: Latency vs Bandwidth

MPI communication time has two components: latency (fixed per-message overhead) and bandwidth (time proportional to message size). Understanding their interplay is critical for optimizing distributed code.
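A common first-order way to reason about this is the alpha-beta model: the time to move a message of s bytes is roughly T(s) = alpha + s/beta, where alpha is the per-message latency and beta the bandwidth. The numbers below are illustrative assumptions, not measurements:

```python
# Alpha-beta cost model for message transfer time (illustrative numbers).
alpha = 2e-6   # assumed per-message latency: 2 microseconds
beta = 10e9    # assumed bandwidth: 10 GB/s

def transfer_time(size_bytes):
    # fixed latency plus size-proportional bandwidth term
    return alpha + size_bytes / beta

# Ten 100-byte messages pay the latency ten times; one 1000-byte message
# pays it once. Same payload, very different cost.
ten_small = 10 * transfer_time(100)
one_large = transfer_time(1000)
print(ten_small > one_large)  # True
```

This is exactly why the first guideline below says to batch small messages: at small sizes the alpha term dominates, so fewer, larger messages win.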

Practical Guidelines

Message Size    Dominated By    Optimization
< 1 KB          Latency         Batch small messages into fewer large ones
1 KB – 64 KB    Both            Use non-blocking calls to overlap with computation
> 64 KB         Bandwidth       Choose the right network (InfiniBand >> Ethernet)

Communication-Computation Overlap

The key to high-performance MPI code is hiding communication behind computation:

// Post non-blocking receives for halo exchange
MPI_Irecv(ghost_left,  n, MPI_DOUBLE, left,  0, comm, &req[0]);
MPI_Irecv(ghost_right, n, MPI_DOUBLE, right, 1, comm, &req[1]);

// Send boundary data
MPI_Isend(boundary_left,  n, MPI_DOUBLE, left,  1, comm, &req[2]);
MPI_Isend(boundary_right, n, MPI_DOUBLE, right, 0, comm, &req[3]);

// Compute interior (doesn't need ghost data)
compute_interior(grid);

// Wait for ghost data, then compute boundaries
MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
compute_boundaries(grid, ghost_left, ghost_right);

Hybrid MPI + OpenMP

Modern HPC codes commonly use MPI for inter-node communication and OpenMP for intra-node threading. This matches the hardware: MPI messages cross the network between nodes, while OpenMP threads share memory within a node.

#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        compute_chunk(data, tid, omp_get_num_threads());
    }

    // Main thread does MPI communication
    MPI_Allreduce(MPI_IN_PLACE, results, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Thread Safety Levels

Level                   Meaning                                   Use Case
MPI_THREAD_SINGLE       Only one thread exists                    Pure MPI, no OpenMP
MPI_THREAD_FUNNELED     Only the main thread calls MPI            Most hybrid codes
MPI_THREAD_SERIALIZED   Any thread can call MPI, one at a time    Rare
MPI_THREAD_MULTIPLE     Any thread can call MPI concurrently      Advanced, has a performance cost

Slurm Configuration for Hybrid Jobs

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2   # 2 MPI ranks per node
#SBATCH --cpus-per-task=16    # 16 OpenMP threads per rank

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_program

Thread Oversubscription

If you run 4 MPI ranks per node with 16 OpenMP threads each on a 32-core node, you get 64 threads competing for 32 cores. Always ensure ntasks-per-node × cpus-per-task ≤ total cores.
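That arithmetic is worth encoding as a pre-flight check. A minimal sketch (the function name is ours):

```python
# Sanity check for hybrid job layouts: total threads per node must not
# exceed the cores available on each node.

def oversubscribed(ntasks_per_node, cpus_per_task, cores_per_node):
    return ntasks_per_node * cpus_per_task > cores_per_node

print(oversubscribed(2, 16, 32))  # False: 32 threads on 32 cores, fine
print(oversubscribed(4, 16, 32))  # True: 64 threads fighting for 32 cores
```

Running this against your planned #SBATCH values before submitting catches the mistake at zero cost instead of via a mysteriously slow job.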

Common Pitfalls

1. Tag Mismatches

Every MPI_Send includes a tag, and the matching MPI_Recv must specify the same tag (or use MPI_ANY_TAG). A mismatched tag means the receive never finds its message, and the program hangs. This is especially tricky when multiple message types flow between the same pair of ranks.

2. Buffer Reuse Before Wait

With non-blocking calls, the send buffer must not be modified until MPI_Wait confirms the operation is complete. Writing to the buffer while the MPI library is still reading from it causes data corruption — a bug that may only appear intermittently under load.

3. Not All Ranks Calling a Collective

Collectives are synchronization points. If rank 3 takes a different code path and skips an MPI_Bcast that the other ranks call, the program deadlocks. Every collective must be reached by every rank in the communicator, in the same order.

4. Forgetting MPI_Finalize

MPI_Finalize tells the runtime to clean up resources and flush pending messages. Skipping it can cause message loss, zombie processes, or job scheduler errors. Always call it before exiting, even in error-handling paths.

Debugging MPI Programs

Detecting Deadlocks

# Run with timeout to catch hangs
timeout 60 mpirun -np 4 ./my_program

# OpenMPI: enable debug output
mpirun -np 4 --mca mpi_show_mca_params all ./my_program

# Intel MPI: enable statistics
export I_MPI_DEBUG=5
mpirun -np 4 ./my_program

Checking for Errors

MPI functions return error codes that most programs ignore. In debug builds, check them:

int err = MPI_Send(&data, 1, MPI_INT, dest, tag, comm);
if (err != MPI_SUCCESS) {
    char errstr[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(err, errstr, &len);
    fprintf(stderr, "Rank %d: MPI_Send failed: %s\n", rank, errstr);
    MPI_Abort(comm, err);
}

Common MPI Errors

Symptom                    Cause                               Fix
Program hangs silently     Deadlock — mismatched Send/Recv     Reorder sends/receives or use MPI_Sendrecv
MPI_ERR_RANK               Sending to rank ≥ world size        Check MPI_Comm_size before sending
MPI_ERR_TRUNCATE           Receive buffer too small            Match send count with receive count
MPI_ERR_TAG                Tag out of range                    Tags must be 0 to MPI_TAG_UB
Segfault in MPI_Recv       Buffer not allocated                Allocate the receive buffer before calling Recv
Wrong results, no error    Buffer reuse before Wait            Don't modify an Isend buffer until MPI_Wait completes

Key Takeaways

  1. Message passing, not shared memory — MPI processes have private address spaces and communicate by explicitly sending and receiving messages over the network.

  2. Collectives beat manual loops — Broadcast, Scatter, Gather, and AllReduce express common patterns in one call and let the runtime optimize the algorithm.

  3. Non-blocking for performance — Isend/Irecv let you overlap communication with computation, hiding network latency behind useful work.

  4. Deadlock is the classic MPI bug — avoid it by ordering sends and receives carefully, using non-blocking calls, or using MPI_Sendrecv.
