
OpenMP: Shared-Memory Parallel Programming

OpenMP parallel programming: fork-join model, scheduling, data races, false sharing, NUMA thread affinity, and GPU offloading.



What Is OpenMP?

OpenMP (Open Multi-Processing) is a pragma-based API for shared-memory parallel programming in C, C++, and Fortran. Instead of manually spawning threads and managing synchronization primitives, you annotate existing sequential code with compiler directives and the runtime handles thread creation, work distribution, and teardown.

The key distinction from other parallel programming models: MPI targets distributed memory across networked nodes, CUDA targets GPU execution, and OpenMP targets the cores on a single shared-memory machine. In practice, large HPC workloads combine all three — OpenMP for intra-node parallelism, MPI for inter-node communication, and CUDA or OpenMP target directives for GPU offloading.

OpenMP was first standardized in 1997 for Fortran, with C/C++ support arriving in 1998. The specification has evolved through multiple revisions — the current version is 6.0, ratified in November 2024. Every major compiler supports it: GCC, Clang, MSVC, Intel oneAPI, and NVIDIA HPC SDK. Enabling it is typically a single compiler flag: -fopenmp for GCC/Clang or /openmp for MSVC.

The Fork-Join Model

OpenMP’s execution model is fork-join. A program starts as a single initial thread (historically called the "master thread"; OpenMP 5.1 renamed it the primary thread and deprecated the master construct in favor of masked). When it encounters a #pragma omp parallel directive, the runtime forks a team of threads that execute the parallel region concurrently. At the end of the region, all threads join back — an implicit barrier synchronizes them — and execution continues on the initial thread alone.

#include <omp.h>
#include <cstdio>

int main() {
    printf("Serial: thread %d\n", omp_get_thread_num());

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Parallel: thread %d of %d\n", tid, nthreads);
    }
    // implicit barrier here — all threads join

    printf("Serial again: thread %d\n", omp_get_thread_num());
    return 0;
}

The number of threads defaults to the number of available cores but can be controlled via the OMP_NUM_THREADS environment variable, the num_threads(N) clause, or runtime calls like omp_set_num_threads().
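A minimal sketch showing all three control points and their precedence (the clause overrides the runtime call, which overrides the environment variable):

#include <omp.h>
#include <cstdio>

int main() {
    // 1. Environment: OMP_NUM_THREADS=8 ./a.out
    // 2. Runtime call: applies to subsequent parallel regions
    omp_set_num_threads(8);

    // 3. Clause: overrides both, for this region only
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        printf("This team has %d threads\n", omp_get_num_threads());
    }
    return 0;
}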

The theoretical speedup from parallelizing a program is bounded by Amdahl’s Law. If f is the fraction of the program that must remain serial, then the maximum speedup with N threads is:

S(N) = \frac{1}{f + \frac{1 - f}{N}}

Even with infinite threads, the speedup is capped at 1/f. If 10% of your code is serial, maximum speedup is 10x regardless of thread count. For a deeper treatment of parallel scaling limits, see Flynn’s Classification.
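To see the cap numerically, here is a minimal sketch (plain sequential C++, nothing OpenMP-specific) that evaluates the formula for a 10% serial fraction:

#include <cstdio>

// Predicted speedup from Amdahl's Law for serial fraction f and N threads.
static double amdahl(double f, int n) {
    return 1.0 / (f + (1.0 - f) / n);
}

int main() {
    for (int n = 2; n <= 64; n *= 2)
        printf("N = %2d  ->  S = %.2f\n", n, amdahl(0.10, n));
    // Speedup saturates near 1/f = 10 no matter how large N gets.
    return 0;
}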


Work-Sharing Constructs

A #pragma omp parallel region by itself just creates threads that all execute the same code. Work-sharing constructs divide work among the threads in a team so each thread handles a distinct portion.

parallel for

The most common construct. Distributes loop iterations across threads:

#pragma omp parallel for
for (int i = 0; i < N; i++) {
    result[i] = compute(data[i]);
}

The iteration space [0, N) is partitioned among threads. By default, each thread gets a contiguous block of roughly N / T iterations, where T is the thread count.

sections

Assigns distinct code blocks to different threads. Useful when you have a small number of independent tasks rather than a loop:

#pragma omp parallel sections
{
    #pragma omp section
    { load_data(); }

    #pragma omp section
    { initialize_model(); }

    #pragma omp section
    { prepare_logging(); }
}

single and master

single ensures a block runs on exactly one thread (whichever arrives first), with an implicit barrier at the end. master restricts execution to thread 0 with no barrier; OpenMP 5.1 deprecates it in favor of the more general masked construct.
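A minimal sketch of both constructs; single is typical for one-time setup inside a parallel region, while master (or its masked replacement) suits work that only one thread should perform:

#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // Runs on exactly one thread; the others wait at the implicit barrier.
        #pragma omp single
        printf("setup done by thread %d\n", omp_get_thread_num());

        // Runs only on thread 0; no barrier, other threads continue past it.
        #pragma omp master
        printf("status report from thread 0\n");
    }
    return 0;
}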

task and taskloop

Tasks enable parallelism for irregular and recursive workloads that don’t fit the loop model. The taskloop construct partitions loop iterations into tasks:

#pragma omp parallel
#pragma omp single
{
    #pragma omp taskloop grainsize(100)
    for (int i = 0; i < N; i++) {
        process(items[i]);
    }
}


Data Environment

Every variable in an OpenMP parallel region is either shared (one copy, all threads see the same memory) or private (each thread gets its own copy). Getting this wrong is the primary source of OpenMP bugs.

Data-Sharing Clauses

  • shared(x) — All threads read and write the same variable. Default for variables declared outside the parallel region.
  • private(x) — Each thread gets an uninitialized copy. The original variable is unchanged after the region.
  • firstprivate(x) — Like private, but each copy is initialized with the original value.
  • lastprivate(x) — Like private, but the value from the thread executing the last iteration is copied back to the original.
  • reduction(op:x) — Each thread gets a private copy initialized to the identity element for op, and results are combined at the end.

Reduction Example

Computing a sum without reduction requires manual synchronization. With reduction, OpenMP handles everything:

double total = 0.0;

#pragma omp parallel for reduction(+:total)
for (int i = 0; i < N; i++) {
    total += data[i];
}
// total now contains the complete sum

The runtime creates thread-local accumulators, each initialized to 0.0, and combines them with + at the barrier. No locks, no atomics, no data races.

Common pitfall: loop variables in parallel for are implicitly private, but other variables you modify inside the loop are shared by default. Forgetting to mark a scratch variable as private is a classic source of silent data corruption.
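A minimal sketch of that pitfall and its fix; the scratch variable and the two helper functions are hypothetical stand-ins:

#include <cstdio>

// Hypothetical helpers for illustration.
static double heavy_intermediate(double x) { return x * x + 1.0; }
static double finalize(double x)           { return x / 2.0; }

int main() {
    const int N = 1000;
    double data[N], result[N];
    for (int i = 0; i < N; i++) data[i] = i;

    double scratch = 0.0;

    // BUG: scratch is shared by default, so threads overwrite each other's
    // intermediate value between the two statements: a silent data race.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        scratch = heavy_intermediate(data[i]);
        result[i] = finalize(scratch);
    }

    // FIX: give each thread its own copy (or declare scratch inside the loop).
    #pragma omp parallel for private(scratch)
    for (int i = 0; i < N; i++) {
        scratch = heavy_intermediate(data[i]);
        result[i] = finalize(scratch);
    }

    printf("result[N-1] = %f\n", result[N - 1]);
    return 0;
}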

Synchronization

When multiple threads access shared data and at least one writes, you need synchronization to prevent data races.

The Naive Shared Counter

This code has a data race — multiple threads read, increment, and write counter simultaneously:

int counter = 0;

#pragma omp parallel for
for (int i = 0; i < 1000000; i++) {
    counter++;  // DATA RACE: read-modify-write is not atomic
}
// counter is unpredictable, typically less than 1000000

Synchronization Primitives

  • atomic — Hardware-level atomic operation. Fast for simple updates (increment, add, compare-and-swap):
    #pragma omp atomic
    counter++;
  • critical — Mutual exclusion section. Only one thread executes the block at a time. Slower than atomic but supports arbitrary code:
    #pragma omp critical
    {
        shared_map[key] = compute_value();
    }
  • barrier — All threads wait until every thread reaches this point. Implicit at the end of work-sharing constructs unless nowait is specified.
  • ordered — Ensures iterations execute in sequential order within the ordered block. Useful for I/O that must appear in order.
  • omp_lock_t — Explicit lock API for fine-grained control. Use omp_set_lock() / omp_unset_lock().

Choose atomic over critical whenever possible — it avoids the overhead of acquiring a mutex and is often implemented as a single CPU instruction.
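When a single critical section would serialize unrelated updates, the explicit lock API gives finer control, for example one lock per hash bucket. A minimal sketch of the lock lifecycle, using a simple shared accumulator as the protected data:

#include <omp.h>
#include <cstdio>

int main() {
    const int N = 10000;
    omp_lock_t lock;
    omp_init_lock(&lock);

    long long checksum = 0;  // stands in for any shared structure a lock would guard

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        long long v = (long long)i * i;  // independent work: no lock needed here
        omp_set_lock(&lock);             // serialize only the shared update
        checksum += v;
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("checksum = %lld\n", checksum);
    return 0;
}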


Scheduling Strategies

When loop iterations have unequal cost, the default static distribution creates load imbalance — some threads finish early and idle while others are still working. OpenMP provides four scheduling strategies to address this.

static

Divides iterations into contiguous blocks of size ⌈N/T⌉. Lowest overhead, best when iteration cost is uniform. You can specify a chunk size: schedule(static, 64) assigns 64-iteration blocks in round-robin order.
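One way to see the round-robin assignment, as a minimal sketch, is to record which thread executes each iteration:

#include <omp.h>
#include <cstdio>

int main() {
    const int N = 16;
    int owner[N];

    // With 2 threads and chunk size 4: thread 0 gets [0,4) and [8,12),
    // thread 1 gets [4,8) and [12,16): contiguous chunks assigned round-robin.
    #pragma omp parallel for schedule(static, 4) num_threads(2)
    for (int i = 0; i < N; i++)
        owner[i] = omp_get_thread_num();

    for (int i = 0; i < N; i++)
        printf("iteration %2d -> thread %d\n", i, owner[i]);
    return 0;
}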

dynamic

Threads grab chunks of iterations from a shared work queue. When a thread finishes its chunk, it gets the next one. Higher overhead than static but adapts to unequal iteration costs:

#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++) {
    // iterations vary widely in cost
    process_variable_work(i);
}

guided

Like dynamic, but chunk sizes start large and shrink as iterations are consumed. Balances the low overhead of large chunks early on with the fine-grained balancing of small chunks near the end.

auto

Delegates the scheduling decision to the runtime. The implementation may profile execution and adapt. Useful when you lack knowledge about iteration cost distribution.
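When the right strategy is not obvious up front, schedule(runtime) defers the decision to the OMP_SCHEDULE environment variable, so you can compare strategies without recompiling. A minimal sketch, reusing the hypothetical process_variable_work from the dynamic example above:

// Compile once, then experiment:
//   OMP_SCHEDULE="static,64" ./a.out
//   OMP_SCHEDULE="dynamic,4" ./a.out
//   OMP_SCHEDULE="guided"    ./a.out
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < N; i++) {
    process_variable_work(i);
}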

The load imbalance ratio quantifies the problem. If T_max is the time of the slowest thread and T̄ is the average thread time:

\text{Imbalance} = \frac{T_{\max} - \bar{T}}{\bar{T}} \times 100\%

An imbalance above 10–15% typically justifies switching from static to dynamic scheduling.
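One way to measure it, sketched below, is to time each thread's share of a statically scheduled loop with omp_get_wtime() and compare the slowest thread against the average; the uneven workload here is a made-up stand-in:

#include <omp.h>
#include <cstdio>
#include <algorithm>

// Hypothetical uneven workload: later iterations cost more.
static void process_variable_work(int i) {
    volatile double x = 0;
    for (int k = 0; k < i; k++) x += k * 0.5;
}

int main() {
    const int N = 20000;
    double thread_time[256] = {0};  // assumes at most 256 threads

    #pragma omp parallel
    {
        double t0 = omp_get_wtime();
        #pragma omp for schedule(static) nowait  // nowait: time own work, not the barrier
        for (int i = 0; i < N; i++)
            process_variable_work(i);
        thread_time[omp_get_thread_num()] = omp_get_wtime() - t0;
    }

    int nt = omp_get_max_threads();  // assumes the default team size was used
    double tmax = 0.0, tsum = 0.0;
    for (int t = 0; t < nt; t++) {
        tmax = std::max(tmax, thread_time[t]);
        tsum += thread_time[t];
    }
    double tavg = tsum / nt;
    printf("Imbalance: %.1f%%\n", (tmax - tavg) / tavg * 100.0);
    return 0;
}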

False Sharing and the Memory Model

False sharing occurs when threads on different cores write to variables that reside on the same cache line, even though each thread accesses a distinct variable. The hardware cache coherence protocol (MESI/MOESI) forces the cache line to bounce between cores on every write, destroying performance despite the absence of any logical data sharing.

Consider this pattern where each thread writes to its own slot in an array:

int counters[NUM_THREADS];  // likely all on the same cache line

#pragma omp parallel
{
    int tid = omp_get_thread_num();
    for (int i = 0; i < ITERATIONS; i++) {
        counters[tid]++;  // false sharing: adjacent slots share a cache line
    }
}

On a typical x86 processor with 64-byte cache lines, an int[16] array fits entirely in one cache line. All 16 threads contend on the same line, and performance can be 10–50x worse than the correctly padded version.

The fix is padding — ensure each thread’s data occupies its own cache line. For a detailed treatment of cache line mechanics, see CPU Cache Lines.

struct alignas(64) PaddedCounter { int value; };
PaddedCounter counters[NUM_THREADS];

#pragma omp parallel
{
    int tid = omp_get_thread_num();
    for (int i = 0; i < ITERATIONS; i++) {
        counters[tid].value++;  // each counter on its own cache line
    }
}

OpenMP uses a relaxed consistency memory model. Threads are not guaranteed to see each other’s writes immediately. The flush directive (implicit at barriers and critical sections) forces memory visibility. In practice, explicit flush is rarely needed if you use proper synchronization constructs.


Thread Affinity and NUMA

On multi-socket servers, where threads run matters as much as what they compute. A thread accessing memory attached to a remote NUMA node pays 1.5–3x the latency compared to local memory. OpenMP provides environment variables to pin threads to specific cores and control placement.

OMP_PLACES

Defines the set of hardware resources threads can be bound to:

  • OMP_PLACES=cores — One thread per physical core (recommended default).
  • OMP_PLACES=threads — One thread per hardware thread (uses SMT/hyperthreading).
  • OMP_PLACES=sockets — One thread per CPU socket.

OMP_PROC_BIND

Controls how threads are distributed across places:

  • close — Pack threads near the master thread. Maximizes cache sharing, good for workloads with shared data.
  • spread — Distribute threads evenly across all sockets. Maximizes aggregate memory bandwidth.
  • master — Bind all threads to the same place as the master thread.
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
./my_application

For memory-bandwidth-bound workloads (large matrix operations, streaming access patterns), spread across sockets typically delivers the best performance because it utilizes all memory controllers. For latency-sensitive workloads with heavy data sharing, close avoids cross-socket communication. See NUMA Architecture for a deeper look at non-uniform memory access.


Task Dependencies

The task construct with depend clauses enables OpenMP to build a directed acyclic graph (DAG) of task dependencies and execute them with maximum concurrency while respecting ordering constraints.

int x, y, z;

#pragma omp parallel
#pragma omp single
{
    #pragma omp task depend(out: x)
    x = read_input();

    #pragma omp task depend(out: y)
    y = read_weights();

    #pragma omp task depend(in: x, y) depend(out: z)
    z = compute(x, y);

    #pragma omp task depend(in: z)
    write_output(z);
}

The first two tasks run concurrently (no dependency between them). The third task waits for both to complete. The fourth waits for the third. The runtime schedules this DAG across available threads automatically.

Tasks vs. parallel for: Use parallel for when you have a regular loop with uniform iterations. Use tasks when the work is recursive, irregular, or has complex dependency relationships. The classic example is recursive parallelism:

int fib(int n) {
    if (n < 20) return serial_fib(n);  // cutoff to avoid task overhead

    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);

    #pragma omp task shared(y)
    y = fib(n - 2);

    #pragma omp taskwait
    return x + y;
}

The taskwait directive acts as a barrier for child tasks, ensuring both x and y are computed before the addition. The cutoff at n < 20 is critical — creating a task for each recursive call would generate millions of tasks with overhead exceeding the computation itself.

GPU Offloading

OpenMP 4.5+ introduced target directives for offloading computation to accelerators, including GPUs. This provides a portable, pragma-based alternative to CUDA or HIP:

#pragma omp target teams distribute parallel for \
    map(to: input[0:N]) map(from: output[0:N])
for (int i = 0; i < N; i++) {
    output[i] = transform(input[i]);
}

Data Mapping

  • map(to: x) — Copy host data to the device at region entry.
  • map(from: x) — Copy device data back to the host at region exit.
  • map(tofrom: x) — Copy in both directions.
  • map(alloc: x) — Allocate device memory without copying (used for the scratch array in the sketch after this list).
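When several kernels reuse the same arrays, a target data region keeps them resident on the device between kernel launches instead of copying at each one. A minimal sketch, with input, output, tmp, scale, and offset as hypothetical names:

// Arrays stay on the device for the whole region; only map(from: output)
// copies data back to the host when the region ends.
#pragma omp target data map(to: input[0:N]) map(from: output[0:N]) map(alloc: tmp[0:N])
{
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < N; i++)
        tmp[i] = scale(input[i]);    // first kernel: writes device-resident tmp

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < N; i++)
        output[i] = offset(tmp[i]);  // second kernel reuses tmp without any host transfer
}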

OpenMP Target vs. CUDA

OpenMP target is portable across GPUs from NVIDIA, AMD, and Intel. The trade-off is control: CUDA gives you explicit thread blocks, shared memory management, warp-level primitives, and kernel launch configuration. OpenMP target abstracts these details, which simplifies porting existing CPU code but limits the ability to exploit hardware-specific features.

When OpenMP target makes sense: Porting large C/C++ codebases with regular parallelism where maintaining a single source code is more valuable than extracting the last 10–20% of GPU performance. When to use native CUDA: Latency-critical kernels, algorithms requiring warp shuffles or shared memory tiling, or when you need maximum throughput from a specific GPU architecture.

Compiler support varies: NVIDIA’s nvc++ (HPC SDK) and Clang with LLVM offloading have the most mature implementations. GCC’s GPU offloading support is improving but lags behind.

OpenMP in the ML Ecosystem

OpenMP is deeply embedded in the libraries that power machine learning, even when users never write a pragma themselves.

PyTorch uses OpenMP (via ATen) to parallelize CPU tensor operations — element-wise math, reductions, convolutions on CPU, and data preprocessing. The OMP_NUM_THREADS environment variable directly controls how many threads PyTorch uses for these operations.

MKL and OpenBLAS, the BLAS backends for NumPy, SciPy, and PyTorch CPU, use OpenMP internally for matrix multiplication and linear algebra routines. Setting MKL_NUM_THREADS or OPENBLAS_NUM_THREADS controls their thread counts independently.

The multi-process conflict: Python’s multiprocessing and PyTorch’s DataLoader with num_workers > 0 fork child processes. If each child spawns its own OpenMP thread pool (the default), a machine with 32 cores and 8 data loader workers creates 256 threads, causing severe oversubscription and context-switching overhead.

The standard fix:

export OMP_NUM_THREADS=1
python train.py --workers 8

Setting OMP_NUM_THREADS=1 disables OpenMP parallelism within each worker, letting the process-level parallelism of the data loader handle concurrency. For the training loop itself (which runs on GPU), this has no effect since GPU operations bypass OpenMP entirely.

This interaction between OpenMP threading and Python multiprocessing is one of the most common performance pitfalls in ML engineering. If your CPU utilization is at 100% but throughput is low, thread oversubscription via OpenMP is a likely culprit.

Key Takeaways

  1. Fork-join execution — a single initial thread spawns parallel teams at #pragma omp parallel regions and joins at implicit barriers. Speedup is bounded by Amdahl’s Law.

  2. Data sharing is everything — understand shared, private, firstprivate, and reduction clauses. Most OpenMP bugs come from incorrect data sharing.

  3. False sharing destroys performance silently — threads writing to adjacent memory addresses on the same cache line cause constant cache invalidation. Pad your data structures.

  4. Scheduling strategy matters — use static for uniform work, dynamic for variable-cost iterations. Monitor load imbalance to choose correctly.

  5. Thread affinity controls NUMA performance — use OMP_PLACES and OMP_PROC_BIND to pin threads to cores and maximize memory bandwidth on multi-socket systems.

  6. Set OMP_NUM_THREADS=1 in ML pipelines — when using PyTorch DataLoader with multiple workers, disable OpenMP threading to prevent thread oversubscription.
