
C++ Compiler Optimization

C++ compiler optimization deep dive — optimization levels compared with assembly output, auto-vectorization, LTO, PGO, compiler flags reference, and dangerous flags explained.


Why Compiler Optimization Matters

The difference between -O0 and -O2 is typically 5-10x in execution speed. Between -O0 and -O3 with auto-vectorization, it can be 20-40x for numerical code. Understanding what your compiler does — and what it can’t do — is the difference between code that crawls and code that flies.

Modern compilers are remarkably good at optimizing straightforward code. But they’re not magic. They need your help: writing code that’s amenable to optimization, using the right flags, and occasionally providing hints when the compiler can’t prove a transformation is safe.

Optimization Passes

The compiler transforms your code through a pipeline of optimization passes. Each pass looks for a specific pattern and rewrites it into something faster or smaller.

Optimization Levels Compared

Each -O level enables progressively more aggressive optimizations. The difference isn’t just “faster” — the compiler generates fundamentally different assembly at each level.

What Each Level Does

-O0 (No optimization): The compiler translates your C++ almost literally. Every variable lives on the stack. Every function call goes through the full call sequence. This is what you want for debugging — the assembly maps directly to your source.

-O1 (Basic): Variables move to registers. Dead stores are eliminated. Simple control flow is cleaned up. Compilation is still fast, and debugging is mostly possible.

-O2 (Recommended for release): The sweet spot. Enables constant folding, dead code elimination, function inlining, loop-invariant code motion, strength reduction, and dozens more passes. This is the standard for production builds. Most code should never need more.

-O3 (Aggressive): Everything in -O2 plus auto-vectorization (SIMD), aggressive inlining, and loop unrolling. Can actually be slower than -O2 when the larger code causes instruction cache (I-cache) misses. Always benchmark.

-Os (Size optimization): Like -O2 but avoids transformations that increase code size (no loop unrolling, conservative inlining). Excellent for embedded systems and code that’s I-cache sensitive.

-O3 Is Not Always Faster

-O3 enables aggressive inlining and loop unrolling that increase binary size. On code with large working sets, the extra I-cache pressure can make -O3 slower than -O2. Always measure. The right answer is often -O2 -march=native.

What the Compiler Actually Does

Constant Folding and Propagation

The compiler evaluates constant expressions at compile time:

```cpp
// Before optimization
int x = 2 * 3 + 4;
int y = x * 2;

// After constant folding + propagation
int x = 10;  // 2*3+4 computed at compile time
int y = 20;  // x*2 propagated and folded
```

Dead Code Elimination

Unreachable code is removed entirely:

```cpp
void process(int mode) {
    if constexpr (DEBUG_MODE) {  // false at compile time
        log_detailed_state();    // entire block removed
    }
    do_work();                   // only this remains
}
```

Function Inlining

Small functions are copied into the caller, eliminating call overhead:

```cpp
// Before: function call overhead (push args, call, return)
inline int square(int x) { return x * x; }
int y = square(5);

// After: no function call at all
int y = 25;  // constant folded after inlining
```

The compiler decides whether to inline based on function size, call frequency, and optimization level. The inline keyword is a suggestion, not a command — the compiler often ignores it and inlines functions you didn’t mark.

Loop-Invariant Code Motion (LICM)

Computations that don’t change across loop iterations are hoisted out:

```cpp
// Before
for (int i = 0; i < n; i++)
    result[i] = data[i] * config.scale_factor;  // config.scale_factor is constant

// After LICM
float sf = config.scale_factor;  // hoisted out of loop
for (int i = 0; i < n; i++)
    result[i] = data[i] * sf;
```

Strength Reduction

Expensive operations are replaced with cheaper equivalents:

```cpp
// Before
for (int i = 0; i < n; i++)
    sum += arr[i * 4];  // multiplication each iteration

// After strength reduction
int idx = 0;
for (int i = 0; i < n; i++, idx += 4)
    sum += arr[idx];    // addition instead of multiplication
```

Auto-Vectorization

At -O3 (or -O2 -ftree-vectorize), the compiler attempts to convert scalar loops into SIMD operations. When it succeeds, throughput multiplies by the vector width (4x for SSE, 8x for AVX2, 16x for AVX-512). When it fails, you get no speedup and no warning unless you ask for one.

Helping the Compiler Vectorize

```cpp
// 1. Use __restrict__ to promise no aliasing
void scale(float* __restrict__ out, const float* __restrict__ in,
           float f, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * f;  // vectorizes freely
}

// 2. Use #pragma to force vectorization
#pragma GCC optimize("O3,tree-vectorize")
void compute(float* data, int n) {
    #pragma GCC ivdep  // ignore assumed vector dependencies
    for (int i = 0; i < n; i++)
        data[i] = process(data[i]);
}

// 3. Check what the compiler did
// GCC:   g++ -O2 -march=native -ftree-vectorize -fopt-info-vec-optimized source.cpp
// Clang: -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
```

Link-Time Optimization (LTO)

Normal compilation optimizes each .cpp file independently. The compiler can’t inline a function from utils.cpp into main.cpp because it never sees both at once.

LTO defers optimization to link time, when all translation units are visible. This enables:

  • Cross-file inlining — inline utils::compute() into main() even though they’re in different files
  • Dead function elimination — remove functions that are defined but never called across the whole program
  • Interprocedural constant propagation — if main() always passes mode=3, the compiler can specialize process(mode) for that value
```bash
# Compile with LTO
g++ -O2 -flto file1.cpp file2.cpp file3.cpp -o binary

# Or with separate compilation (for build systems)
g++ -O2 -flto -c file1.cpp -o file1.o
g++ -O2 -flto -c file2.cpp -o file2.o
g++ -O2 -flto file1.o file2.o -o binary  # LTO happens here
```

LTO typically provides 5-15% additional improvement over -O2 alone. The cost is significantly slower link times (the linker is now running optimization passes).

Profile-Guided Optimization (PGO)

PGO uses real runtime behavior to guide optimization decisions. The compiler instruments your code, you run it with representative data, then recompile using the collected profile. The result: better branch prediction, smarter inlining, and hot/cold code separation.

```bash
# Step 1: Compile with instrumentation
g++ -O2 -fprofile-generate source.cpp -o binary_instrumented

# Step 2: Run with representative workload
./binary_instrumented --input training_data.bin

# Step 3: Recompile using the profile
g++ -O2 -fprofile-use source.cpp -o binary_optimized
```

PGO typically provides 10-20% improvement on branch-heavy code (parsers, interpreters, compilers themselves). Google uses PGO for Chrome, LLVM uses it for Clang, and most game engines use it for release builds.

Compiler Flags Reference

A typical release configuration combines several of the flags covered above: -O2 -march=native -flto -DNDEBUG, plus -g if you still want debug symbols in the binary. Swap -O2 for -O3 only after benchmarking.

Dangerous Flags

-ffast-math

This flag tells the compiler to ignore IEEE 754 floating-point rules:

  • Assumes no NaN or Inf: isnan() can be optimized to always return false
  • Allows reassociation: (a + b) + c can become a + (b + c) (not equivalent in floating point!)
  • Assumes no signed zeros: 0.0 and -0.0 are treated as interchangeable
  • Enables reciprocal approximation: a / b can become a * (1/b) (less precise)

This can speed up numerical code by 20-50% but silently produces wrong results for code that depends on IEEE semantics. Never use it for financial calculations, scientific simulations, or any code that checks for NaN.

-fno-exceptions and -fno-rtti

Disabling exceptions removes the exception-handling tables (making binaries smaller and slightly faster), but throw and try no longer compile in your code, and an exception propagating in from code built with exceptions terminates the program. Disabling RTTI removes dynamic_cast and typeid. Both flags are common in game engines and embedded systems where the overhead is unacceptable.

Measuring Optimization Impact

Compiler Reports

```bash
# GCC: detailed optimization report
g++ -O2 -fopt-info-all=report.txt source.cpp

# Clang: what was and wasn't optimized (regex quoted for the shell)
clang++ -O2 -Rpass='.*' -Rpass-missed='.*' source.cpp

# Compilation time per pass
g++ -O2 -ftime-report source.cpp
```

Godbolt Compiler Explorer

godbolt.org lets you paste C++ code and instantly see the assembly output from any compiler version. This is the fastest way to understand what your compiler is doing. Use it to:

  • Compare -O2 vs -O3 assembly for a specific function
  • Check if your loop was vectorized
  • See if the compiler eliminated a branch
  • Compare GCC vs Clang output for the same code

Key Takeaways

  1. -O2 is the standard — it enables most optimizations with a good compile-time tradeoff. Start here, not -O3.

  2. -O3 adds auto-vectorization — can be 8x faster for numerical loops, but can also be slower due to I-cache pressure. Always benchmark.

  3. Auto-vectorization needs help — use __restrict__ to resolve aliasing, avoid loop-carried dependencies, and check compiler remarks.

  4. LTO unlocks cross-file optimization — 5-15% gains from inlining and dead code elimination across translation units.

  5. PGO uses runtime data — 10-20% gains from better branch prediction and hot/cold code separation.

  6. -ffast-math is dangerous — 20-50% faster floating point, but breaks IEEE 754. Never use for financial or scientific code.

  7. Use Godbolt — godbolt.org shows you exactly what assembly your compiler generates. Use it before guessing.
