
CPU Pipeline Architecture

Deep dive into CPU pipeline architecture covering 5-stage RISC pipelines, data hazards, control hazards, superscalar execution, and out-of-order processing.

Modern CPUs achieve high performance through sophisticated pipeline architectures that enable instruction-level parallelism. This page walks through pipelining from first principles — why it exists, how it breaks, and how real CPUs paper over the breaks with forwarding, prediction, and out-of-order execution.

Why pipeline at all?

A non-pipelined CPU finishes one instruction before starting the next. Each instruction spends ~5 cycles passing through fetch, decode, execute, memory, and write-back. Three instructions take 15 cycles.

Pipelining overlaps the stages: while instruction 1 is in EX, instruction 2 is in ID, instruction 3 is in IF. Same hardware, same work per instruction — but the throughput jumps.

Without pipelining — one instruction at a time

15 cycles · 0.20 IPC

With pipelining — stages overlap

7 cycles · 0.43 IPC · 2.1× speedup

Same three instructions. Less than half the cycles. That's the entire point of pipelining.
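The cycle counts above follow from simple arithmetic, sketched here in Python. The function names are made up for illustration, and the model assumes an ideal pipeline with one cycle per stage and no stalls.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction walks through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int = 5) -> int:
    """The first instruction needs n_stages cycles; each later one finishes one cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n = 3
slow = cycles_unpipelined(n)
fast = cycles_pipelined(n)
print(slow, fast)                                  # 15 7
print(f"IPC without pipelining: {n / slow:.2f}")   # 0.20
print(f"IPC with pipelining:    {n / fast:.2f}")   # 0.43
print(f"Speedup:                {slow / fast:.1f}x")  # 2.1x
```

As the instruction count grows, the fill cost of `n_stages - 1` cycles is amortized and the pipelined IPC approaches 1.0.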

The 5-stage RISC pipeline

The canonical RISC pipeline divides instruction execution into five stages, each one cycle long. Between every pair of stages sits a pipeline register that latches the previous stage's output. Once the pipeline is full, one instruction completes every cycle.

  • IF (Instruction Fetch): PC → Memory. Fetch the next instruction word.
  • ID (Instruction Decode): Decode the instruction. Read source registers.
  • EX (Execute): ALU operates on inputs. Compute addresses for loads/stores.
  • MEM (Memory Access): Read or write data memory (for loads / stores).
  • WB (Write Back): Write the ALU result or loaded value back to the register file.

Between each pair of stages sits a pipeline register that latches the previous stage's output for the next stage to consume on the next clock edge.

Steady-state: one instruction completes per cycle
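The steady-state overlap can be drawn as a small timing chart. The following is a minimal sketch that assumes an ideal five-stage pipeline with no stalls; `pipeline_diagram` and `render` are illustrative helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions: int) -> list[list[str]]:
    """Return one row per instruction and one column per cycle ('.' means not in the pipeline)."""
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_cycles
        # Instruction i enters IF in cycle i and advances one stage per cycle.
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append(row)
    return rows


def render(rows: list[list[str]]) -> str:
    lines = []
    for i, row in enumerate(rows):
        cells = " ".join(f"{cell:<3}" for cell in row)
        lines.append(f"i{i + 1}: {cells}")
    return "\n".join(lines)


print(render(pipeline_diagram(3)))
```

Reading any single column of the chart shows which stage each in-flight instruction occupies in that cycle. In cycle 5, for example, the first instruction is in WB, the second in MEM, and the third in EX.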

Hazards: when pipelining gets messy

An instruction's stage may need data, control flow, or hardware that another in-flight instruction hasn't released. These dependency conflicts are called hazards. There are three kinds, in roughly decreasing severity for modern CPUs.

Data hazard · deep dive

LOAD then dependent ADD

LOAD R2, [mem]      // R2 won't be written back until cycle 5
ADD  R4, R2, R3     // reads R2 — needs the result early
Broken — no forwarding, two stalls
Fixed — MEM→EX forwarding, one stall

Forwarding routes the loaded value directly from the MEM stage of the LOAD into the EX stage of the ADD, skipping the wait for write-back. Stalls (gray ·) shrink from two to one, so the ADD finishes one cycle earlier.
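The stall counts can be reproduced with a small model. This sketch assumes the classic five-stage timing: without forwarding, the consumer reads the register file in ID and the register file is written in the first half of a cycle and read in the second half; with forwarding, a value is available at the end of MEM for loads and at the end of EX for ALU operations. The function name and parameters are illustrative.

```python
STAGE_INDEX = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}


def stalls_needed(producer_is_load: bool, forwarding: bool, distance: int = 1) -> int:
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    Cycles are counted from the producer's IF (cycle 0).
    """
    if forwarding:
        # The value leaves the producer at the end of MEM (load) or EX (ALU op),
        # and the consumer needs it at the start of its own EX stage.
        ready_after = STAGE_INDEX["MEM"] if producer_is_load else STAGE_INDEX["EX"]
        earliest = ready_after + 1
        needed_at = distance + STAGE_INDEX["EX"]
    else:
        # The consumer reads the register file in ID; assumed split-cycle
        # register file lets it read in the same cycle the producer is in WB.
        earliest = STAGE_INDEX["WB"]
        needed_at = distance + STAGE_INDEX["ID"]
    return max(0, earliest - needed_at)


print(stalls_needed(producer_is_load=True, forwarding=False))   # 2
print(stalls_needed(producer_is_load=True, forwarding=True))    # 1
print(stalls_needed(producer_is_load=False, forwarding=True))   # 0
```

Under these assumptions, forwarding removes the stall entirely for ALU-to-ALU dependencies, but a load followed immediately by its consumer still costs one cycle because the data does not exist until MEM finishes.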

Control hazard
BEQ  R1, R2, label
ADD  R3, R4, R5    // fetched speculatively
Symptom: Branch outcome unknown until EX. Next instructions may need to be squashed.
Fix: Branch prediction + speculative fetch; small misprediction penalty when wrong.
Structural hazard
i1: MEM (load)
i2: IF              // wants the same memory port
Symptom: Two instructions need one shared hardware resource at the same cycle.
Fix: Split I/D caches (Harvard), duplicate ALUs, or stall the loser one cycle.
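For the control hazard, the cost of mispredictions is often estimated with a simple cycles-per-instruction formula. The sketch below is a first-order model, not a measurement: the branch fraction, misprediction rates, and two-cycle penalty (a branch resolved in EX of the five-stage pipeline) are assumed values chosen for illustration.

```python
def effective_cpi(branch_fraction: float,
                  mispredict_rate: float,
                  penalty_cycles: float,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction once squashed wrong-path work is counted."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% of instructions are branches, 2-cycle penalty.
weak_predictor = effective_cpi(0.20, 0.40, 2)
strong_predictor = effective_cpi(0.20, 0.05, 2)
print(f"CPI at 40% mispredicts: {weak_predictor:.2f}")
print(f"CPI at  5% mispredicts: {strong_predictor:.2f}")
```

The same formula shows why deeper pipelines lean harder on prediction: the penalty term grows with the number of stages between fetch and branch resolution.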

Superscalar: doing more per cycle

A single pipeline tops out at IPC = 1.0 (one instruction completes per cycle in steady state). Superscalar designs add multiple pipelines side by side so the front-end fetches and issues more than one instruction per cycle.

Two pipelines running in parallel

Pipeline 0 · Pipeline 1

Scalar IPC: 1.0 · Superscalar IPC: 1.8 · Cycles (scalar): 8 · Cycles (super, 8 instr): 9

Two independent execution pipelines fetch and execute up to two instructions per cycle.

Real superscalar CPUs need register renaming and a scheduler to keep both pipelines fed when dependencies appear. That's where out-of-order execution comes in.
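To see why dependencies keep a two-wide machine from reaching IPC = 2, here is a minimal sketch of in-order dual issue. It is a simplified model under stated assumptions: every instruction has one-cycle latency, results from the previous cycle are available through forwarding, and instructions issue together only if none reads or rewrites a register written earlier in the same group. `Instr` and `issue_cycles` are illustrative names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str
    srcs: tuple[str, ...]


def issue_cycles(program: list[Instr], width: int = 2) -> int:
    """Count issue cycles for an in-order machine that issues up to `width` per cycle."""
    cycles = 0
    i = 0
    while i < len(program):
        written: set[str] = set()
        issued = 0
        while i < len(program) and issued < width:
            ins = program[i]
            # Stop the group on a RAW or WAW conflict with an earlier group member.
            if written & set(ins.srcs) or ins.dest in written:
                break
            written.add(ins.dest)
            i += 1
            issued += 1
        cycles += 1
    return cycles


independent = [Instr("r1", ("a", "b")), Instr("r2", ("c", "d")),
               Instr("r3", ("e", "f")), Instr("r4", ("g", "h"))]
chain = [Instr("r1", ("a", "b")), Instr("r2", ("r1", "c")),
         Instr("r3", ("r2", "d")), Instr("r4", ("r3", "e"))]

print(issue_cycles(independent))  # 2
print(issue_cycles(chain))        # 4
```

The independent program sustains two instructions per cycle, while the dependent chain falls back to one per cycle even though the second pipeline is idle.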

Out-of-order execution

Modern high-performance CPUs execute instructions as soon as their operands are ready, regardless of program order. A reorder buffer (ROB) puts the results back in program order before they become visible. This decouples scheduling from semantics: any execution order that respects data dependencies is legal.

Out-of-Order pipeline

  1. Fetch: Read instruction stream.
  2. Decode: Parse opcodes, operands.
  3. Rename: Map architectural → physical regs. Removes false dependencies.
  4. Issue Queue: Hold instructions until operands ready. Wakes up ready instructions.
  5. Execution Units: ALU · FPU · Load/Store. Run in parallel, out of program order.
  6. Reorder Buffer (ROB): Hold results in program order. Bookkeeping for in-order retire.
  7. Commit: Retire instructions in program order. State changes become architecturally visible.

Front-end keeps the back-end fed by renaming and queueing. Execution runs in any order that satisfies dependencies. The ROB and commit stage put the world back in program order before anything becomes visible.
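The in-order retirement rule can be sketched in a few lines. This is a toy model, assuming an unbounded buffer and string tags for instructions; a real ROB is a fixed-size circular buffer that also carries results and exception state. The class and method names are illustrative.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: instructions complete in any order but retire in program order."""

    def __init__(self) -> None:
        self._entries: deque[str] = deque()  # program order, oldest on the left
        self._done: set[str] = set()

    def dispatch(self, tag: str) -> None:
        self._entries.append(tag)

    def complete(self, tag: str) -> None:
        self._done.add(tag)

    def commit(self) -> list[str]:
        """Retire the longest finished prefix; stop at the first unfinished entry."""
        retired = []
        while self._entries and self._entries[0] in self._done:
            tag = self._entries.popleft()
            self._done.discard(tag)
            retired.append(tag)
        return retired


rob = ReorderBuffer()
for tag in ("i1", "i2", "i3"):
    rob.dispatch(tag)

rob.complete("i3")
print(rob.commit())  # []  (i3 is done, but i1 and i2 are older and still pending)
rob.complete("i1")
print(rob.commit())  # ['i1']
rob.complete("i2")
print(rob.commit())  # ['i2', 'i3']
```

The youngest instruction finished first, yet nothing became visible until every older instruction had completed.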

Key components:

  • Issue queue (reservation stations) — hold decoded instructions until their operands are ready.
  • Register renaming — map architectural registers to a larger pool of physical registers, eliminating false dependencies (WAR, WAW).
  • Reorder buffer (ROB) — track completion in program order; commit retires instructions atomically.
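Register renaming from the list above can be illustrated with a mapping table. This is a minimal sketch that assumes an unlimited supply of physical registers and never frees them; real hardware draws from a finite free list and reclaims registers at commit. The `rename` function and the `p0`, `p1`, ... naming are illustrative.

```python
def rename(program: list[tuple[str, str, str]], arch_regs: list[str]) -> list[tuple[str, str, str]]:
    """Rewrite (dest, src1, src2) triples so every write gets a fresh physical register."""
    mapping = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for dest, src1, src2 in program:
        # Sources use the current mapping, which preserves true (RAW) dependencies.
        new_src1, new_src2 = mapping[src1], mapping[src2]
        # Each destination gets a new physical register, which removes WAR and WAW.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], new_src1, new_src2))
    return renamed


program = [
    ("R1", "R2", "R3"),  # i1 writes R1
    ("R4", "R1", "R5"),  # i2 reads R1 (RAW on i1)
    ("R1", "R6", "R7"),  # i3 writes R1 again (WAW with i1, WAR with i2)
    ("R8", "R1", "R4"),  # i4 reads the new R1 and R4
]
arch = [f"R{i}" for i in range(1, 9)]
for line in rename(program, arch):
    print(line)
```

After renaming, the first and third instructions write different physical registers (`p8` and `p10`), so the third no longer has to wait for the second to read the old value. The true dependency from the first to the second survives as a shared `p8`.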

Modern CPU examples

| Architecture | Pipeline depth | Issue width | Out-of-order | Branch prediction | Typical use |
|---|---|---|---|---|---|
| Intel x86-64 | 14–19 | 4–6 wide | Aggressive | TAGE + perceptron | Desktop, server, workstation |
| ARM Cortex-A | 8–15 | 2–6 wide | A15+ (in-order on small cores) | Tournament + RAS | Mobile, embedded, server |
| RISC-V (Boom, SiFive) | 5–10 | 1–4 wide | Configurable | TAGE, simpler variants | Research, custom silicon, education |

Try it — interactive pipeline playground

Pick a scenario, flip forwarding off, watch stalls appear. Turn prediction off and see how a branch costs you cycles. Switch pipeline depth from 5 to 10 and see throughput stay flat (latency increases, but steady-state IPC doesn't).

Practical implications

For software developers

  • Branches are expensive. Minimize unpredictable branches in hot paths; in a branch-heavy loop they can be the difference between IPC ≈ 1 and IPC ≈ 0.3.
  • Data locality matters. Cache misses stall the pipeline; pipeline tricks can't paper over a 200-cycle DRAM latency.
  • ILP-friendly code. Independent operations let superscalar back-ends shine; long dependency chains serialize execution.
  • Profile, don't guess. Hardware performance counters report stalls, mispredictions, and IPC directly.
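The point about dependency chains can be made concrete by computing the critical path of a small instruction graph. The sketch below assumes a machine wide enough that only dependencies limit it, and a three-cycle add latency chosen purely for illustration. `critical_path` is an illustrative helper.

```python
def critical_path(latencies: list[int], deps: list[list[int]]) -> int:
    """Longest dependency chain in cycles.

    deps[i] lists the indices of earlier instructions that instruction i waits on.
    """
    finish: list[int] = []
    for i, latency in enumerate(latencies):
        start = max((finish[d] for d in deps[i]), default=0)
        finish.append(start + latency)
    return max(finish, default=0)


ADD_LATENCY = 3  # assumed latency, for illustration only

# Summing 8 values into one accumulator: every add waits on the previous one.
one_acc = [[]] + [[i - 1] for i in range(1, 8)]
print(critical_path([ADD_LATENCY] * 8, one_acc))  # 24

# Two accumulators: adds alternate between chains, plus one final combining add.
two_acc = [[], []] + [[i - 2] for i in range(2, 8)] + [[6, 7]]
print(critical_path([ADD_LATENCY] * 9, two_acc))  # 15
```

Splitting the sum into two chains adds an instruction but shortens the critical path, which is the kind of independence a superscalar back-end can exploit.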

For system designers

  • Depth is a tradeoff. Deeper pipelines clock higher but pay larger mispredict penalties.
  • Power scales with stages. Every pipeline register adds clocked state that draws both dynamic and static power.
  • Branch prediction is load-bearing. Modern designs run >95% prediction accuracy because every miss is dozens of wasted cycles.
  • Memory hierarchy is half the battle. Pipeline efficiency only matters if the cache keeps it fed.
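The depth tradeoff can be explored with a toy model. Everything numeric here is an assumption for illustration: total logic delay of 10 ns split evenly across stages, 0.5 ns of latch overhead per stage, 20% branches, a 10% misprediction rate, and a misprediction penalty equal to the pipeline depth.

```python
def time_per_instruction(depth: int,
                         logic_ns: float = 10.0,
                         latch_ns: float = 0.5,
                         branch_fraction: float = 0.2,
                         mispredict_rate: float = 0.1) -> float:
    """Average nanoseconds per instruction for a pipeline of the given depth."""
    # Deeper pipelines shorten the cycle, but latch overhead does not shrink.
    cycle_ns = logic_ns / depth + latch_ns
    # Assumed: a mispredict flushes roughly the whole pipeline.
    cpi = 1.0 + branch_fraction * mispredict_rate * depth
    return cycle_ns * cpi


depths = range(1, 61)
best = min(depths, key=time_per_instruction)
for d in (1, 5, best, 60):
    print(f"depth {d:>2}: {time_per_instruction(d):.3f} ns/instruction")
```

With these numbers the best depth falls strictly inside the range: very shallow pipelines lose to a slow clock, and very deep ones lose to flush penalties and latch overhead. Changing the assumed misprediction rate moves the optimum, which is one way to see why prediction accuracy and depth are designed together.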

Common misconceptions

  1. "Deeper pipelines are always faster." Not true — past a point, branch mispredictions and clock-distribution overhead eat the gains.
  2. "Out-of-order eliminates all stalls." Memory latency, long dependency chains, and limited physical register pools still serialize execution.
  3. "Superscalar means N× faster." Real workloads have dependencies and resource contention; 4-wide CPUs rarely sustain IPC > 2.
  4. "Modern CPUs execute sequentially." They aggressively reorder, speculate, and parallelize. The ROB makes it look sequential.

Further reading

  • Patterson & Hennessy: Computer Organization and Design
  • Shen & Lipasti: Modern Processor Design
  • Intel / AMD / ARM optimization guides
  • Agner Fog's optimization manuals
