CPU Pipeline Architecture

Modern CPUs achieve high performance through sophisticated pipeline architectures that enable instruction-level parallelism. This page walks through pipelining from first principles — why it exists, how it breaks, and how real CPUs paper over the breaks with forwarding, prediction, and out-of-order execution.

Why pipeline at all?

A non-pipelined CPU finishes one instruction before starting the next. Each instruction spends ~5 cycles passing through fetch, decode, execute, memory, and write-back. Three instructions take 15 cycles.

Pipelining overlaps the stages: while instruction 1 is in EX, instruction 2 is in ID, instruction 3 is in IF. Same hardware, same work per instruction — but the throughput jumps.

Without pipelining, one instruction at a time: 15 cycles, 0.20 IPC.

With pipelining, stages overlap: 7 cycles, 0.43 IPC, 2.1× speedup.

Same three instructions. Less than half the cycles. That's the entire point of pipelining.
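The cycle counts above follow from two small formulas, sketched here under the assumption of an ideal pipeline with no stalls. The function names are illustrative, not from the original page.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """The first instruction fills the pipeline; each later one finishes one cycle after."""
    return n_stages + (n_instructions - 1)


n = 3
slow = unpipelined_cycles(n)
fast = pipelined_cycles(n)
print(f"unpipelined: {slow} cycles, IPC {n / slow:.2f}")
print(f"pipelined:   {fast} cycles, IPC {n / fast:.2f}, speedup {slow / fast:.1f}x")
```

For long instruction streams the fill cost of n_stages - 1 cycles becomes negligible, so the ideal speedup approaches the number of stages.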

The 5-stage RISC pipeline

The canonical RISC pipeline divides instruction execution into five stages, each one cycle long. Between every pair of stages sits a pipeline register that latches the previous stage's output. Once the pipeline is full, one instruction completes every cycle.

IF (Instruction Fetch): PC → Memory. Fetch the next instruction word.
ID (Instruction Decode): Decode the instruction. Read source registers.
EX (Execute): ALU operates on inputs. Compute addresses for loads/stores.
MEM (Memory Access): Read or write data memory (for loads / stores).
WB (Write Back): Write the ALU result or loaded value back to the register file.

Between each pair of stages sits a pipeline register that latches the previous stage's output for the next stage to consume on the next clock edge.

Steady-state: one instruction completes per cycle
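As a rough illustration of the steady state, the sketch below prints a text timing diagram for an ideal five-stage pipeline. It assumes no hazards and one instruction entering IF per cycle. `pipeline_diagram` is a hypothetical helper written for this page.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions: int) -> list[str]:
    """Return one text row per instruction; instruction i enters IF in cycle i + 1."""
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        cells = ["."] * total_cycles          # "." marks a cycle with no stage
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage              # each stage is one cycle later
        rows.append(f"i{i + 1}: " + " ".join(f"{c:<3}" for c in cells))
    return rows


for row in pipeline_diagram(4):
    print(row)
```

Reading any column from cycle 5 onward shows all five stages busy at once, which is why one instruction can complete per cycle.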

Hazards: when pipelining gets messy

An instruction's stage may need data, control flow, or hardware that another in-flight instruction hasn't released. These dependency conflicts are called hazards. There are three kinds, in roughly decreasing severity for modern CPUs.

Data hazard · deep dive

LOAD then dependent ADD

LOAD R2, [mem]      // R2 won't be written back until cycle 5
ADD  R4, R2, R3     // reads R2 — needs the result early

Broken (no forwarding): the ADD stalls for two cycles.

Fixed (MEM→EX forwarding): the ADD stalls for one cycle.

Forwarding routes the loaded value directly from the MEM stage of the LOAD into the EX stage of the ADD, skipping the wait for write-back. Stalls shrink from two to one, so the ADD finishes one cycle earlier.
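The stall counts can be checked with a small timing model. This is a sketch, not a full simulator. It assumes the LOAD enters IF in cycle 1, the ADD follows one cycle behind, and the register file writes in the first half of a cycle and reads in the second half. `load_use_stalls` is an illustrative name.

```python
def load_use_stalls(forwarding: bool) -> int:
    """Stall cycles for an ADD that immediately follows the LOAD it depends on."""
    # LOAD enters IF in cycle 1: IF=1, ID=2, EX=3, MEM=4, WB=5.
    load_mem, load_wb = 4, 5
    # With no hazard, the ADD (one cycle behind) would be in ID=3 and EX=4.
    add_id, add_ex = 3, 4
    if forwarding:
        # The loaded value leaves MEM at the end of cycle 4,
        # so the ADD's EX can start in cycle 5 at the earliest.
        earliest_ex = load_mem + 1
        return max(0, earliest_ex - add_ex)
    # Assumption: write-first-half / read-second-half register file,
    # so the ADD's ID may share a cycle with the LOAD's WB.
    earliest_id = load_wb
    return max(0, earliest_id - add_id)


for fwd in (False, True):
    print(f"forwarding={fwd}: {load_use_stalls(fwd)} stall cycle(s)")
```

The one remaining stall with forwarding is the classic load-use delay: the value does not exist until the end of MEM, so no bypass path can deliver it sooner.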

Control hazard

BEQ  R1, R2, label
ADD  R3, R4, R5    // fetched speculatively

Symptom: Branch outcome unknown until EX. Next instructions may need to be squashed.

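Fix: Branch prediction + speculative fetch. A misprediction squashes the wrongly fetched instructions, which costs a couple of cycles on a short pipeline like this one and considerably more on deep pipelines.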
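A common back-of-the-envelope model for the cost of control hazards adds the expected misprediction penalty to a base CPI of 1. The sketch below uses that model. The branch fraction, misprediction rate, and penalties are illustrative assumptions, not measurements.

```python
def effective_cpi(branch_fraction: float, mispredict_rate: float,
                  penalty_cycles: float, base_cpi: float = 1.0) -> float:
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% of instructions are branches, 5% of those mispredict.
for penalty in (2, 15):
    cpi = effective_cpi(branch_fraction=0.20, mispredict_rate=0.05,
                        penalty_cycles=penalty)
    print(f"penalty {penalty:>2} cycles -> CPI {cpi:.2f}, IPC {1 / cpi:.2f}")
```

The same misprediction rate hurts far more as the penalty grows, which is why deeper pipelines lean harder on accurate prediction.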

Structural hazard

i1: MEM (load)
i2: IF              // wants the same memory port

Symptom: Two instructions need one shared hardware resource at the same cycle.

Fix: Split I/D caches (Harvard), duplicate ALUs, or stall the loser one cycle.
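To make the memory-port conflict concrete, here is a minimal sketch that lists the cycles in which a load's MEM stage and a younger instruction's IF stage would collide on a single shared port. It assumes an otherwise ideal five-stage pipeline and ignores the knock-on effect of the stalls themselves. `port_conflicts` is a hypothetical helper.

```python
def port_conflicts(n_instructions: int, loads: set[int]) -> list[int]:
    """Cycles where IF and a load's MEM both need one unified memory port.

    Instruction i (0-based) is in IF at cycle i + 1 and in MEM at cycle i + 4.
    """
    conflicts = []
    for i in sorted(loads):
        mem_cycle = i + 4
        fetching = mem_cycle - 1      # index of the instruction in IF that cycle
        if fetching < n_instructions:  # is there still something to fetch?
            conflicts.append(mem_cycle)
    return conflicts


# Six instructions; instructions 0 and 2 are loads.
print(port_conflicts(6, {0, 2}))
```

With split instruction and data caches the two accesses go to different ports, so the list would be empty by construction.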

Superscalar: doing more per cycle

A single pipeline tops out at IPC = 1.0 (one instruction completes per cycle in steady state). Superscalar designs add multiple pipelines side by side so the front-end fetches and issues more than one instruction per cycle.

Two pipelines running in parallel (figure): Pipeline 0 and Pipeline 1 each carry four instructions, i1 through i4, through the five stages. Scalar IPC: 1.0. Superscalar IPC: 1.8.

Two independent execution pipelines fetch and execute two instructions per cycle. Real superscalar CPUs add register renaming, scheduler logic, and out-of-order issue to keep both pipelines fed when dependencies appear. That's where out-of-order execution comes in.
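The effect of dependencies on a two-wide machine can be seen in a toy in-order issue model. This sketch assumes single-cycle latency, full forwarding, and no structural limits beyond the issue width. The instruction encoding `(dest, srcs)` and the name `issue_cycles` are illustrative.

```python
def issue_cycles(program, width):
    """In-order issue: return the cycle in which each instruction issues."""
    ready_at = {}                 # register -> first cycle its value can be used
    cycles = []
    cycle, slots_used = 1, 0
    for dest, srcs in program:
        earliest = max([ready_at.get(r, 1) for r in srcs], default=1)
        if earliest > cycle:      # operand not ready: stall until it is
            cycle, slots_used = earliest, 0
        if slots_used == width:   # all issue slots taken this cycle
            cycle, slots_used = cycle + 1, 0
        cycles.append(cycle)
        slots_used += 1
        ready_at[dest] = cycle + 1  # result usable one cycle later
    return cycles


independent = [("r1", ["a"]), ("r2", ["b"]), ("r3", ["c"]), ("r4", ["d"])]
chain = [("r1", ["a"]), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
print(issue_cycles(independent, width=2))
print(issue_cycles(chain, width=2))
```

Independent instructions pair up two per cycle, while the dependency chain issues one per cycle no matter how wide the machine is.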

Out-of-order execution

Modern high-performance CPUs execute instructions as soon as their operands are ready, regardless of program order. A reorder buffer (ROB) puts the results back in program order before they become visible. This decouples scheduling from semantics: any execution order that respects data dependencies is legal.

Out-of-Order pipeline

Fetch: read the instruction stream.
Decode: parse opcodes and operands.
Rename: map architectural registers to physical registers; removes false dependencies.
Issue Queue: hold instructions until their operands are ready; wake up and schedule ready instructions.
Execution Units: ALU × N · FPU × N · Load/Store × N. Run in parallel; instructions execute as soon as their operands are ready, not in program order.
Reorder Buffer (ROB): hold completed results in program order until they are ready to retire; bookkeeping for in-order retire.
Commit: retire instructions in program order; state changes become architecturally visible; flush on mispredict.

Front-end keeps the back-end fed by renaming and queueing. Execution runs in any order that satisfies dependencies. The ROB and commit stage put the world back in program order before anything becomes visible.
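The split between out-of-order execution and in-order commit can be sketched with a simple dataflow model. It assumes the program is already renamed (each destination written once), unlimited execution units, and an unbounded ROB. The `(dest, srcs, latency)` encoding and the function name are illustrative.

```python
def out_of_order_timeline(program):
    """Return (finish, commit) cycle lists for a renamed program."""
    ready_at = {}                 # physical register -> cycle its value is ready
    finish, commit = [], []
    last_commit = 0
    for dest, srcs, latency in program:
        # Execute as soon as every source operand is available.
        start = max([ready_at.get(r, 0) for r in srcs], default=0)
        done = start + latency
        ready_at[dest] = done
        finish.append(done)
        # The ROB retires in program order: never before an older instruction.
        last_commit = max(done, last_commit)
        commit.append(last_commit)
    return finish, commit


program = [
    ("p1", ["a"], 10),   # slow load
    ("p2", ["p1"], 1),   # depends on the load
    ("p3", ["b"], 1),    # independent of the load
    ("p4", ["p3"], 1),   # depends on p3 only
]
finish, commit = out_of_order_timeline(program)
print("finish:", finish)
print("commit:", commit)
```

The independent instructions finish long before the load does, yet their commit times never precede the load's, which is the ROB's job.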

Key components:

Issue queue (reservation stations) — hold decoded instructions until their operands are ready.
Register renaming — map architectural registers to a larger pool of physical registers, eliminating false dependencies (WAR, WAW).
Reorder buffer (ROB) — track completion in program order; commit retires instructions atomically.
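Register renaming in particular is easy to sketch. The example below gives every write a fresh physical register, which is enough to remove WAR and WAW dependencies. It assumes an unlimited physical register pool and never frees registers, which real hardware must do. The `rename` helper and the `p0, p1, ...` naming are illustrative.

```python
from itertools import count


def rename(program, arch_regs):
    """Rename (dest, srcs) tuples from architectural to physical registers."""
    fresh = (f"p{i}" for i in count())
    mapping = {r: next(fresh) for r in arch_regs}   # initial arch -> phys map
    renamed = []
    for dest, srcs in program:
        new_srcs = [mapping[s] for s in srcs]  # read current mappings first
        mapping[dest] = next(fresh)            # each write gets a new register
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [
    ("R1", ["R2", "R3"]),   # R1 = R2 + R3
    ("R2", ["R1"]),         # WAR on R2 against the first instruction
    ("R1", ["R3"]),         # WAW on R1 against the first instruction
]
for before, after in zip(program, rename(program, ["R1", "R2", "R3"])):
    print(before, "->", after)
```

After renaming, the third instruction writes a register nobody else touches, so it can execute before or alongside the first two. Only the true dependency from the first instruction to the second remains.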

Modern CPU examples

Architecture	Pipeline depth	Issue width	Out-of-order	Branch prediction	Typical use
Intel x86-64	14–19	4–6 wide	Aggressive	TAGE + perceptron	Desktop, server, workstation
ARM Cortex-A	8–15	2–6 wide	A9 and later big cores (in-order on small cores)	Tournament + RAS	Mobile, embedded, server
RISC-V (BOOM, SiFive)	5–10	1–4 wide	Configurable	TAGE, simpler variants	Research, custom silicon, education

For software developers

Branches are expensive. Minimize unpredictable branches in hot paths — they're the difference between IPC ≈ 1 and IPC ≈ 0.3.
Data locality matters. Cache misses stall the pipeline; pipeline tricks can't paper over a 200-cycle DRAM latency.
ILP-friendly code. Independent operations let superscalar back-ends shine; long dependency chains serialize execution.
Profile, don't guess. Hardware performance counters report stalls, mispredictions, and IPC directly.
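One way to reason about ILP-friendly code is to measure the longest dependency chain. The sketch below compares summing eight values into one accumulator against two accumulators combined at the end. It assumes unit-latency operations, and `critical_path` is a hypothetical helper rather than a profiling tool.

```python
def critical_path(program):
    """Longest dependency chain, in unit-latency operations."""
    depth = {}                    # register -> chain length that produces it
    for dest, srcs in program:
        # Inputs never written by the program have depth 0.
        depth[dest] = 1 + max([depth.get(s, 0) for s in srcs], default=0)
    return max(depth.values(), default=0)


# acc += x[i], eight times: every add waits for the previous one.
one_acc = [("acc", ["acc", f"x{i}"]) for i in range(8)]

# Even and odd elements go to separate accumulators, then one final add.
two_acc = [(f"acc{i % 2}", [f"acc{i % 2}", f"x{i}"]) for i in range(8)]
two_acc.append(("total", ["acc0", "acc1"]))

print("one accumulator: ", critical_path(one_acc))
print("two accumulators:", critical_path(two_acc))
```

The two-accumulator version executes one more operation but has a shorter chain, which a superscalar back-end can exploit by running the two halves in parallel.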

For system designers

Depth is a tradeoff. Deeper pipelines clock higher but pay larger mispredict penalties.
Power scales with stages. Every pipeline register adds flip-flops that burn dynamic power on each clock edge and leak static power even when idle.
Branch prediction is load-bearing. Modern designs run >95% prediction accuracy because every miss flushes the pipeline, wasting on the order of 15 to 20 cycles on deep designs.
Memory hierarchy is half the battle. Pipeline efficiency only matters if the cache keeps it fed.
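The depth tradeoff can be illustrated with a toy model: splitting a fixed amount of logic across more stages shortens the cycle, but each stage adds latch overhead and the misprediction penalty is assumed to grow with depth. All parameter values below are made up for illustration, and `time_per_instruction` is a hypothetical name.

```python
def time_per_instruction(depth: int, logic_ns: float = 5.0, latch_ns: float = 0.25,
                         branch_fraction: float = 0.2,
                         mispredict_rate: float = 0.1) -> float:
    """Average nanoseconds per instruction for a pipeline of the given depth."""
    cycle_ns = logic_ns / depth + latch_ns      # logic per stage + register overhead
    # Assumption: a misprediction costs roughly `depth` cycles.
    cpi = 1.0 + branch_fraction * mispredict_rate * depth
    return cycle_ns * cpi


best = min(range(1, 101), key=time_per_instruction)
for depth in (1, 5, best, 100):
    print(f"depth {depth:>3}: {time_per_instruction(depth):.3f} ns/instruction")
```

Under these assumptions performance improves with depth up to a point and then degrades. The exact optimum depends entirely on the chosen parameters.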

Common misconceptions

"Deeper pipelines are always faster." Not true — past a point, branch mispredictions and clock-distribution overhead eat the gains.
"Out-of-order eliminates all stalls." Memory latency, long dependency chains, and limited physical register pools still serialize execution.
"Superscalar means N× faster." Real workloads have dependencies and resource contention; 4-wide CPUs rarely sustain IPC > 2.
"Modern CPUs execute sequentially." They aggressively reorder, speculate, and parallelize. The ROB makes it look sequential.