Modern CPUs achieve high performance through sophisticated pipeline architectures that enable instruction-level parallelism. This page walks through pipelining from first principles — why it exists, how it breaks, and how real CPUs paper over the breaks with forwarding, prediction, and out-of-order execution.
Why pipeline at all?
A non-pipelined CPU finishes one instruction before starting the next. Each instruction spends ~5 cycles passing through fetch, decode, execute, memory, and write-back. Three instructions take 15 cycles.
Pipelining overlaps the stages: while instruction 1 is in EX, instruction 2 is in ID, instruction 3 is in IF. Same hardware, same work per instruction — but the throughput jumps.
Without pipelining — one instruction at a time
With pipelining — stages overlap
Same three instructions, 7 cycles instead of 15: less than half. That's the entire point of pipelining.
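The cycle counts for the two cases follow from a simple formula. Here is a minimal sketch in Python, assuming an ideal pipeline with one cycle per stage and no stalls (the function names are illustrative, not from any library):

```python
def cycles_unpipelined(n_instructions, n_stages=5):
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=5):
    """The first instruction fills the pipeline; each later one finishes
    one cycle after its predecessor (ideal case, no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    for n in (3, 100):
        print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

For three instructions this gives 15 cycles unpipelined and 7 pipelined. For 100 instructions it is 500 versus 104, which is why the speedup approaches the stage count on long instruction streams.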
The 5-stage RISC pipeline
The canonical RISC pipeline divides instruction execution into five stages, each one cycle long. Between each pair of stages sits a pipeline register that latches the previous stage's output for the next stage to consume on the next clock edge. Once the pipeline is full, one instruction completes every cycle.
Steady-state: one instruction completes per cycle
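To see the steady state concretely, the sketch below builds the classic timing diagram as a grid and counts write-backs per cycle. It assumes the five stage names IF, ID, EX, MEM, WB and no stalls; `pipeline_diagram` and `completions_per_cycle` are hypothetical helper names chosen for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions):
    """Return one row per instruction; column c holds the stage occupied in cycle c + 1."""
    if n_instructions == 0:
        return []
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_cycles
        # Instruction i enters IF in cycle i + 1 and advances one stage per cycle.
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        rows.append(row)
    return rows


def completions_per_cycle(rows):
    """Count how many instructions are in WB in each cycle."""
    if not rows:
        return []
    return [sum(1 for row in rows if row[c] == "WB") for c in range(len(rows[0]))]


if __name__ == "__main__":
    diagram = pipeline_diagram(3)
    for row in diagram:
        print(" ".join(f"{cell:>3}" for cell in row))
    print(completions_per_cycle(diagram))
```

With three instructions the completion counts are `[0, 0, 0, 0, 1, 1, 1]`: nothing retires while the pipeline fills, then exactly one instruction per cycle.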
Hazards: when pipelining gets messy
An instruction's stage may need data, control flow, or hardware that another in-flight instruction hasn't released. These dependency conflicts are called hazards. There are three kinds, in roughly decreasing severity for modern CPUs.
Data hazard: LOAD then dependent ADD
    LOAD R2, [mem]   // R2 won't be written back until cycle 5
    ADD R4, R2, R3   // reads R2, needs the result early
Forwarding routes the loaded value directly from the MEM stage of the LOAD into the EX stage of the ADD, skipping the wait for write-back. Stalls shrink from two to one, so the ADD finishes one cycle earlier.
Control hazard: BEQ then speculatively fetched ADD
    BEQ R1, R2, label
    ADD R3, R4, R5   // fetched speculatively
Structural hazard: load in MEM while another instruction is in IF
    i1: MEM (load)
    i2: IF           // wants the same memory port
Superscalar: doing more per cycle
A single pipeline tops out at IPC = 1.0 (one instruction completes per cycle in steady state). Superscalar designs add multiple pipelines side by side so the front-end fetches and issues more than one instruction per cycle.
Two pipelines running in parallel
Two independent execution pipelines fetch and execute two instructions per cycle. Real superscalar CPUs add register renaming, scheduler logic, and out-of-order issue to keep both pipelines fed when dependencies appear. That's where out-of-order execution comes in.
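Why dependencies starve a second pipeline can be shown with a toy issue model. The sketch below assumes a 2-wide in-order front end that pairs two adjacent instructions unless the second one reads or overwrites the first one's destination. It counts issue cycles only and ignores latencies, caches, and branches; `dual_issue_cycles` is a hypothetical name.

```python
def dual_issue_cycles(program):
    """Count issue cycles for a 2-wide in-order machine.

    program is a list of (dest, srcs) tuples using register names.
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program):
            dest, _ = program[i]
            next_dest, next_srcs = program[i + 1]
            # The pair can issue together only if the second instruction
            # neither reads nor overwrites the first one's destination.
            independent = dest is None or (dest not in next_srcs and dest != next_dest)
            if independent:
                i += 2
                continue
        i += 1
    return cycles


independent_ops = [("R1", ("R2",)), ("R3", ("R4",)), ("R5", ("R6",)), ("R7", ("R8",))]
dependency_chain = [("R1", ("R0",)), ("R2", ("R1",)), ("R3", ("R2",)), ("R4", ("R3",))]

if __name__ == "__main__":
    for name, prog in (("independent", independent_ops), ("chain", dependency_chain)):
        cycles = dual_issue_cycles(prog)
        print(name, cycles, len(prog) / cycles)
```

Four independent instructions issue in 2 cycles (IPC 2.0), while a four-instruction dependency chain needs 4 cycles (IPC 1.0), no better than a single pipeline.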
Out-of-order execution
Modern high-performance CPUs execute instructions as soon as their operands are ready, regardless of program order. A reorder buffer (ROB) puts the results back in program order before they become visible. This decouples scheduling from semantics: any execution order that respects data dependencies is legal.
Out-of-Order pipeline
The front end keeps the back end fed by renaming and queueing instructions. Execution runs in any order that satisfies dependencies. The ROB and commit stage put the world back in program order before anything becomes visible.
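The commit rule can be sketched in a few lines. This is a simplified illustration rather than a model of any specific CPU: entries are allocated in program order, may be marked done in any order, and retire only from the head. The `ReorderBuffer` class and `demo` function are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # program order, oldest entry on the left

    def allocate(self, name):
        """Reserve an entry at dispatch time, in program order."""
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def commit_ready(self):
        """Retire from the head for as long as the oldest entry has finished."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


def demo():
    rob = ReorderBuffer()
    entries = {name: rob.allocate(name) for name in ("i1", "i2", "i3")}
    log = []
    # Instructions finish executing out of order: i3 first, then i1, then i2.
    for finished in ("i3", "i1", "i2"):
        entries[finished]["done"] = True
        log.append((finished, rob.commit_ready()))
    return log


if __name__ == "__main__":
    for finished, retired in demo():
        print(f"{finished} finished, retired: {retired}")
```

When i3 finishes first, nothing retires because i1 is still at the head. Once i1 finishes it retires alone, and when i2 finishes both i2 and i3 retire together, in program order.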
Key components:
- Issue queue (reservation stations) — hold decoded instructions until their operands are ready.
- Register renaming — map architectural registers to a larger pool of physical registers, eliminating false dependencies (WAR, WAW).
- Reorder buffer (ROB) — track completion in program order; commit retires instructions atomically.
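Register renaming from the list above can be illustrated with a mapping table. This sketch assumes an unbounded supply of physical registers and never frees them, which real hardware cannot do; the names `rename` and `P0`, `P1`, ... are made up for the example.

```python
import itertools


def rename(program):
    """Rewrite (dest, srcs) tuples from architectural to physical registers."""
    fresh = (f"P{i}" for i in itertools.count())
    table = {}  # architectural register -> current physical register

    def lookup(reg):
        if reg not in table:
            table[reg] = next(fresh)  # holds the register's initial value
        return table[reg]

    renamed = []
    for dest, srcs in program:
        # Sources read the current mapping before the destination is remapped.
        phys_srcs = tuple(lookup(r) for r in srcs)
        table[dest] = next(fresh)  # every write gets a brand-new physical register
        renamed.append((table[dest], phys_srcs))
    return renamed


example = [
    ("R1", ("R2", "R3")),  # i1: R1 = R2 op R3
    ("R4", ("R1",)),       # i2: reads R1 (true dependency on i1)
    ("R1", ("R5",)),       # i3: rewrites R1 (WAW with i1, WAR with i2)
]

if __name__ == "__main__":
    for before, after in zip(example, rename(example)):
        print(before, "->", after)
```

After renaming, i1 writes `P2`, i2 reads `P2`, and i3 writes `P5`. The true dependency from i1 to i2 survives, but i3 no longer touches a register that i1 or i2 uses, so it is free to execute early.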
Modern CPU examples
| Architecture | Pipeline depth | Issue width | Out-of-order | Branch prediction | Typical use |
|---|---|---|---|---|---|
| Intel x86-64 | 14–19 | 4–6 wide | Aggressive | TAGE + perceptron | Desktop, server, workstation |
| ARM Cortex-A | 8–15 | 2–6 wide | A15+ (in-order on small cores) | Tournament + RAS | Mobile, embedded, server |
| RISC-V (BOOM, SiFive) | 5–10 | 1–4 wide | Configurable | TAGE, simpler variants | Research, custom silicon, education |
Try it — interactive pipeline playground
Pick a scenario, flip forwarding off, watch stalls appear. Turn prediction off and see how a branch costs you cycles. Switch pipeline depth from 5 to 10 and see throughput stay flat (latency increases, but steady-state IPC doesn't).
Practical implications
For software developers
- Branches are expensive. Minimize unpredictable branches in hot paths — they're the difference between IPC ≈ 1 and IPC ≈ 0.3.
- Data locality matters. Cache misses stall the pipeline; pipeline tricks can't paper over a 200-cycle DRAM latency.
- ILP-friendly code. Independent operations let superscalar back-ends shine; long dependency chains serialize execution.
- Profile, don't guess. Hardware performance counters report stalls, mispredictions, and IPC directly.
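The branch advice in this list can be made quantitative with a back-of-the-envelope model: effective CPI is the base CPI plus the expected misprediction penalty per instruction. The inputs below (20% branches, a 15-cycle penalty) are illustrative assumptions, not measurements, and `effective_ipc` is a hypothetical helper.

```python
def effective_ipc(branch_fraction, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """Estimate IPC when the only stall source is branch misprediction."""
    cpi = base_cpi + branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / cpi


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.5):
        print(rate, round(effective_ipc(0.2, rate, 15), 2))
```

With these made-up inputs, a 5% mispredict rate gives an IPC of about 0.87, while a coin-flip branch (50%) drops it to 0.4. Real numbers should come from hardware performance counters.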
For system designers
- Depth is a tradeoff. Deeper pipelines clock higher but pay larger mispredict penalties.
- Power scales with stages. Every pipeline register adds clocked flip-flops that burn dynamic power each cycle and leak static power even when idle.
- Branch prediction is load-bearing. Modern designs run >95% prediction accuracy because every miss is dozens of wasted cycles.
- Memory hierarchy is half the battle. Pipeline efficiency only matters if the cache keeps it fed.
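The depth tradeoff in the first bullet can be explored with a toy model. Everything here is an assumption made for illustration: the logic delay splits evenly across stages, each stage pays a fixed latch overhead, and a mispredicted branch costs as many cycles as the pipeline is deep. The default numbers are invented and the function names are hypothetical.

```python
def time_per_instruction(depth, logic_delay=100.0, latch_overhead=5.0,
                         branch_fraction=0.2, mispredict_rate=0.05):
    """Average time per instruction in arbitrary delay units."""
    cycle_time = logic_delay / depth + latch_overhead          # shorter stages, fixed overhead
    cpi = 1.0 + branch_fraction * mispredict_rate * depth      # penalty grows with depth
    return cycle_time * cpi


def best_depth(max_depth=100, **params):
    """Brute-force the depth that minimizes time per instruction."""
    return min(range(1, max_depth + 1),
               key=lambda d: time_per_instruction(d, **params))


if __name__ == "__main__":
    d = best_depth()
    print(d, time_per_instruction(d))
    for depth in (5, 20, 100):
        print(depth, time_per_instruction(depth))
```

The curve has a minimum somewhere in the middle: very shallow pipelines lose on clock period, very deep ones lose on latch overhead and misprediction cost. Where the minimum falls depends entirely on the assumed parameters, so the model shows only the shape of the tradeoff.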
Common misconceptions
- "Deeper pipelines are always faster." Not true — past a point, branch mispredictions and clock-distribution overhead eat the gains.
- "Out-of-order eliminates all stalls." Memory latency, long dependency chains, and limited physical register pools still serialize execution.
- "Superscalar means N× faster." Real workloads have dependencies and resource contention; 4-wide CPUs rarely sustain IPC > 2.
- "Modern CPUs execute sequentially." They aggressively reorder, speculate, and parallelize. The ROB makes it look sequential.
Further reading
- Patterson & Hennessy: Computer Organization and Design
- Shen & Lipasti: Modern Processor Design
- Intel / AMD / ARM optimization guides
- Agner Fog's optimization manuals
