Why Flynn's Classification Still Matters
Michael Flynn proposed this taxonomy in 1966, categorizing architectures by two dimensions: instruction streams and data streams. Nearly 60 years later, every processor you use — from your phone to a GPU cluster — falls into one of his four categories. Understanding which category your workload needs is the first step to making it fast.
The four categories form a 2×2 matrix:
| | Single Data | Multiple Data |
|---|---|---|
| Single Instruction | SISD | SIMD |
| Multiple Instruction | MISD | MIMD |
SISD: The Sequential Baseline
Single Instruction, Single Data — one instruction stream operates on one data stream, one element at a time. This is the classical von Neumann architecture: fetch an instruction, decode it, execute it on one piece of data, write the result.
Every modern CPU starts here. A single core executing scalar code is SISD. The performance ceiling is the clock frequency times the IPC (instructions per clock). Modern CPUs push IPC to ~5-6 through pipelining, out-of-order execution, and branch prediction — but the fundamental model is still one instruction stream on one data stream.
```c
// Pure SISD — one element at a time
for (int i = 0; i < N; i++) {
    result[i] = a[i] + b[i];  // 1 add per cycle
}
```
SISD is not dead. It’s the fallback when your data has irregular structure, unpredictable branches, or pointer-chasing access patterns. Compilers try to auto-vectorize SISD code into SIMD, but complex control flow defeats them.
SIMD: Data Parallelism
Single Instruction, Multiple Data — one instruction operates on N data elements simultaneously. Instead of adding two numbers, you add two vectors of 4, 8, 16, or even 512 numbers in a single operation.
This is the workhorse of scientific computing, graphics, and machine learning. Every matrix multiplication, every convolution, every pixel operation maps naturally to SIMD.
The SIMD Evolution
x86 SIMD has grown from 64-bit MMX registers in 1997 to 8192-bit AMX tile registers in 2023 — a 128x increase in width.
Real SIMD Code
The difference between scalar and SIMD is stark:
```c
// Scalar: 1 add per cycle
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

// AVX2: 8 adds per cycle
for (int i = 0; i < N; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);
    __m256 vb = _mm256_load_ps(&b[i]);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_store_ps(&c[i], vc);
}

// AVX-512: 16 adds per cycle
for (int i = 0; i < N; i += 16) {
    __m512 va = _mm512_load_ps(&a[i]);
    __m512 vb = _mm512_load_ps(&b[i]);
    __m512 vc = _mm512_add_ps(va, vb);
    _mm512_store_ps(&c[i], vc);
}
```
At 3 GHz with AVX-512, that’s 16 floats × 3 GHz = 48 GFLOPS from a single core — 16x the scalar throughput.
The Branch Divergence Problem
SIMD’s Achilles’ heel: all lanes must execute the same instruction. When lanes need different code paths (if/else), the hardware must serialize — execute the if-branch with some lanes masked, then the else-branch with the others masked. This is called branch divergence, and it’s why GPUs struggle with irregular workloads.
When SIMD Fails
SIMD efficiency drops toward zero when:
- Branches are data-dependent — each element needs a different code path
- Memory access is irregular — scatter/gather instead of contiguous loads
- Data has dependencies — element N depends on element N-1 (loop-carried)
MISD: The Odd One Out
Multiple Instructions, Single Data — different instructions process the same data stream simultaneously. This is the rarest category, and many textbooks dismiss it as “theoretical.” But two real applications exist:
Triple Modular Redundancy (TMR)
The Space Shuttle’s flight control ran five computers on the same sensor data: four executed identical primary software and voted out any machine that disagreed with the majority, while a fifth ran an independently developed backup implementation. This is MISD: multiple instruction streams (redundant software) operating on the same data stream (sensor readings).
Nuclear reactor control systems, aircraft fly-by-wire, and medical devices use the same pattern. When a wrong answer kills people, you run three independent computations on the same input and take the majority vote.
Systolic Arrays
Google’s TPU is a systolic array — data flows through a grid of processing elements, each multiply-accumulating against its own stationary weight as the stream passes through. While typically classified as a specialized SIMD architecture, the data-reuse pattern is fundamentally MISD: one data stream, multiple processing stages.
MIMD: True Parallelism
Multiple Instructions, Multiple Data — independent processors execute different instructions on different data simultaneously. This is the most flexible and most common parallel architecture.
Shared Memory MIMD
All processors share the same address space. Communication is implicit — one processor writes to memory, another reads it. The challenge is cache coherence: when CPU 0 modifies a cache line that CPU 1 has cached, the hardware must invalidate CPU 1’s copy.
```c
// OpenMP — shared memory MIMD
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    // Each thread can take different code paths
    if (data[i] > threshold)
        result[i] = expensive_computation(data[i]);
    else
        result[i] = cheap_fallback(data[i]);
}
```
Modern multi-core CPUs are shared-memory MIMD. A 64-core EPYC runs 64 independent instruction streams on 64 independent data streams, sharing DRAM through a cache-coherent interconnect.
Distributed Memory MIMD
Each processor has private memory. Communication is explicit — processes send and receive messages over a network. This scales to thousands of nodes but requires the programmer to manage data distribution.
```c
// MPI — distributed memory MIMD
if (rank == 0) {
    MPI_Send(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
} else {
    MPI_Recv(data, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    process(data);
}
```
HPC clusters, cloud computing, and MapReduce are all distributed-memory MIMD. The key advantage over SIMD: each processor can execute completely different code, making MIMD ideal for irregular workloads, web servers, and databases.
Which Architecture for Which Workload?
The right architecture depends on your workload’s structure. Regular, data-parallel workloads favor SIMD; irregular, task-parallel workloads favor MIMD.
Modern Hybrid Architectures
No real processor is purely one category. Modern hardware combines all of Flynn’s categories:
A modern CPU core is SISD for scalar code, SIMD within its vector units (AVX), and part of a MIMD system at the multi-core level. A 16-core CPU with AVX-512 is simultaneously MIMD (16 independent instruction streams) and SIMD (16 float lanes per core) — giving 16 × 16 = 256 float operations per cycle.
A GPU uses NVIDIA’s SIMT (Single Instruction, Multiple Thread) model: 32 threads form a warp that executes in lockstep (SIMD), but thousands of warps run independently (MIMD). The GPU is SIMD within a warp and MIMD across warps. This hybrid lets GPUs handle moderate branching without collapsing to serial execution.
A CPU+GPU system is MIMD at the top level: the CPU handles control flow and serial tasks while the GPU handles data-parallel compute. Within each device, the parallelism is further decomposed into SIMD lanes and MIMD cores.
Key Takeaways
- Flynn's 2x2 matrix — instruction streams (single/multiple) × data streams (single/multiple) gives SISD, SIMD, MISD, MIMD.
- SIMD is the performance multiplier — from 4x (SSE) to 16x (AVX-512), same instruction on N data elements. But branch divergence kills it.
- MIMD is the flexibility champion — independent processors with independent code paths. Scales from 2 cores to 100,000 nodes.
- MISD is real but rare — triple modular redundancy in safety-critical systems and systolic array data reuse.
- Modern hardware is hybrid — a GPU is SIMD within a warp and MIMD across warps. A CPU is SIMD within a core and MIMD across cores.
Related Concepts
- CPU Pipelines: Instruction-level parallelism within SISD cores
- OpenMP: Shared-memory MIMD programming with thread-level parallelism
- MPI Fundamentals: Distributed-memory MIMD programming with message passing
- HPC Performance Optimization: Amdahl’s and Gustafson’s Laws for parallel speedup limits
Further Reading
- Flynn's Taxonomy - Comprehensive overview with historical context and modern extensions
- Computer Architecture: A Quantitative Approach - Hennessy & Patterson, Chapter 4 covers parallel architectures in depth
- Introduction to Parallel Computing - LLNL's tutorial covering all Flynn categories with practical examples
