Why Data Layout Matters
When storing collections of multi-field data—particles, vertices, database records—the memory layout choice between Array of Structures (AoS) and Structure of Arrays (SoA) can result in 10-100x performance differences. This single architectural decision affects CPU cache efficiency, SIMD vectorization, and GPU memory coalescing.
The Library Analogy
Imagine organizing a library of books, where each book has: title, author, year, and genre.
AoS (Traditional Shelving): Each book sits together with all its information on one shelf card.
- To find all titles? You must visit every single shelf and read each card.
- Great when you need everything about one specific book.
SoA (Columnar Organization): All titles on one shelf, all authors on another, all years on a third.
- To find all titles? Just visit the titles shelf—done!
- Perfect when you only need one piece of information from every book.
This is exactly how CPUs access memory. SoA lets the CPU grab what it needs without wading through irrelevant data.
Understanding the Two Layouts
Array of Structures (AoS)
Groups all fields of each object together in memory. Each particle's x, y, z, velocity, and mass are stored contiguously. Natural for object-oriented thinking.
Structure of Arrays (SoA)
Groups each field into separate contiguous arrays. All x-values together, all y-values together. Optimal for batch processing and SIMD operations.
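In code, the contrast looks like this. A minimal C++ sketch, with field names chosen for illustration:

```cpp
#include <vector>

// AoS: one struct per particle; all fields of a particle are adjacent.
// The collection is a single array of complete records.
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass, charge;
};
using ParticlesAoS = std::vector<ParticleAoS>;

// SoA: one dense array per field; all x-values are adjacent,
// all y-values are adjacent, and so on.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass, charge;
};
```

Iterating over `x` in the AoS layout strides through memory 32 bytes at a time (one full record); in the SoA layout the stride is just 4 bytes.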
Why Layout Matters: The Cache Efficiency Story
1. CPU requests a single value — You ask for particle[0].x, just 4 bytes of data.
2. Hardware loads an entire cache line — The CPU doesn't fetch 4 bytes. It loads a full 64-byte cache line containing that address.
3. Layout determines what comes along — With AoS, you get x, y, z, vx, vy, vz, mass, charge for ONE particle (useful if you need all fields). With SoA, you get x₀, x₁, x₂... x₁₅ for 16 particles (useful if processing all x-values).
4. Unused data wastes bandwidth — If you only need x-values across all particles, AoS wastes 87.5% of the loaded bytes; SoA uses 100%.
Cache Efficiency Comparison
| Access Pattern | AoS Efficiency | SoA Efficiency |
|---|---|---|
| All fields of one object | 100% | 12.5% |
| Position only (x, y, z) | 37.5% | 100% |
| Single field (x) across all | 12.5% | 100% |
The pattern is clear: AoS wins for random access to complete objects. SoA wins overwhelmingly for batch operations on specific fields—which is the common case in simulations, games, and data processing.
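The difference is easy to model: count how many distinct cache lines each layout touches for the same batch read. A small sketch (assuming 4-byte floats, 32-byte AoS records, and 64-byte cache lines):

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Count how many distinct 64-byte cache lines are touched when
// reading one 4-byte field from each of n records.
// 'stride' is the distance in bytes between consecutive values:
// 32 for AoS (an 8-float record), 4 for SoA (a dense float array).
std::size_t lines_touched(std::size_t n, std::size_t stride) {
    std::set<std::uint64_t> lines;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t addr = static_cast<std::uint64_t>(i) * stride;
        lines.insert(addr / 64);  // 64-byte cache line index
    }
    return lines.size();
}
// For 1024 particles: AoS touches 512 lines, SoA only 64,
// an 8x difference in memory traffic for the same useful data.
```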
SIMD Vectorization
Modern processors don't process just one value at a time. SIMD (Single Instruction, Multiple Data) units process 4-16 values simultaneously using vector instruction sets like AVX2.
The Problem with AoS: To process 8 x-values, the CPU must gather them from 8 different memory locations (scattered across 8 particle structures). This "gather" operation is slow.
The SoA Advantage: All 8 x-values are already adjacent in memory. One instruction loads all 8. One instruction processes all 8. One instruction stores all 8.
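The same update loop, written over both layouts, shows what the compiler has to work with (plain C++; the vectorization itself happens in the compiler, e.g. at -O2 with AVX2 enabled):

```cpp
#include <cstddef>
#include <vector>

// SoA: x[i] and vx[i] are contiguous, so a vectorizing compiler can
// emit one wide load, one fused multiply-add, and one wide store
// per group of 8 floats.
void step_soa(std::vector<float>& x, const std::vector<float>& vx,
              float dt) {
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += vx[i] * dt;
}

// AoS: the logically identical loop strides 8 floats per iteration,
// so vectorizing it requires gathers or shuffles, and compilers
// often leave it scalar.
struct P { float x, y, z, vx, vy, vz, mass, charge; };
void step_aos(std::vector<P>& p, float dt) {
    for (auto& q : p)
        q.x += q.vx * dt;
}
```

Both functions compute the same result; only the memory access pattern, and therefore the generated code, differs.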
With AoS, a vector load of 8 x-values compiles to a gather instruction (vgatherdps): 8 separate fetches. With SoA it is a single contiguous 256-bit load.
GPU Memory Coalescing
GPUs are even more sensitive to memory layout. A warp of 32 threads accessing data:
- AoS: 32 threads need 32 separate memory transactions—the GPU serializes these, destroying parallelism.
- SoA: 32 threads access 32 adjacent floats—hardware coalesces into 1-2 transactions.
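The transaction counts above can be checked with a toy model that assumes 128-byte transaction granularity. The exact worst case depends on record size: a 32-byte record costs 8 transactions per warp, and records of 128 bytes or more hit the full 32.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Count the 128-byte memory transactions needed for one warp of
// 32 threads, each reading a 4-byte float at the given byte stride.
std::size_t warp_transactions(std::size_t stride_bytes) {
    std::set<std::uint64_t> segments;
    for (int t = 0; t < 32; ++t)
        segments.insert((t * stride_bytes) / 128);  // 128-byte segment
    return segments.size();
}
// SoA: stride 4, 32 * 4 = 128 bytes total, 1 transaction.
// AoS, 32-byte record: stride 32 spans 1024 bytes, 8 transactions.
// AoS, 128-byte record: every thread lands in its own segment, 32.
```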
With AoS, the x-values are scattered with a stride of 8 floats (one full particle record apart), so each of the 32 threads triggers its own memory transaction. With SoA, the 32 x-values occupy one contiguous 128-byte block that the hardware fetches in one or two transactions.
When to Use Each Layout
| Use Case | AoS | SoA |
|---|---|---|
| Batch processing (one field across many objects) | Poor: must skip over unused fields | Excellent: data is perfectly contiguous |
| SIMD/vectorization (CPU vector instructions, e.g. AVX) | Poor: requires slow gather operations | Excellent: direct load of 8+ values |
| GPU performance (memory coalescing for parallel threads) | Poor: 32 transactions per warp | Excellent: 1-2 transactions per warp |
| Random access (all fields of a random object) | Excellent: one cache line gets everything | Poor: must touch multiple arrays |
| Code simplicity (natural mapping to OOP) | Excellent: objects are self-contained | Moderate: requires restructuring mindset |
| Adding/removing objects (dynamic collections) | Excellent: simple array operations | Moderate: must update all arrays |
Choose AoS When:
- Object-oriented design is paramount
- Random access to complete objects dominates
- Small working sets fit in cache
- Using pointer-based structures (linked lists, trees)
Choose SoA When:
- Batch processing many objects
- SIMD optimization is critical
- GPU computing (CUDA/OpenCL)
- Scientific simulations with large datasets
Consider AoSoA Hybrid:
Group objects into SIMD-width blocks (8 for AVX2, 16 for AVX-512). Each block uses SoA internally. This provides cache locality of AoS with vectorization benefits of SoA.
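An AoSoA block for AVX2 might look like this; a sketch with illustrative field names, using blocks of 8 floats:

```cpp
#include <cstddef>
#include <vector>

// AoSoA: particles grouped into blocks of 8 (one AVX2 register
// width); within a block, each field is a small contiguous array.
struct ParticleBlock8 {
    float x[8], y[8], z[8];
    float vx[8], vy[8], vz[8];
    float mass[8], charge[8];
};

// Field x of global particle p lives at blocks[p / 8].x[p % 8].
void advance_x(std::vector<ParticleBlock8>& blocks, float dt) {
    for (auto& b : blocks)
        for (int i = 0; i < 8; ++i)  // inner loop vectorizes cleanly
            b.x[i] += b.vx[i] * dt;
}
```

One block is 256 bytes (four cache lines), so a whole particle stays local in memory like AoS, while each inner loop sees SoA-style contiguous data.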
Common Pitfalls to Avoid
1. Premature Optimization
Converting to SoA without measuring first. Memory layout only matters when memory bandwidth is the bottleneck.
Solution: Profile with cache miss counters before restructuring. If computation-bound, layout won't help.
2. Forgetting Alignment
SIMD instructions require data aligned to 16/32/64-byte boundaries. Unaligned access causes crashes or severe slowdowns.
Solution: Use alignas(32) or aligned allocators. Ensure array sizes are multiples of SIMD width.
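For example (C++17; the array size and helper names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Static or stack arrays: alignas guarantees the boundary.
alignas(32) static float xs[1024];  // 32-byte aligned for AVX2 loads

bool is_aligned32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Heap arrays: std::aligned_alloc (C++17). Note the size argument
// must be a multiple of the alignment.
float* alloc_aligned_floats(std::size_t n) {
    return static_cast<float*>(
        std::aligned_alloc(32, n * sizeof(float)));
}
```

Memory from `std::aligned_alloc` is released with `std::free`.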
3. False Sharing in Multi-threaded Code
Different threads writing to arrays that share cache lines causes constant invalidation.
Solution: Pad arrays to cache line boundaries (64 bytes). Use thread-local accumulators.
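A padded per-thread accumulator is a one-line fix; a sketch assuming 64-byte cache lines:

```cpp
// Each thread writes only its own counter. alignas(64) both aligns
// the struct to a cache-line boundary and pads its size to 64 bytes,
// so two counters never share a line and never invalidate each other.
struct alignas(64) PaddedCounter {
    long value;
};
```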
4. Mixing Layouts Inconsistently
Half the codebase uses AoS, half uses SoA. Constant conversion overhead negates benefits.
Solution: Choose one layout for your hot path and stick with it. Convert at system boundaries only.
Real-World Applications
| Domain | Application | Why SoA/AoS |
|---|---|---|
| Game Engines | Unity DOTS, Unreal Mass | SoA enables millions of entities at 60fps |
| Scientific Computing | LAMMPS, GROMACS molecular dynamics | SoA with SIMD achieves 10x+ speedups |
| Columnar Databases | Apache Parquet, Arrow, DuckDB | SoA (columnar) for efficient analytical queries |
| Machine Learning | PyTorch, NumPy tensors | SoA for optimal GPU batch processing |
| Image Processing | FFmpeg planar formats | SoA (planar RGB) enables SIMD color processing |
| Financial Systems | HFT price feeds | SoA for rapid scanning across instruments |
Key Takeaways
1. Layout is a 10-100x decision — Not a micro-optimization; this is architectural.
2. SoA wins for batch processing — If you touch one field across many objects, SoA is almost always faster.
3. AoS wins for random access — If you need all fields of random objects, AoS delivers them in a single cache line.
4. SIMD and GPUs demand SoA — Modern hardware parallelism requires contiguous data to reach peak performance.
5. Measure first — Profile cache misses before restructuring. The wrong layout for your access pattern can cost 8-10x.
Profiling Tools
Use these tools to measure the impact of layout changes:
- Intel VTune — CPU cache analysis and memory bandwidth
- NVIDIA Nsight — GPU coalescing metrics
- Linux perf — `perf stat -e cache-misses` for quick cache analysis
Related Concepts
- CPU Cache Lines — Understanding why layout affects cache efficiency
- Memory Hierarchy — GPU memory coalescing patterns
- CPU Optimization — Broader optimization strategies
