SoA vs AoS: Data Layout Optimization

Master Structure of Arrays (SoA) vs Array of Structures (AoS) data layouts for optimal cache efficiency, SIMD vectorization, and GPU memory coalescing.


Why Data Layout Matters

When storing collections of multi-field data—particles, vertices, database records—the memory layout choice between Array of Structures (AoS) and Structure of Arrays (SoA) can result in 10-100x performance differences. This single architectural decision affects CPU cache efficiency, SIMD vectorization, and GPU memory coalescing.

The Library Analogy

Imagine organizing a library of books, where each book has: title, author, year, and genre.

AoS (Traditional Shelving): Each book sits together with all its information on one shelf card.

  • To find all titles? You must visit every single shelf and read each card.
  • Great when you need everything about one specific book.

SoA (Columnar Organization): All titles on one shelf, all authors on another, all years on a third.

  • To find all titles? Just visit the titles shelf—done!
  • Perfect when you only need one piece of information from every book.

This is exactly how CPUs access memory. SoA lets the CPU grab what it needs without wading through irrelevant data.


Understanding the Two Layouts

Array of Structures (AoS)

Groups all fields of each object together in memory. Each particle's x, y, z, velocity, and mass are stored contiguously. Natural for object-oriented thinking.

Structure of Arrays (SoA)

Groups each field into separate contiguous arrays. All x-values together, all y-values together. Optimal for batch processing and SIMD operations.
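In C++, the two layouts might look like the following minimal sketch (field names and the eight-field particle are illustrative, matching the examples used throughout this article):

```cpp
#include <vector>

// AoS: one struct per particle; all of a particle's fields sit together.
struct ParticleAoS {
    float x, y, z;       // position
    float vx, vy, vz;    // velocity
    float mass, charge;
};

// SoA: one array per field; all x-values sit together.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass, charge;
};
```

Reading particle `i`'s x-coordinate is `aos[i].x` in one layout and `soa.x[i]` in the other: the same information, but with very different memory adjacency.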

Why Layout Matters: The Cache Efficiency Story

  1. CPU requests a single value — You ask for particle[0].x—just 4 bytes of data.

  2. Hardware loads entire cache line — The CPU doesn't fetch 4 bytes. It loads a full 64-byte cache line containing that address.

  3. Layout determines what comes along — With AoS, you get x, y, z, vx, vy, vz, mass, charge for ONE particle (useful if you need all fields). With SoA, you get x₀, x₁, x₂... x₁₅ for 16 particles (useful if processing all x-values).

  4. Unused data wastes bandwidth — If you only need x-values across all particles, AoS wastes 87.5% of loaded data. SoA uses 100%.

Cache Efficiency Comparison

| Access Pattern | AoS Efficiency | SoA Efficiency |
|---|---|---|
| All fields of one object | 100% | 12.5% |
| Position only (x, y, z) | 37.5% | 100% |
| Single field (x) across all | 12.5% | 100% |

The pattern is clear: AoS wins for random access to complete objects. SoA wins overwhelmingly for batch operations on specific fields—which is the common case in simulations, games, and data processing.
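The batch-processing case can be made concrete with a sketch (function and type names are illustrative, not from a particular library):

```cpp
#include <cstddef>
#include <vector>

struct ParticleAoS { float x, y, z, vx, vy, vz, mass, charge; };

// AoS pass: each iteration strides over a 32-byte record but consumes
// only the 4 bytes of .x -- the rest of every cache line is wasted.
float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float s = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i) s += p[i].x;
    return s;
}

// SoA pass: x-values are contiguous, so every byte a cache line
// brings in is consumed.
float sum_x_soa(const std::vector<float>& x) {
    float s = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i];
    return s;
}
```

Both functions compute the same sum; the difference is purely in which bytes each loop drags through the cache hierarchy.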

Interactive Memory Explorer

Explore how different data layouts affect memory access patterns and cache efficiency:

Memory Layout Explorer (interactive demo)

Scenario: updating the x-position of 8 particles under AoS. Each cache line load brings in one particle's fields, so every line contributes a single useful value: the demo tallies 8 cache lines loaded, 8 useful values, and 24 wasted ones, for 25% efficiency at the demo's 4-value line size. With a full 8-field particle, AoS wastes 87.5% of bandwidth.

With AoS, each cache line brings in data for one particle. When you only need x values, the other fields are loaded but never used.

SIMD Vectorization

Modern processors don't just process one value at a time. SIMD (Single Instruction, Multiple Data) processes 4-16 values simultaneously using vector instruction sets such as AVX2.

The Problem with AoS: To process 8 x-values, the CPU must gather them from 8 different memory locations (scattered across 8 particle structures). This "gather" operation is slow.

The SoA Advantage: All 8 x-values are already adjacent in memory. One instruction loads all 8. One instruction processes all 8. One instruction stores all 8.
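The SoA advantage needs no intrinsics to exploit. In a sketch like the one below, the contiguous loop is a form modern compilers auto-vectorize into packed AVX2 loads, fused multiply-adds, and stores at `-O2`/`-O3` (the function name and update rule are illustrative):

```cpp
#include <cstddef>
#include <vector>

// SoA update: x[i] += vx[i] * dt. Because x and vx are each
// contiguous, the compiler can process 8 floats per instruction
// (one packed load, one FMA, one packed store) instead of issuing
// 8 scattered gathers as it would for an AoS layout.
void integrate_x(std::vector<float>& x, const std::vector<float>& vx, float dt) {
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += vx[i] * dt;
}
```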

SIMD Vectorization Demo (interactive)

Loading 8 x-values from an AoS layout means scattered access at stride 8: the gather `vgatherdps ymm0, [rax + idx*32]` performs 8 separate fetches, costing roughly 24 cycles while using only 12.5% of the loaded bandwidth. From SoA, the same 8 values fill a 256-bit AVX2 register with a single contiguous load.

GPU Memory Coalescing

GPUs are even more sensitive to memory layout. Consider a warp of 32 threads, each reading one field of one object:

  • AoS: 32 threads need 32 separate memory transactions—the GPU serializes these, destroying parallelism.
  • SoA: 32 threads access 32 adjacent floats—hardware coalesces into 1-2 transactions.
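The transaction arithmetic can be sketched host-side (a simplified model, assuming 128-byte coalesced transactions and 4-byte floats; real counts vary by GPU generation, and the fully serialized 32-transaction worst case above corresponds to record strides of 128 bytes or more):

```cpp
#include <cstddef>
#include <set>

// Count the distinct 128-byte memory segments touched when each of
// 32 warp threads reads a 4-byte float at byte offset t * stride.
// Coalescing hardware issues one transaction per distinct segment.
int warp_transactions(std::size_t stride_bytes) {
    const std::size_t kSegment = 128;
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < 32; ++t)
        segments.insert(t * stride_bytes / kSegment);
    return static_cast<int>(segments.size());
}
// SoA  (stride 4, adjacent floats):    1 transaction per warp
// AoS  (stride 32, 8-float particles): 8 transactions per warp
// AoS  (stride >= 128, large records): 32 transactions per warp
```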

GPU Memory Coalescing (interactive demo)

A warp of 32 threads each reads the x-value of one particle. In the AoS layout the x-values are scattered at stride 8, so each thread triggers its own memory transaction: 32 transactions at roughly 3% efficiency, versus 1-2 coalesced transactions for SoA (about a 32x speedup in the demo).

Performance Comparison

| Scenario | AoS | SoA |
|---|---|---|
| Batch processing (one field across many objects) | Poor: must skip over unused fields | Excellent: data is perfectly contiguous |
| SIMD/vectorization (AVX) | Poor: requires slow gather operations | Excellent: direct load of 8+ values |
| GPU memory coalescing | Poor: 32 transactions per warp | Excellent: 1-2 transactions per warp |
| Random access (all fields of one object) | Excellent: one cache line gets everything | Poor: must access multiple arrays |
| Code simplicity (mapping to OOP) | Excellent: objects are self-contained | Moderate: requires restructuring mindset |
| Adding/removing objects | Excellent: simple array operations | Moderate: must update all arrays |

When to Use Each Layout

Choose AoS When:

  • Object-oriented design is paramount
  • Random access to complete objects dominates
  • Small working sets fit in cache
  • Using pointer-based structures (linked lists, trees)

Choose SoA When:

  • Batch processing many objects
  • SIMD optimization is critical
  • GPU computing (CUDA/OpenCL)
  • Scientific simulations with large datasets

Consider AoSoA Hybrid:

Group objects into SIMD-width blocks (8 for AVX2, 16 for AVX-512). Each block uses SoA internally. This provides cache locality of AoS with vectorization benefits of SoA.
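A minimal AoSoA sketch for AVX2's 8-float width (block layout and names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// AoSoA: particles grouped into blocks of 8 (one AVX2 register's
// width). Within a block each field is a small contiguous array, so
// a block is SIMD-friendly, while whole blocks stay cache-local.
struct ParticleBlock8 {
    float x[8], y[8], z[8];
    float vx[8], vy[8], vz[8];
};

// Element i lives in block i / 8, at SIMD lane i % 8.
float get_x(const std::vector<ParticleBlock8>& blocks, std::size_t i) {
    return blocks[i / 8].x[i % 8];
}
```

Indexing is slightly more involved than either pure layout, which is the price of getting both benefits at once.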

Common Pitfalls to Avoid

1. Premature Optimization

Converting to SoA without measuring first. Memory layout only matters when memory bandwidth is the bottleneck.

Solution: Profile with cache miss counters before restructuring. If computation-bound, layout won't help.

2. Forgetting Alignment

SIMD instructions require data aligned to 16/32/64-byte boundaries. Unaligned access causes crashes or severe slowdowns.

Solution: Use alignas(32) or aligned allocators. Ensure array sizes are multiples of SIMD width.
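One way to apply the fix, as a sketch (the `XBlock` type is hypothetical; heap allocations additionally need an aligned allocator or `std::aligned_alloc`):

```cpp
#include <cstdint>

// alignas(32) guarantees the 32-byte alignment AVX2 aligned loads
// require on the first element; sizeof is also padded to 32 bytes,
// so arrays of XBlock keep every block aligned.
struct alignas(32) XBlock {
    float x[8];   // exactly one 256-bit AVX2 register's worth
};
```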

3. False Sharing in Multi-threaded Code

Different threads writing to arrays that share cache lines causes constant invalidation.

Solution: Pad arrays to cache line boundaries (64 bytes). Use thread-local accumulators.
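The padding trick can be expressed directly in the type, as in this sketch (assuming the common 64-byte cache line):

```cpp
// One accumulator per thread, each occupying its own 64-byte cache
// line: alignas(64) both aligns the struct and pads its size to a
// full line, so a write by one thread never invalidates a line
// holding another thread's counter.
struct alignas(64) PaddedCounter {
    long value = 0;
};
```

An array of `PaddedCounter`, one element per thread, then gives each thread a private line to write to; results are combined once at the end.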

4. Mixing Layouts Inconsistently

Half the codebase uses AoS, half uses SoA. Constant conversion overhead negates benefits.

Solution: Choose one layout for your hot path and stick with it. Convert at system boundaries only.

Real-World Applications

| Domain | Application | Why SoA/AoS |
|---|---|---|
| Game engines | Unity DOTS, Unreal Mass | SoA enables millions of entities at 60 fps |
| Scientific computing | LAMMPS, GROMACS molecular dynamics | SoA with SIMD achieves 10x+ speedups |
| Columnar databases | Apache Parquet, Arrow, DuckDB | SoA (columnar) for efficient analytical queries |
| Machine learning | PyTorch, NumPy tensors | SoA for optimal GPU batch processing |
| Image processing | FFmpeg planar formats | SoA (planar RGB) enables SIMD color processing |
| Financial systems | HFT price feeds | SoA for rapid scanning across instruments |

Key Takeaways

  1. Layout is a 10-100x decision — not a micro-optimization but an architectural choice.

  2. SoA wins for batch processing — If you touch one field across many objects, SoA is almost always faster.

  3. AoS wins for random access — If you need all fields of random objects, AoS avoids pointer chasing.

  4. SIMD and GPUs demand SoA — Modern hardware parallelism requires contiguous data to achieve peak performance.

  5. Measure first — Profile cache misses before restructuring. The "wrong" layout for your access pattern costs 8-10x.

Profiling Tools

Use these tools to measure the impact of layout changes:

  • Intel VTune — CPU cache analysis and memory bandwidth
  • NVIDIA Nsight — GPU coalescing metrics
  • Linux perf — `perf stat -e cache-misses` for quick cache analysis
