ArrPy: Array You Fast Enough?
A hands-on workshop rebuilding NumPy from scratch — progressing from pure Python loops through Cython memoryviews to SIMD-optimized C++ via pybind11, with live benchmarking at each stage.
Overview
What does it take to build NumPy from scratch? And how fast can you make it? This 3-hour workshop answers both questions by guiding participants through a complete reimplementation of NumPy's core — progressing from naive Python loops to SIMD-vectorized C++ that achieves speedups of up to 350x. At each stage, participants benchmark their code on a live leaderboard, making the performance impact of every optimization technique immediately tangible.
Co-presented with Anivesh Pandey.
Workshop Details
- Format: 3-hour hands-on workshop
- Expertise Level: Intermediate
- Presented At: PyCon India 2025, Bengaluru
- Co-presenter: Anivesh Pandey
- Materials: GitHub Repository
The Optimization Journey
Stage 1: Pure Python — Understanding the Algorithms
The workshop starts with a complete array library written in pure Python. Participants implement element-wise operations, broadcasting, and matrix multiplication using nothing but Python lists and loops. The goal isn't speed — it's understanding exactly what NumPy does under the hood.
```python
arrpy.set_backend('python')
# Addition of 1M elements:  ~245ms
# Matrix multiply 500x500:  ~1824ms
```
At this stage, everything is readable and debuggable. Participants can step through matrix multiplication line by line, understanding how broadcasting rules work and why naive nested loops are slow.
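The Stage 1 approach can be sketched as follows. This is a hypothetical illustration, not ArrPy's actual source: element-wise addition and matrix multiplication using nothing but lists and loops.

```python
# Illustrative sketch of the pure-Python stage (not ArrPy's actual code).

def add(a, b):
    """Element-wise addition of two equal-length flat lists."""
    return [x + y for x, y in zip(a, b)]

def matmul(a, b):
    """Naive triple-loop matrix multiply of nested-list matrices."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out
```

Every iteration of the inner loop goes through the interpreter: boxed floats, dynamic dispatch, and bounds checks on each access — which is exactly why this version is hundreds of times slower than native code.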
Stage 2: Cython — Type Annotations and Memory Views
Next, participants rewrite the hot paths in Cython, learning how static type declarations and typed memory views eliminate Python's interpreter overhead:
- Static typing — telling the compiler exactly what types to expect
- Memory views — direct access to array memory without Python object overhead
- Buffer protocol — zero-copy data sharing between Python and C
- Parallel reductions — using `prange` with OpenMP for multi-core execution
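The buffer protocol that Cython memoryviews build on can be observed from pure Python with the built-in `memoryview` type. A minimal sketch:

```python
import array

# A memoryview wraps an existing buffer without copying it.
buf = array.array('d', [1.0, 2.0, 3.0, 4.0])
view = memoryview(buf)

# Slicing a memoryview is zero-copy: writes through the slice
# are visible in the original buffer.
view[1:3] = array.array('d', [20.0, 30.0])
print(buf.tolist())  # [1.0, 20.0, 30.0, 4.0]
```

Cython's typed memoryviews use the same protocol, which is how a `.pyx` function can read and write NumPy-style buffers with no per-element Python object overhead.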
```python
arrpy.set_backend('cython')
# Addition of 1M elements:  ~19ms (13x faster)
# Matrix multiply 500x500:  ~156ms (12x faster)
```
The key insight: simply adding type information to the same algorithm yields an order of magnitude improvement.
Stage 3: C++ with SIMD — Maximum Performance
The final stage introduces native C++ extensions via pybind11, with SIMD vectorization using AVX2 (x86) and NEON (ARM) intrinsics:
- SIMD vectorization — processing 4-8 floats per instruction
- Cache-aware tiling — structuring memory access to maximize L1/L2 cache hits
- pybind11 integration — exposing C++ functions to Python with minimal boilerplate
```python
arrpy.set_backend('c')
# Addition of 1M elements:  ~0.7ms (350x faster than Python)
# Matrix multiply 500x500:  ~8.3ms (220x faster than Python)
```
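A leaderboard like the workshop's needs consistent timing methodology. A hypothetical harness in the same spirit (the `bench` helper and the timed callable are illustrative, not part of ArrPy):

```python
import timeit

def bench(fn, repeat=5, number=3):
    """Return the best per-call wall-clock time (seconds) for fn().

    Taking the minimum over several repeats filters out noise from
    other processes, which matters when comparing backends.
    """
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number

a = list(range(1_000_000))
b = list(range(1_000_000))
python_add = lambda: [x + y for x, y in zip(a, b)]
print(f"pure-Python add, 1M elements: {bench(python_add) * 1e3:.1f} ms")
```

Swapping `python_add` for the same operation on another backend gives an apples-to-apples comparison, since the input data and timing method stay fixed.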
Performance Results
| Operation | Python | Cython | C++ (SIMD) | Best Speedup vs Python |
|---|---|---|---|---|
| Addition (1M elements) | 245ms | 19ms | 0.7ms | 350x |
| Matrix Multiply (500x500) | 1824ms | 156ms | 8.3ms | 220x |
| Sum (1M elements) | 187ms | 12ms | — | 16x |
| Fancy Indexing | 15ms | — | — | — |
What ArrPy Implements
The library covers 80+ NumPy-compatible operations across three backends:
- Array creation — `array`, `zeros`, `ones`, `eye`, `arange`, `linspace`
- Mathematical ops — element-wise arithmetic, trigonometric, exponential, logarithmic
- Linear algebra — matrix multiply, LU decomposition, solve, determinant
- Statistical functions — mean, std, var, percentile
- Broadcasting — full NumPy-compatible broadcasting rules
- Advanced indexing — fancy indexing, slicing, boolean masks
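NumPy's broadcasting rule aligns shapes from the trailing dimension: two sizes are compatible when they are equal or one of them is 1. The shape computation can be sketched in a few lines (an illustrative helper, not ArrPy's actual implementation):

```python
def broadcast_shape(shape_a, shape_b):
    """Compute the broadcast result shape per NumPy's rules."""
    result = []
    # Walk both shapes right-to-left, padding the shorter one with 1s.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        da = shape_a[-i] if i <= len(shape_a) else 1
        db = shape_b[-i] if i <= len(shape_b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not broadcastable")
        result.append(max(da, db))
    return tuple(reversed(result))

print(broadcast_shape((8, 1, 6), (7, 6)))  # (8, 7, 6)
```

Dimensions of size 1 are then "stretched" during the actual operation by reusing the same element, without copying data.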
Prerequisites
- Intermediate Python knowledge
- Basic understanding of how arrays work in memory
- Familiarity with profiling concepts (helpful but not required)
- No C/C++ experience required — the workshop introduces it progressively
Target Audience
This workshop is for Python developers who want to understand the performance spectrum — from "why is my Python code slow?" to "how does NumPy achieve near-C speed?" It's especially valuable for data scientists writing custom operations, library authors building Python extensions, and anyone curious about what happens below Python's abstraction layer.
Key Takeaways
Attendees will learn:
- How NumPy's core operations work internally
- Why type information matters so much for performance
- How to write Cython extensions with typed memory views
- How to create C++ Python extensions using pybind11
- What SIMD vectorization is and when it applies
- How to profile and benchmark Python code systematically
- The tradeoffs between development speed and runtime performance
Audience Feedback
“Abhik's talk on this topic was enlightening and practical. The audience was engaged throughout and left with actionable insights they could apply immediately.”
Interested in booking this talk?
I'd love to bring this topic to your event! Get in touch to discuss logistics, timing, and any specific areas you'd like me to focus on.

