ArrPy: Array You Fast Enough?
A hands-on workshop rebuilding NumPy from scratch — progressing from pure Python loops through Cython memoryviews to SIMD-optimized C++ via pybind11, with live benchmarking at each stage.
Overview
What does it take to build NumPy from scratch? And how fast can you make it? This 3-hour workshop answers both questions by guiding participants through a complete reimplementation of NumPy's core — progressing from naive Python loops to SIMD-vectorized C++ that achieves speedups of up to 350x. At each stage, participants benchmark their code on a live leaderboard, making the performance impact of every optimization technique immediately tangible.
Co-presented with Anivesh Pandey.
Workshop Details
- Format: 3-hour hands-on workshop
- Expertise Level: Intermediate
- Presented At: PyCon India 2025, Bengaluru
- Co-presenter: Anivesh Pandey
- Materials: GitHub Repository
The Optimization Journey
Stage 1: Pure Python — Understanding the Algorithms
The workshop starts with a complete array library written in pure Python. Participants implement element-wise operations, broadcasting, and matrix multiplication using nothing but Python lists and loops. The goal isn't speed — it's understanding exactly what NumPy does under the hood.
```python
arrpy.set_backend('python')
# Addition of 1M elements:  ~245ms
# Matrix multiply 500x500:  ~1824ms
```
At this stage, everything is readable and debuggable. Participants can step through matrix multiplication line by line, understanding how broadcasting rules work and why naive nested loops are slow.
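The Stage 1 approach can be sketched as follows. This is a hypothetical illustration, not ArrPy's actual source: element-wise addition and matrix multiplication using nothing but lists and loops.

```python
# Illustrative sketch of the pure-Python stage (not ArrPy's actual code).

def add(a, b):
    """Element-wise addition of two equal-length flat lists."""
    return [x + y for x, y in zip(a, b)]

def matmul(a, b):
    """Naive triple-loop matrix multiply of nested-list matrices."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out
```

Every iteration of the inner loop goes through the interpreter: boxed floats, dynamic dispatch, and bounds checks on each access — which is exactly why this version is hundreds of times slower than native code.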
Stage 2: Cython — Type Annotations and Memory Views
Next, participants rewrite the hot paths in Cython, learning how static type declarations and typed memory views eliminate Python's interpreter overhead:
- Static typing — telling the compiler exactly what types to expect
- Memory views — direct access to array memory without Python object overhead
- Buffer protocol — zero-copy data sharing between Python and C
- Parallel reductions — using `prange` with OpenMP for multi-core execution
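The buffer protocol that Cython memoryviews build on can be observed from pure Python with the built-in `memoryview` type. A minimal sketch:

```python
import array

# A memoryview wraps an existing buffer without copying it.
buf = array.array('d', [1.0, 2.0, 3.0, 4.0])
view = memoryview(buf)

# Slicing a memoryview is zero-copy: writes through the slice
# are visible in the original buffer.
view[1:3] = array.array('d', [20.0, 30.0])
print(buf.tolist())  # [1.0, 20.0, 30.0, 4.0]
```

Cython's typed memoryviews use the same protocol, which is how a `.pyx` function can read and write NumPy-style buffers with no per-element Python object overhead.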
```python
arrpy.set_backend('cython')
# Addition of 1M elements:  ~19ms (13x faster)
# Matrix multiply 500x500:  ~156ms (12x faster)
```
The key insight: simply adding type information to the same algorithm yields an order of magnitude improvement.
Stage 3: C++ with SIMD — Maximum Performance
The final stage introduces native C++ extensions via pybind11, with SIMD vectorization using AVX2 (x86) and NEON (ARM) intrinsics:
- SIMD vectorization — processing 4-8 floats per instruction
- Cache-aware tiling — structuring memory access to maximize L1/L2 cache hits
- pybind11 integration — exposing C++ functions to Python with minimal boilerplate
```python
arrpy.set_backend('c')
# Addition of 1M elements:  ~0.7ms (350x faster than Python)
# Matrix multiply 500x500:  ~8.3ms (220x faster than Python)
```
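A leaderboard like the workshop's needs consistent timing methodology. A hypothetical harness in the same spirit (the `bench` helper and the timed callable are illustrative, not part of ArrPy):

```python
import timeit

def bench(fn, repeat=5, number=3):
    """Return the best per-call wall-clock time (seconds) for fn().

    Taking the minimum over several repeats filters out noise from
    other processes, which matters when comparing backends.
    """
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number

a = list(range(1_000_000))
b = list(range(1_000_000))
python_add = lambda: [x + y for x, y in zip(a, b)]
print(f"pure-Python add, 1M elements: {bench(python_add) * 1e3:.1f} ms")
```

Swapping `python_add` for the same operation on another backend gives an apples-to-apples comparison, since the input data and timing method stay fixed.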
Performance Results
| Operation | Python | Cython | C++ (SIMD) | Best Speedup vs Python |
|---|---|---|---|---|
| Addition (1M elements) | 245ms | 19ms | 0.7ms | 350x |
| Matrix Multiply (500x500) | 1824ms | 156ms | 8.3ms | 220x |
| Sum (1M elements) | 187ms | 12ms | — | 16x |
| Fancy Indexing | 15ms | — | — | — |
What ArrPy Implements
The library covers 80+ NumPy-compatible operations across three backends:
- Array creation — `array`, `zeros`, `ones`, `eye`, `arange`, `linspace`
- Mathematical ops — element-wise arithmetic, trigonometric, exponential, logarithmic
- Linear algebra — matrix multiply, LU decomposition, solve, determinant
- Statistical functions — mean, std, var, percentile
- Broadcasting — full NumPy-compatible broadcasting rules
- Advanced indexing — fancy indexing, slicing, boolean masks
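NumPy's broadcasting rule aligns shapes from the trailing dimension: two sizes are compatible when they are equal or one of them is 1. The shape computation can be sketched in a few lines (an illustrative helper, not ArrPy's actual implementation):

```python
def broadcast_shape(shape_a, shape_b):
    """Compute the broadcast result shape per NumPy's rules."""
    result = []
    # Walk both shapes right-to-left, padding the shorter one with 1s.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        da = shape_a[-i] if i <= len(shape_a) else 1
        db = shape_b[-i] if i <= len(shape_b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not broadcastable")
        result.append(max(da, db))
    return tuple(reversed(result))

print(broadcast_shape((8, 1, 6), (7, 6)))  # (8, 7, 6)
```

Dimensions of size 1 are then "stretched" during the actual operation by reusing the same element, without copying data.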
Prerequisites
- Intermediate Python knowledge
- Basic understanding of how arrays work in memory
- Familiarity with profiling concepts (helpful but not required)
- No C/C++ experience required — the workshop introduces it progressively
Target Audience
This workshop is for Python developers who want to understand the performance spectrum — from "why is my Python code slow?" to "how does NumPy achieve near-C speed?" It's especially valuable for data scientists writing custom operations, library authors building Python extensions, and anyone curious about what happens below Python's abstraction layer.
Key Takeaways
Attendees will learn:
- How NumPy's core operations work internally
- Why type information matters so much for performance
- How to write Cython extensions with typed memory views
- How to create C++ Python extensions using pybind11
- What SIMD vectorization is and when it applies
- How to profile and benchmark Python code systematically
- The tradeoffs between development speed and runtime performance
Audience Feedback
“Abhik's talk on this topic was enlightening and practical. The audience was engaged throughout and left with actionable insights they could apply immediately.”
Interested in booking this talk?
I'd love to bring this topic to your event! Get in touch to discuss logistics, timing, and any specific areas you'd like me to focus on.

