
ArrPy: Array You Fast Enough?

A hands-on workshop rebuilding NumPy from scratch — progressing from pure Python loops through Cython memoryviews to SIMD-optimized C++ via pybind11, with live benchmarking at each stage.

180 minutes
Bengaluru, India
Intermediate
Previously presented: this talk has been well-received at previous events.



Overview

What does it take to build NumPy from scratch? And how fast can you make it? This 3-hour workshop answers both questions by guiding participants through a complete reimplementation of NumPy's core — progressing from naive Python loops to SIMD-vectorized C++ that achieves speedups of over 100x. At each stage, participants benchmark their code on a live leaderboard, making the performance impact of every optimization technique immediately tangible.

Co-presented with Anivesh Pandey.

Workshop Details

  • Format: 3-hour hands-on workshop
  • Expertise Level: Intermediate
  • Presented At: PyCon India 2025, Bengaluru
  • Co-presenter: Anivesh Pandey
  • Materials: GitHub Repository

The Optimization Journey

Stage 1: Pure Python — Understanding the Algorithms

The workshop starts with a complete array library written in pure Python. Participants implement element-wise operations, broadcasting, and matrix multiplication using nothing but Python lists and loops. The goal isn't speed — it's understanding exactly what NumPy does under the hood.

```python
arrpy.set_backend('python')
# Addition of 1M elements: ~245ms
# Matrix multiply 500x500: ~1824ms
```

At this stage, everything is readable and debuggable. Participants can step through matrix multiplication line by line, understanding how broadcasting rules work and why naive nested loops are slow.
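To make the starting point concrete, here is a minimal sketch of what such pure-Python kernels might look like. The function names are illustrative, not ArrPy's actual API:

```python
def add_elementwise(a, b):
    """Element-wise addition of two equal-length flat lists."""
    return [x + y for x, y in zip(a, b)]

def matmul(a, b):
    """Naive triple-loop matrix multiply on lists of lists.

    a is n x k, b is k x m; every scalar multiply-add goes through
    the interpreter, which is exactly why this backend is slow.
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out
```

Every later stage implements these same algorithms; only the execution machinery changes.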

Stage 2: Cython — Type Annotations and Memory Views

Next, participants rewrite the hot paths in Cython, learning how static type declarations and typed memory views eliminate Python's interpreter overhead:

  • Static typing — telling the compiler exactly what types to expect
  • Memory views — direct access to array memory without Python object overhead
  • Buffer protocol — zero-copy data sharing between Python and C
  • Parallel reductions — using prange with OpenMP for multi-core execution

```python
arrpy.set_backend('cython')
# Addition of 1M elements: ~19ms (13x faster)
# Matrix multiply 500x500: ~156ms (12x faster)
```

The key insight: simply adding type information to the same algorithm yields an order of magnitude improvement.
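The buffer protocol that Cython's typed memory views build on can be demonstrated from pure Python: a `memoryview` exposes an array's underlying buffer with no copying, so writes through the view are visible in the original object. This is a small illustrative demo, not ArrPy code:

```python
from array import array

data = array('d', [1.0, 2.0, 3.0, 4.0])

# memoryview shares data's buffer: no copy is made.
view = memoryview(data)
view[0] = 10.0            # writes through to the underlying array
assert data[0] == 10.0

# Slicing a memoryview is also zero-copy.
half = view[2:]
half[0] = 30.0
assert data[2] == 30.0
```

Cython's `double[:]` memory views use this same mechanism, which is why they can hand raw buffers to compiled loops without touching Python objects.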

Stage 3: C++ with SIMD — Maximum Performance

The final stage introduces native C++ extensions via pybind11, with SIMD vectorization using AVX2 (x86) and NEON (ARM) intrinsics:

  • SIMD vectorization — processing 4-8 floats per instruction
  • Cache-aware tiling — structuring memory access to maximize L1/L2 cache hits
  • pybind11 integration — exposing C++ functions to Python with minimal boilerplate

```python
arrpy.set_backend('c')
# Addition of 1M elements: ~0.7ms (350x faster than Python)
# Matrix multiply 500x500: ~8.3ms (220x faster than Python)
```
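The cache-aware tiling idea can be sketched in Python, even though it only pays off in compiled code; the point here is the loop structure, which walks the matrices in small blocks so each block stays resident in cache. Names and the tile size are illustrative:

```python
def matmul_tiled(a, b, tile=64):
    """Blocked matrix multiply: process tile x tile sub-blocks so the
    working set of each inner loop fits in L1/L2 cache."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for pp in range(0, k, tile):
            for jj in range(0, m, tile):
                # Multiply one block of a against one block of b.
                for i in range(ii, min(ii + tile, n)):
                    for p in range(pp, min(pp + tile, k)):
                        aip = a[i][p]
                        row_b = b[p]
                        row_o = out[i]
                        for j in range(jj, min(jj + tile, m)):
                            row_o[j] += aip * row_b[j]
    return out
```

In the C++ backend the innermost loop over `j` is additionally the one the SIMD intrinsics vectorize, processing several floats per instruction.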

Performance Results

| Operation | Python | Cython | C++ (SIMD) | Speedup |
|---|---|---|---|---|
| Addition (1M elements) | 245ms | 19ms | 0.7ms | 350x |
| Matrix Multiply (500x500) | 1824ms | 156ms | 8.3ms | 220x |
| Sum (1M elements) | 187ms | 12ms | — | 16x |
| Fancy Indexing | 15ms | — | — | — |

What ArrPy Implements

The library covers 80+ NumPy-compatible operations across three backends:

  • Array creation — array, zeros, ones, eye, arange, linspace
  • Mathematical ops — element-wise arithmetic, trigonometric, exponential, logarithmic
  • Linear algebra — matrix multiply, LU decomposition, solve, determinant
  • Statistical functions — mean, std, var, percentile
  • Broadcasting — full NumPy-compatible broadcasting rules
  • Advanced indexing — fancy indexing, slicing, boolean masks
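The shape-matching half of those broadcasting rules is compact enough to sketch in a few lines: shapes are aligned from the right, and each pair of dimensions must either match or include a 1. This is an illustrative implementation, not ArrPy's:

```python
from itertools import zip_longest

def broadcast_shape(s1, s2):
    """Compute the broadcast result shape under NumPy's rules."""
    out = []
    # Align from the trailing dimension; pad the shorter shape with 1s.
    for d1, d2 in zip_longest(reversed(s1), reversed(s2), fillvalue=1):
        if d1 == d2 or d1 == 1 or d2 == 1:
            out.append(max(d1, d2))
        else:
            raise ValueError(f"shapes {s1} and {s2} are not broadcastable")
    return tuple(reversed(out))
```

For example, a (2, 1, 3) array and a (4, 3) array broadcast to (2, 4, 3), while (2, 3) against (4, 3) raises an error.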

Prerequisites

  • Intermediate Python knowledge
  • Basic understanding of how arrays work in memory
  • Familiarity with profiling concepts (helpful but not required)
  • No C/C++ experience required — the workshop introduces it progressively

Target Audience

This workshop is for Python developers who want to understand the performance spectrum — from "why is my Python code slow?" to "how does NumPy achieve near-C speed?" It's especially valuable for data scientists writing custom operations, library authors building Python extensions, and anyone curious about what happens below Python's abstraction layer.

Key Takeaways

Attendees will learn:

  • How NumPy's core operations work internally
  • Why type information matters so much for performance
  • How to write Cython extensions with typed memory views
  • How to create C++ Python extensions using pybind11
  • What SIMD vectorization is and when it applies
  • How to profile and benchmark Python code systematically
  • The tradeoffs between development speed and runtime performance
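
The benchmarking style behind the workshop's leaderboard can be approximated with the standard-library `timeit` module: time each candidate implementation of the same operation over repeated runs and keep the minimum. The two functions below are stand-ins, not ArrPy backends:

```python
import timeit

def add_loop(a, b):
    """Element-wise add with an explicit loop."""
    out = []
    for x, y in zip(a, b):
        out.append(x + y)
    return out

def add_comprehension(a, b):
    """Element-wise add with a list comprehension."""
    return [x + y for x, y in zip(a, b)]

a = list(range(100_000))
b = list(range(100_000))

for fn in (add_loop, add_comprehension):
    # number=10 calls per timing, repeat=3 timings; take the best.
    t = min(timeit.repeat(lambda: fn(a, b), number=10, repeat=3))
    print(f"{fn.__name__}: {t * 100:.3f} ms per call")
```

Taking the minimum of several repeats filters out interference from other processes, which is why it is the conventional choice for micro-benchmarks.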

Audience Feedback

“Abhik's talk on this topic was enlightening and practical. The audience was engaged throughout and left with actionable insights they could apply immediately.”
— Conference Organizer, PyCon 2023



Interested in booking this talk?

I'd love to bring this topic to your event! Get in touch to discuss logistics, timing, and any specific areas you'd like me to focus on.

About the Speaker

Abhik Sarkar


AI researcher and engineer specializing in machine learning systems. Passionate about making complex AI concepts accessible.
