GPU Programming 101 in Python
A hands-on workshop introducing GPU kernel programming using Triton — write high-performance GPU kernels in pure Python without needing deep CUDA expertise.
Overview
GPUs have become essential for modern computing — from training deep learning models to processing massive datasets. But writing GPU code has traditionally meant learning CUDA C++, a steep barrier for Python developers. This workshop introduces Triton, a Python-based language for GPU kernel programming that achieves near-CUDA performance while staying in the Python ecosystem. Through three progressively more complex examples, attendees learn to write, optimize, and benchmark GPU kernels from scratch.
Workshop Details
- Format: Hands-on workshop with Jupyter notebook
- Expertise Level: Intermediate (requires basic PyTorch and GPU knowledge)
- Presented At: BangPypers December Meetup 2025, Flexera, Bengaluru
- Materials: GitHub Repository
Workshop Content
1. Vector Addition — Core Triton Concepts
The first exercise strips GPU programming down to its essentials. Attendees learn the fundamental building blocks that every Triton kernel uses:
- `@triton.jit` — the decorator that compiles Python to GPU machine code
- `tl.program_id()` — how each kernel instance (program) identifies which block of work it handles
- `tl.arange()` — creating index ranges for parallel data access
- `tl.load()` / `tl.store()` — reading from and writing to GPU memory
- Masking — safely handling boundary conditions when data doesn't divide evenly into blocks
This exercise establishes the mental model: GPU kernels process data in parallel blocks, and every kernel follows a load-compute-store pattern.
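For reference, a minimal sketch of this kind of kernel (names and block size are illustrative, not the exact notebook code):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)          # load
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)    # compute + store

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```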
2. Fused Softmax — Why Kernel Fusion Matters
The second exercise demonstrates the single most impactful optimization in GPU programming: kernel fusion. Most GPU workloads are memory-bandwidth limited, not compute limited. A naive softmax implementation reads data from global memory multiple times (for max, subtract, exp, sum, divide). A fused kernel reads the data once, performs all operations in fast registers, and writes the result once.
Key insights covered:
- GPUs can compute far faster than they can read/write memory
- Fusing operations reduces memory traffic by 2-4x
- Numerical stability through max subtraction is critical for production softmax
- Real-world impact: this is the same principle behind Flash Attention
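A condensed sketch of a fused, numerically stable row-wise softmax in this style (assuming each row fits in a single block; names are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride,
                   n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance handles one full row: the row is read once and written once.
    row = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    x = tl.load(in_ptr + row * in_row_stride + col_offsets,
                mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)            # subtract the row max for numerical stability
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + col_offsets, y, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)      # whole row in one block
    softmax_kernel[(n_rows,)](out, x, x.stride(0), out.stride(0),
                              n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out
```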
3. Matrix Multiplication — Tiling and Autotuning
The final exercise tackles the most compute-intensive operation in deep learning. Attendees implement a tiled matrix multiplication that leverages:
- 2D tiling — splitting matrices into blocks that fit in fast shared memory
- `tl.dot()` — utilizing tensor cores for 10x+ speedup on matrix operations
- L2 cache optimization — "swizzling" block access patterns to maximize cache hits
- `@triton.autotune` — automatically searching for the optimal tile size, number of warps, and pipeline stages for the target GPU
This exercise shows that with the right tiling strategy, a Triton kernel can match or approach cuBLAS performance.
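A condensed sketch of a tiled, autotuned matmul kernel in this spirit (the swizzled block ordering is omitted for brevity, the config list is deliberately tiny, and names are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=4, num_stages=3),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=8, num_stages=3),
    ],
    key=["M", "N", "K"],   # re-tune when the problem shape changes
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Wrap row/col indices so tile loads never run past the matrix.
    offs_am = (pid_m * BLOCK_M + tl.arange(0, BLOCK_M)) % M
    offs_bn = (pid_n * BLOCK_N + tl.arange(0, BLOCK_N)) % N
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_k[None, :] + k) < K, other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k) < K, other=0.0)
        acc += tl.dot(a, b)              # maps onto tensor cores
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    c_ptrs = c_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]), triton.cdiv(N, meta["BLOCK_N"]))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1))
    return c
```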
Key Concepts Summary
| Concept | Why It Matters |
|---|---|
| Masking | Safe boundary handling without branching |
| Kernel Fusion | Reduces memory bandwidth bottleneck |
| Tiling | Enables data reuse in fast on-chip memory |
| Autotuning | Optimal configuration varies by GPU hardware |
| Tensor Cores | 10x+ speedup for matrix operations via tl.dot() |
Prerequisites
- Python 3.8+
- Basic PyTorch knowledge
- Understanding of GPU concepts (threads, blocks, memory hierarchy)
- NVIDIA GPU with CUDA support
Target Audience
This workshop is designed for Python developers who want to move beyond PyTorch's high-level APIs and understand what happens on the GPU. It's especially valuable for ML engineers optimizing inference pipelines, researchers writing custom CUDA operations, and anyone curious about how GPU kernels actually work — without leaving Python.
Key Takeaways
Attendees will learn:
- How to write GPU kernels in pure Python using Triton
- The load-compute-store pattern that underlies all GPU programming
- Why kernel fusion delivers massive speedups for memory-bound operations
- How tiling strategies enable efficient matrix operations
- How to use autotuning to optimize kernels for specific GPU hardware
- How to integrate custom Triton kernels with PyTorch
Challenge Exercises
For attendees who want to go further after the workshop:
- Fused Multiply-Add — compute `x * y + z` in a single kernel (a starter skeleton follows this list)
- Layer Normalization — implement fused LayerNorm
- Flash Attention — build a simplified attention kernel
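As a starting point for the first challenge, a possible skeleton that reuses the vector-addition pattern (names are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def fma_kernel(x_ptr, y_ptr, z_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    z = tl.load(z_ptr + offsets, mask=mask)
    # One read per input, one write of the result — no intermediate x*y tensor in global memory.
    tl.store(out_ptr + offsets, x * y + z, mask=mask)
```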
Technical Setup
```bash
pip install triton torch
```
Tested with Triton 3.1.0, PyTorch 2.5.0+cu118, and CUDA 11.8.
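A quick sanity check you can run before the workshop to confirm the GPU stack is visible:

```python
import torch
import triton

print("Triton:", triton.__version__)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```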
Audience Feedback
“Abhik's talk on this topic was enlightening and practical. The audience was engaged throughout and left with actionable insights they could apply immediately.”
Interested in booking this talk?
I'd love to bring this topic to your event! Get in touch to discuss logistics, timing, and any specific areas you'd like me to focus on.

