GPU Programming 101 in Python
A hands-on workshop introducing GPU kernel programming using Triton — write high-performance GPU kernels in pure Python without needing deep CUDA expertise.
Overview
GPUs have become essential for modern computing — from training deep learning models to processing massive datasets. But writing GPU code has traditionally meant learning CUDA C++, a steep barrier for Python developers. This workshop introduces Triton, a Python-based language for GPU kernel programming that achieves near-CUDA performance while staying in the Python ecosystem. Through three progressively more complex examples, attendees learn to write, optimize, and benchmark GPU kernels from scratch.
Workshop Details
- Format: Hands-on workshop with Jupyter notebook
- Expertise Level: Intermediate (requires basic PyTorch and GPU knowledge)
- Presented At: BangPypers December Meetup 2025, Flexera, Bengaluru
- Materials: GitHub Repository
Workshop Content
1. Vector Addition — Core Triton Concepts
The first exercise strips GPU programming down to its essentials. Attendees learn the fundamental building blocks that every Triton kernel uses:
- `@triton.jit` — the decorator that compiles Python to GPU machine code
- `tl.program_id()` — how each kernel instance (program) identifies which block of work it handles
- `tl.arange()` — creating index ranges for parallel data access
- `tl.load()` / `tl.store()` — reading from and writing to GPU memory
- Masking — safely handling boundary conditions when data doesn't divide evenly into blocks
This exercise establishes the mental model: GPU kernels process data in parallel blocks, and every kernel follows a load-compute-store pattern.
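For reference, a minimal sketch of this kind of kernel (names and block size are illustrative, not the exact notebook code):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)          # load
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)    # compute + store

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```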
2. Fused Softmax — Why Kernel Fusion Matters
The second exercise demonstrates the single most impactful optimization in GPU programming: kernel fusion. Most GPU workloads are memory-bandwidth limited, not compute limited. A naive softmax implementation reads data from global memory multiple times (for max, subtract, exp, sum, divide). A fused kernel reads the data once, performs all operations in fast registers, and writes the result once.
Key insights covered:
- GPUs can compute far faster than they can read/write memory
- Fusing operations reduces memory traffic by 2-4x
- Numerical stability through max subtraction is critical for production softmax
- Real-world impact: this is the same principle behind Flash Attention
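A condensed sketch of a fused, numerically stable row-wise softmax in this style (assuming each row fits in a single block; names are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride,
                   n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance handles one full row: the row is read once and written once.
    row = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    x = tl.load(in_ptr + row * in_row_stride + col_offsets,
                mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)            # subtract the row max for numerical stability
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + col_offsets, y, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)      # whole row in one block
    softmax_kernel[(n_rows,)](out, x, x.stride(0), out.stride(0),
                              n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out
```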
3. Matrix Multiplication — Tiling and Autotuning
The final exercise tackles the most compute-intensive operation in deep learning. Attendees implement a tiled matrix multiplication that leverages:
- 2D tiling — splitting matrices into blocks that fit in fast shared memory
- `tl.dot()` — utilizing tensor cores for 10x+ speedup on matrix operations
- L2 cache optimization — "swizzling" block access patterns to maximize cache hits
- `@triton.autotune` — automatically searching for the optimal tile size, number of warps, and pipeline stages for the target GPU
This exercise shows that with the right tiling strategy, a Triton kernel can match or approach cuBLAS performance.
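A condensed sketch of a tiled, autotuned matmul kernel in this spirit (the swizzled block ordering is omitted for brevity, the config list is deliberately tiny, and names are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=4, num_stages=3),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=8, num_stages=3),
    ],
    key=["M", "N", "K"],   # re-tune when the problem shape changes
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Wrap row/col indices so tile loads never run past the matrix.
    offs_am = (pid_m * BLOCK_M + tl.arange(0, BLOCK_M)) % M
    offs_bn = (pid_n * BLOCK_N + tl.arange(0, BLOCK_N)) % N
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_k[None, :] + k) < K, other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k) < K, other=0.0)
        acc += tl.dot(a, b)              # maps onto tensor cores
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    c_ptrs = c_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]), triton.cdiv(N, meta["BLOCK_N"]))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1))
    return c
```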
Key Concepts Summary
| Concept | Why It Matters |
|---|---|
| Masking | Safe boundary handling without branching |
| Kernel Fusion | Reduces memory bandwidth bottleneck |
| Tiling | Enables data reuse in fast on-chip memory |
| Autotuning | Optimal configuration varies by GPU hardware |
| Tensor Cores | 10x+ speedup for matrix operations via tl.dot() |
Prerequisites
- Python 3.8+
- Basic PyTorch knowledge
- Understanding of GPU concepts (threads, blocks, memory hierarchy)
- NVIDIA GPU with CUDA support
Target Audience
This workshop is designed for Python developers who want to move beyond PyTorch's high-level APIs and understand what happens on the GPU. It's especially valuable for ML engineers optimizing inference pipelines, researchers writing custom CUDA operations, and anyone curious about how GPU kernels actually work — without leaving Python.
Key Takeaways
Attendees will learn:
- How to write GPU kernels in pure Python using Triton
- The load-compute-store pattern that underlies all GPU programming
- Why kernel fusion delivers massive speedups for memory-bound operations
- How tiling strategies enable efficient matrix operations
- How to use autotuning to optimize kernels for specific GPU hardware
- How to integrate custom Triton kernels with PyTorch
Challenge Exercises
For attendees who want to go further after the workshop:
- Fused Multiply-Add — compute `x * y + z` in a single kernel (a starter skeleton follows this list)
- Layer Normalization — implement fused LayerNorm
- Flash Attention — build a simplified attention kernel
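As a starting point for the first challenge, a possible skeleton that reuses the vector-addition pattern (names are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def fma_kernel(x_ptr, y_ptr, z_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    z = tl.load(z_ptr + offsets, mask=mask)
    # One read per input, one write of the result — no intermediate x*y tensor in global memory.
    tl.store(out_ptr + offsets, x * y + z, mask=mask)
```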
Technical Setup
```bash
pip install triton torch
```
Tested with Triton 3.1.0, PyTorch 2.5.0+cu118, and CUDA 11.8.
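A quick sanity check you can run before the workshop to confirm the GPU stack is visible:

```python
import torch
import triton

print("Triton:", triton.__version__)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```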
Audience Feedback
“Abhik's talk on this topic was enlightening and practical. The audience was engaged throughout and left with actionable insights they could apply immediately.”
Interested in booking this talk?
I'd love to bring this topic to your event! Get in touch to discuss logistics, timing, and any specific areas you'd like me to focus on.

