Previously Presented

GPU Programming 101 in Python

A hands-on workshop introducing GPU kernel programming using Triton — write high-performance GPU kernels in pure Python without needing deep CUDA expertise.

Workshop
Flexera, Indiqube Logos, MG Road, Bengaluru
Intermediate
This talk has been well-received at previous events.

Overview

GPUs have become essential for modern computing — from training deep learning models to processing massive datasets. But writing GPU code has traditionally meant learning CUDA C++, a steep barrier for Python developers. This workshop introduces Triton, a Python-based language for GPU kernel programming that achieves near-CUDA performance while staying in the Python ecosystem. Through three progressively complex examples, attendees learn to write, optimize, and benchmark GPU kernels from scratch.

Workshop Details

  • Format: Hands-on workshop with a Jupyter notebook
  • Expertise Level: Intermediate (requires basic PyTorch and GPU knowledge)
  • Presented At: BangPypers December Meetup 2025, Flexera, Bengaluru
  • Materials: GitHub Repository

Workshop Content

1. Vector Addition — Core Triton Concepts

The first exercise strips GPU programming down to its essentials. Attendees learn the fundamental building blocks that every Triton kernel uses:

  • @triton.jit — the decorator that compiles Python to GPU machine code
  • tl.program_id() — how the GPU identifies which block of work each thread handles
  • tl.arange() — creating index ranges for parallel data access
  • tl.load() / tl.store() — reading from and writing to GPU memory
  • Masking — safely handling boundary conditions when data doesn't divide evenly into blocks

This exercise establishes the mental model: GPU kernels process data in parallel blocks, and every kernel follows a load-compute-store pattern.
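The blocked execution model can be sketched on the CPU. The following is a pure-Python illustration, not actual Triton code: the `pid` loop stands in for the GPU launch grid (`tl.program_id`), the `offsets` list mirrors `tl.arange`, and the bounds check plays the role of the mask passed to `tl.load`/`tl.store`.

```python
import math

def vector_add(x, y, block_size=4):
    """CPU model of a blocked, masked vector-add kernel."""
    n = len(x)
    out = [0] * n
    num_programs = math.ceil(n / block_size)  # size of the launch grid
    for pid in range(num_programs):           # each pid = one kernel instance
        # tl.arange-style offsets for this block
        offsets = [pid * block_size + i for i in range(block_size)]
        for off in offsets:
            if off < n:                       # mask: skip out-of-bounds lanes
                out[off] = x[off] + y[off]    # load, compute, store
    return out

print(vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # prints [11, 22, 33, 44, 55]
```

Note that with n = 5 and block_size = 4, the second program instance has three out-of-bounds offsets; the mask is what keeps those lanes from reading or writing past the end of the array.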

2. Fused Softmax — Why Kernel Fusion Matters

The second exercise demonstrates the single most impactful optimization in GPU programming: kernel fusion. Most GPU workloads are memory-bandwidth limited, not compute limited. A naive softmax implementation reads data from global memory multiple times (for max, subtract, exp, sum, divide). A fused kernel reads the data once, performs all operations in fast registers, and writes the result once.

Key insights covered:

  • GPUs can compute far faster than they can read/write memory
  • Fusing operations reduces memory traffic by 2-4x
  • Numerical stability through max subtraction is critical for production softmax
  • Real-world impact: this is the same principle behind Flash Attention
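The per-row computation the fused kernel performs can be sketched in plain Python. This is only an illustration of the numerics, not the kernel itself: in the fused Triton version, all of these steps happen in registers after a single read of the row from global memory.

```python
import math

def softmax_row(row):
    """Numerically stable softmax for one row."""
    m = max(row)                            # row max
    exps = [math.exp(v - m) for v in row]   # shift before exp to avoid overflow
    s = sum(exps)
    return [e / s for e in exps]

# Without the max subtraction, exp(1000) would overflow to inf;
# with it, the largest exponent is exp(0) = 1.
stable = softmax_row([1000.0, 1001.0, 1002.0])
```

Shifting by the max leaves the result unchanged mathematically (the factor exp(-m) cancels between numerator and denominator) but keeps every intermediate value in a safe floating-point range.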

3. Matrix Multiplication — Tiling and Autotuning

The final exercise tackles the most compute-intensive operation in deep learning. Attendees implement a tiled matrix multiplication that leverages:

  • 2D tiling — splitting matrices into blocks that fit in fast shared memory
  • tl.dot() — utilizing tensor cores for 10x+ speedup on matrix operations
  • L2 cache optimization — "swizzling" block access patterns to maximize cache hits
  • @triton.autotune — automatically searching for the optimal tile size, number of warps, and pipeline stages for the target GPU

This exercise shows that with the right tiling strategy, a Triton kernel can match or approach cuBLAS performance.
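The tiling idea can be modeled on the CPU. In this pure-Python sketch, each (m0, n0) pair corresponds to one GPU program instance, and the inner k0 loop accumulates one tile-product at a time, mirroring what a Triton kernel does with `tl.dot()` on blocks staged in fast on-chip memory. (On a real GPU the tiles are what make data reuse possible; here the loops only illustrate the traversal order.)

```python
def tiled_matmul(A, B, block=2):
    """Tiled matrix multiply: C = A @ B, accumulated block by block."""
    M, K = len(A), len(A[0])
    K2, N = len(B), len(B[0])
    assert K == K2, "inner dimensions must match"
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, block):          # one (m0, n0) tile per "program"
        for n0 in range(0, N, block):
            for k0 in range(0, K, block):  # accumulate over K tiles
                for m in range(m0, min(m0 + block, M)):
                    for n in range(n0, min(n0 + block, N)):
                        for k in range(k0, min(k0 + block, K)):
                            C[m][n] += A[m][k] * B[k][n]
    return C
```

The `min(..., M)` bounds play the same boundary-masking role as in the vector-add exercise, so the tiling works even when the matrix dimensions are not multiples of the block size.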

Key Concepts Summary

  • Masking: safe boundary handling without branching
  • Kernel Fusion: reduces the memory-bandwidth bottleneck
  • Tiling: enables data reuse in fast on-chip memory
  • Autotuning: the optimal configuration varies by GPU hardware
  • Tensor Cores: 10x+ speedup for matrix operations via tl.dot()

Prerequisites

  • Python 3.8+
  • Basic PyTorch knowledge
  • Understanding of GPU concepts (threads, blocks, memory hierarchy)
  • NVIDIA GPU with CUDA support

Target Audience

This workshop is designed for Python developers who want to move beyond PyTorch's high-level APIs and understand what happens on the GPU. It's especially valuable for ML engineers optimizing inference pipelines, researchers writing custom CUDA operations, and anyone curious about how GPU kernels actually work — without leaving Python.

Key Takeaways

Attendees will learn:

  • How to write GPU kernels in pure Python using Triton
  • The load-compute-store pattern that underlies all GPU programming
  • Why kernel fusion delivers massive speedups for memory-bound operations
  • How tiling strategies enable efficient matrix operations
  • How to use autotuning to optimize kernels for specific GPU hardware
  • How to integrate custom Triton kernels with PyTorch

Challenge Exercises

For attendees who want to go further after the workshop:

  1. Fused Multiply-Add — compute x * y + z in a single kernel
  2. Layer Normalization — implement fused LayerNorm
  3. Flash Attention — build a simplified attention kernel

Technical Setup

pip install triton torch

Tested with Triton 3.1.0, PyTorch 2.5.0+cu118, and CUDA 11.8.

Audience Feedback

“Abhik's talk on this topic was enlightening and practical. The audience was engaged throughout and left with actionable insights they could apply immediately.”
— Conference Organizer, PyCon 2023

Interested in booking this talk?

I'd love to bring this topic to your event! Get in touch to discuss logistics, timing, and any specific areas you'd like me to focus on.

About the Speaker

Abhik Sarkar

AI researcher and engineer specializing in machine learning systems. Passionate about making complex AI concepts accessible.
