Sitemap

A visual representation of the site structure to help you navigate the content.

Site Structure

Main landing page with introduction and recent articles

About/about

Learn more about me, my background, and expertise

Speaking/speaking

My talks, presentations, and speaking engagements

Articles/articles

Collection of articles I've written on various topics

NVIDIA Xid Errors/articles/nvidia-xid-errors

Article content

GPU Xid 31 MMU Faults/articles/gpu-xid31-mmu-faults

Article content

Numerical Sensitivity/articles/numerical-sensitivity

Article content

SAM Multi-Mask Ambiguity/articles/sam-multi-mask-ambiguity

Article content

Visualizing YOLOv11/articles/visualizing-yolov11

Article content

H.264 Implementation Applications/articles/h264-implementation-applications

Article content

H.264 Transform Quantization/articles/h264-transform-quantization

Article content

H.264 Fundamentals/articles/h264-fundamentals

Article content

Zettel/articles/zettel

Article content

Compiling PyTorch Kernel/articles/compiling-pytorch-kernel

Article content

View Size Not Compatible/articles/view-size-not-compatible

Article content

GPU Boot Errors/articles/gpu-boot-errors

Article content

H.264 Interactive Guide/articles/h264-interactive-guide

Article content

GGML Structure/articles/ggml-structure

Article content

Quantization Deep Dive/articles/quantization-deep-dive

Article content

How TensorRT Works/articles/how-tensorrt-works

Article content

Kernel Fusion/articles/kernel-fusion

Article content

Visualizing YOLOv5/articles/visualizing-yolov5

Article content

Production Logging/articles/production-logging

Article content

CPython Internals/articles/cpython-internals

Article content

C++ Compilation Process/articles/cpp-compilation-process

Article content

C++ Linking in Depth/articles/cpp-linking-in-depth

Article content

C++ Loading Runtime/articles/cpp-loading-runtime

Article content

Registry Pattern/articles/registry-pattern

Article content

Magic Numbers/articles/magic-numbers

Article content

Image Encoding/articles/image-encoding

Article content

Text Encoding/articles/text-encoding

Article content

Papers/papers

Research papers and publications

DDPM/papers/ddpm

Paper content

Flow Matching/papers/flow-matching

Paper content

Latent Diffusion/papers/latent-diffusion

Paper content

BEiT/papers/beit

Paper content

DINOv2/papers/dinov2

Paper content

I-JEPA/papers/ijepa

Paper content

V-JEPA 2/papers/vjepa2

Paper content

BYOL/papers/byol

Paper content

DINO/papers/dino

Paper content

MAE/papers/mae

Paper content

MoCo/papers/moco

Paper content

SimCLR/papers/simclr

Paper content

V-JEPA/papers/vjepa

Paper content

VICReg/papers/vicreg

Paper content

Visual Instruction Tuning/papers/visual-instruction-tuning

Paper content

ViT Object Detection/papers/vit-object-detection

Paper content

YOLO/papers/yolo

Paper content

EfficientNet/papers/efficientnet

Paper content

Faster R-CNN/papers/faster-rcnn

Paper content

SAM/papers/sam

Paper content

DETR/papers/DETR

Paper content

BLIP-2/papers/blip2

Paper content

Image Worth 16x16/papers/image-worth-16x16

Paper content

Optimizing Transformer Inference/papers/optimizing-transformer-inference

Paper content

SURF/papers/surf

Paper content

Swin Transformer/papers/swin-transformer

Paper content

CLIP/papers/clip

Paper content

Deep Learning Go Brr/papers/deeplearning-go-brr

Paper content

Attention Is All You Need/papers/attention-is-all-you-need

Paper content

Data Movement Transformer/papers/data-movement-transformer

Paper content

Deep Residual Learning/papers/deep-residual-learning

Paper content

Concepts/concepts

Interactive explanations of machine learning concepts

Deep dive into C++ memory allocation — stack frame internals, heap allocator mechanics, fragmentation, performance benchmarks, custom allocators, RAII, and debugging with AddressSanitizer and Valgrind.

Complete guide to Slurm — architecture, core commands, job lifecycle, job scripts, array jobs, dependencies, monitoring with squeue/sacct, and troubleshooting failed jobs on HPC clusters.

initramfs: The Initial RAM Filesystem Explained/concepts/linux/initramfs-boot-process

Learn how initramfs enables Linux boot by loading essential drivers before the root filesystem mounts. Explore early userspace initialization.

Linux kernel architecture explained. Learn syscalls, protection rings, user vs kernel space, and what happens when you run a command.

Calinski-Harabasz Index: The Variance Ratio Criterion/concepts/machine-learning/calinski-harabasz

How the Calinski-Harabasz index evaluates clustering quality by measuring the ratio of between-cluster to within-cluster variance — fast, intuitive, and ideal for k-selection with convex clusters.

Explore the inner workings of RAM through beautiful animations and interactive visualizations. Understand memory cells, addressing, and the memory hierarchy.

Python Bytecode Compilation/concepts/python/bytecode-compilation

Explore CPython bytecode compilation from source to .pyc files. Learn the dis module, PVM stack operations, and Python 3.11+ adaptive specialization.

PyTorch DataLoader Pipeline/concepts/pytorch/dataloader-pipeline

PyTorch DataLoader deep dive — Dataset, Sampler, Workers, Collate internals, num_workers throughput profiling, memory analysis, serialization costs, production patterns (LMDB, WebDataset), and bottleneck diagnosis.

High Bandwidth Memory (HBM)/concepts/gpu/hbm-memory

High Bandwidth Memory (HBM) architecture: 3D-stacked DRAM with TSV technology powering NVIDIA GPUs and AI accelerators with TB/s bandwidth.

GPU Memory Hierarchy & Optimization/concepts/gpu/memory-hierarchy

Master the GPU memory hierarchy from registers to global memory: coalescing patterns, bank conflicts, and optimization strategies for maximum performance.

Multi-GPU Communication: NVLink, PCIe, and NCCL/concepts/gpu/multi-gpu-communication

Compare NVLink vs PCIe bandwidth for multi-GPU training. Learn GPU topologies, NVSwitch, and choose between NCCL, Gloo, and MPI for distributed deep learning.

Slurm GPU Allocation for Distributed Training/concepts/hpc/slurm-gpu-allocation

Complete guide to GPU allocation on Slurm — --gres flags, CUDA_VISIBLE_DEVICES remapping, GPU topology and NVLink binding, MIG partitioning, production job scripts, and debugging common GPU errors.

Filesystems: The Digital DNA of Data Storage/concepts/linux/filesystems-overview

Explore Linux filesystems through interactive visuals. Learn VFS, compare ext4 vs Btrfs vs ZFS, and understand file operations.

Silhouette Score: Per-Point Clustering Evaluation/concepts/machine-learning/silhouette-score

How the silhouette score measures clustering quality for every individual point — comparing intra-cluster cohesion to nearest-cluster separation, with per-point diagnostics that work for arbitrary cluster shapes.

Python Memory Management/concepts/python/memory-management

Deep dive into CPython memory management: PyMalloc arenas, object pools, reference counting, and optimization techniques like __slots__ and generators.

Complete guide to C++ symbol resolution — how linkers match references to definitions, name mangling, strong vs weak symbols, ODR, template instantiation, linking order, and debugging undefined reference errors.

NVIDIA Unified Virtual Memory/concepts/gpu/unified-memory

NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.

Slurm Resource Management and Job Priority/concepts/hpc/slurm-resource-management

How Slurm decides which jobs run first — priority factors, fair-share scheduling, backfill, and monitoring commands (squeue, sinfo, sacct).

Filesystem Journaling: Write-Ahead Logging/concepts/linux/filesystem-journaling

Learn how filesystem journaling prevents data loss during crashes. Explore write-ahead logging and recovery in ext4 and XFS.

Understand Linux inodes - the metadata structures behind every file. Learn about hard links, soft links, and inode limits.

Davies-Bouldin Index: Worst-Case Cluster Similarity/concepts/machine-learning/davies-bouldin

How the Davies-Bouldin index evaluates clustering quality by finding each cluster's most similar neighbor — a pessimistic, worst-case metric that catches overlapping cluster pairs.

Global Interpreter Lock (GIL)/concepts/python/global-interpreter-lock

Understand CPython Global Interpreter Lock (GIL): thread switching, CPU vs I/O workloads, multiprocessing workarounds, and PEP 703 no-GIL future.

Pinned Memory and DMA Transfers in PyTorch/concepts/pytorch/pin-memory

Complete guide to PyTorch pin_memory — how DMA transfers work, when pinning helps vs hurts, NUMA effects, profiling with torch.profiler, num_workers interaction, and debugging slow data loading.

How C++ programs are loaded — ELF segments, the _start to main() chain, dynamic linking with PLT/GOT, ASLR, real readelf/strace/proc maps output, and startup debugging.

Page Migration & Fault Handling/concepts/gpu/page-migration

CUDA page migration and fault handling between CPU and GPU memory. Learn TLB management, DMA transfers, and memory optimization.

Flynn's Classification explained — SISD, SIMD, MISD, MIMD with interactive architecture explorer, SIMD evolution from MMX to AMX, branch divergence visualization, and workload-architecture throughput comparison.

Complete MPI guide — point-to-point and collective communication with real C and mpi4py code, deadlock simulation, performance benchmarking, communicator splitting, and debugging on HPC clusters.

OpenMP parallel programming: fork-join model, scheduling, data races, false sharing, NUMA thread affinity, and GPU offloading.

Slurm Accounting and Resource Tracking/concepts/hpc/slurm-accounting

How Slurm tracks resource consumption through account hierarchies, TRES billing, and resource limits — sacctmgr, sreport, and the association model explained.

Understand Copy-on-Write (CoW) in Btrfs and ZFS. Learn how CoW enables instant snapshots, atomic writes, and data integrity.

FUSE: Filesystem in Userspace Explained/concepts/linux/fuse-filesystem

Learn FUSE (Filesystem in Userspace) for building custom filesystems. Understand how NTFS-3G, SSHFS, and cloud storage work.

Python Object Model Internals/concepts/python/object-model

Learn how CPython implements PyObject, type objects, and the unified object model. Explore reference counting, memory layout, and Python internals.

HPC Performance Optimization: Scaling, Profiling, and Tuning/concepts/hpc/hpc-performance-optimization

Mastering HPC performance — Amdahl's Law, Gustafson's Law, strong vs weak scaling, roofline model, communication-computation overlap, load balancing, and profiling with Nsight and VTune.

How sched/backfill works — the algorithm that lets small jobs run in gaps while large jobs wait, why accurate time limits matter, and the key tuning parameters (bf_interval, bf_window, bf_max_job_test).

ext4: The Linux Workhorse Filesystem/concepts/linux/ext4-filesystem

Explore ext4, the default Linux filesystem with journaling, extents, and proven reliability. Learn how ext4 protects your data.

Filesystem Snapshots: Time Travel for Your Data/concepts/linux/filesystem-snapshots

How modern filesystems create instant snapshots. Explore Btrfs/ZFS snapshot mechanics, rollback operations, and backup strategies interactively.

Python Garbage Collection/concepts/python/garbage-collection

Understand CPython garbage collection: reference counting, generational GC for circular references, weak references, and gc module tuning strategies.

CPU Pipeline Architecture/concepts/computer-architecture/cpu-pipeline-detailed

Deep dive into CPU pipeline architecture covering 5-stage RISC pipelines, data hazards, control hazards, superscalar execution, and out-of-order processing.

Master Linux mount options like noatime and async for performance tuning and security hardening. Interactive guide to fstab configuration.

NTFS Filesystem: The Master File Table/concepts/linux/ntfs-filesystem

Understand how NTFS organizes files through the Master File Table (MFT), including the key distinction between resident and non-resident file storage.

Python Optimization Techniques/concepts/python/python-optimization

Python performance optimization guide: CPython peephole optimizer, lru_cache, profiling with cProfile, and Python 3.11+ adaptive bytecode specialization.

Contrastive Learning/concepts/embeddings/contrastive-learning

Master contrastive learning for vector embeddings: how InfoNCE loss and self-supervised techniques train models to create high-quality semantic representations.

Btrfs: Modern Copy-on-Write Filesystem/concepts/linux/btrfs-filesystem

Learn Btrfs with built-in snapshots, RAID, and compression. Explore copy-on-write, subvolumes, and self-healing on Linux.

Filesystem Data Integrity: Detecting Silent Corruption/concepts/linux/filesystem-integrity

Understand how modern filesystems use checksums to detect silent data corruption that traditional filesystems miss entirely.

__slots__ Optimization/concepts/python/slots-optimization

Master Python __slots__ for 40-50% memory reduction and faster attribute access. Learn CPython descriptor protocol, inheritance patterns, and best practices.

Cross-Lingual Alignment/concepts/embeddings/cross-lingual-alignment

Learn cross-lingual embedding alignment techniques like VecMap and MUSE for multilingual vector retrieval and zero-shot language transfer in search systems.

NVIDIA Device Files in /dev//concepts/gpu/nvidia-device-files

Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.

ZFS: The Ultimate Filesystem/concepts/linux/zfs-filesystem

Master ZFS filesystem with pooled storage, RAID-Z, snapshots, and checksums. Learn enterprise-grade data integrity on Linux.

Green Threads vs OS Threads: Concurrency Models/concepts/python/green-threads-vs-os-threads

Compare Python green threads vs OS threads. Learn asyncio coroutines, gevent, context switching costs, and when to use each concurrency model.

Domain Adaptation for Embeddings/concepts/embeddings/domain-adaptation

Domain adaptation for embeddings: transfer learning to fine-tune retrieval models across domains while preventing catastrophic forgetting.

XFS: High-Performance Parallel Filesystem/concepts/linux/xfs-filesystem

XFS filesystem internals: allocation groups, extent-based allocation, and delayed allocation for high-performance parallel I/O.

Python asyncio: Mastering Asynchronous Programming/concepts/python/asyncio-event-loop

Deep dive into Python's asyncio library, understanding event loops, coroutines, tasks, and async/await patterns with interactive visualizations.

Binary Embeddings for Fast Search/concepts/embeddings/binary-embeddings

Learn how binary embeddings use 1-bit quantization for ultra-compact vector representations, enabling billion-scale similarity search with 32x memory reduction.

FAT32 & exFAT: Universal Filesystems/concepts/linux/fat-filesystems

Learn FAT32 and exFAT filesystems for cross-platform USB drives and SD cards. Understand file size limits and compatibility.

Python Shared Memory/concepts/python/shared-memory

Master Python multiprocessing.shared_memory for zero-copy IPC. Learn synchronization, NumPy integration, and race condition prevention patterns.

Hybrid Retrieval Systems/concepts/embeddings/hybrid-retrieval-systems

Build hybrid retrieval systems combining BM25 sparse search with dense vector embeddings using reciprocal rank fusion for superior semantic search performance.

RAID storage visualized: RAID 0, 1, 5, 6, and 10 levels explained. Learn how they work, when to use them, and disk failure recovery.

Memory Controllers: The Brain Behind RAM Management/concepts/memory/memory-controllers

Learn how memory controllers manage CPU-RAM data flow. Interactive demos of channels, ranks, banks, and command scheduling for optimal bandwidth.

BM25 Algorithm for Text Retrieval/concepts/embeddings/bm25-algorithm

Master the BM25 algorithm, the probabilistic ranking function powering Elasticsearch and Lucene for keyword-based document retrieval and search systems.

Linux Process Management: Fork, Exec, and Beyond/concepts/linux/process-management

Master Linux process management through interactive visualizations. Understand process lifecycle, fork/exec operations, zombies, orphans, and CPU scheduling.

Distributed Parallelism in Deep Learning/concepts/gpu/distributed-parallelism

GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.

Explore Linux memory management through interactive visualizations. Understand virtual memory, page tables, TLB, swapping, and memory allocation.

Linux system calls visualized: how user programs communicate with the kernel, protection rings, context switching, and syscall performance.

Master the Linux networking stack through interactive visualizations. Understand TCP/IP layers, sockets, iptables, routing, and network namespaces.

Linux Boot Process: From Power-On to Login/concepts/linux/boot-process

Visualize the complete Linux boot sequence from BIOS/UEFI to login. Learn how GRUB, kernel, and systemd work together with interactive visualizations.

Linux Init Systems: From SysV to systemd/concepts/linux/init-systems

Compare Linux init systems through interactive visualizations. Understand the evolution from SysV Init to systemd, service management, and boot orchestration.

Master Linux kernel modules through interactive visualizations. Learn how to load, unload, develop, and debug kernel modules that extend Linux functionality.

Master Linux namespaces — the kernel mechanism that makes containers possible. Learn how mount, PID, network, and user namespaces create isolated environments, with interactive demos.

Compare Wayland vs X11 display servers on Linux. Learn about architecture, performance, security, and modern graphics stack.

Master cgroups to limit CPU, memory, and I/O for process groups. Understand cgroups v1 vs v2, the hierarchical structure, and how containers use them.

Discover how containers work by combining namespaces, cgroups, and OverlayFS. Build a mental model of Docker internals through interactive visualizations.

Understand how containerized processes access GPU hardware through device files, bind mounts, and the NVIDIA container runtime. Learn the kernel driver vs user-space library distinction.

Representation Collapse in Self-Supervised Learning/concepts/deep-learning/collapse-risk

Understanding complete, dimensional, and cluster collapse — the failure modes that every self-supervised method must prevent. Learn why collapse happens and how contrastive, asymmetric, regularization, and masking approaches solve it.

Learn nvidia-modeset for display configuration on Linux. Understand kernel mode-setting, DRM integration, and GPU drivers.

CUDA Multi-Process Service (MPS)/concepts/gpu/cuda-mps

Learn CUDA Multi-Process Service (MPS) for GPU sharing. Enable concurrent kernel execution from multiple processes and maximize GPU utilization.

Understanding TCP/IP Protocol Stack/concepts/networking/tcp-ip

Explore the TCP/IP protocol stack, packet encapsulation, and how data travels through network layers from application to physical transmission.

CPU Pipelines & Branch Prediction in Processors/concepts/computer-architecture/cpu-pipelines

Explore CPU pipeline stages, instruction-level parallelism, pipeline hazards, and branch prediction through interactive visualizations.

Hazard Detection: Pipeline Dependencies and Solutions/concepts/computer-architecture/hazard-detection

Master pipeline hazards through interactive visualizations of data dependencies, control hazards, structural conflicts, and advanced detection mechanisms.

Convolution Operation: The Foundation of CNNs/concepts/deep-learning/convolution-operation

Interactive guide to convolution in CNNs: visualize sliding windows, kernels, stride, padding, and feature detection with step-by-step demos.

Dilated Convolutions: Expanding Receptive Fields Efficiently/concepts/deep-learning/dilated-convolutions

Understand dilated (atrous) convolutions: how dilation rates expand receptive fields exponentially without extra parameters and how to avoid gridding artifacts.

Feature Pyramid Networks/concepts/deep-learning/feature-pyramid-networks

Learn how Feature Pyramid Networks build multi-scale feature representations through top-down pathways and lateral connections for robust object detection.

Receptive Field in CNNs/concepts/deep-learning/receptive-field

Understand receptive fields in CNNs: how convolutional layers expand their field of view and the gap between theoretical and effective receptive fields.

VAE Latent Space: Understanding Variational Autoencoders/concepts/deep-learning/vae-latent-space

Explore VAE latent space in deep learning. Learn variational autoencoder encoding, decoding, interpolation, and the reparameterization trick.

Complete C++ thread safety guide — race conditions with step-through simulation, mutexes, atomics, condition variables, deadlock detection, memory ordering, and Thread Sanitizer walkthrough.

Cross-Entropy Loss for Classification/concepts/machine-learning/cross-entropy-loss

Understand cross-entropy loss for classification: interactive demos of binary and multi-class CE, the -log(p) curve, softmax gradients, and focal loss.

Master virtual memory and TLB address translation with interactive demos. Learn page tables, page faults, and memory management optimization.

CPU Cache Lines: The Unit of Memory Transfer/concepts/memory/cpu-cache-lines

Learn how CPU cache lines transfer data between memory and cache. Understand spatial locality and optimize memory access patterns for better performance.

Memory Access Patterns: Sequential vs Strided/concepts/memory/memory-access-patterns

Master sequential vs strided memory access patterns. Learn how cache efficiency and hardware prefetching affect application performance.

Memory Interleaving: Parallel Memory Access/concepts/memory/memory-interleaving

Discover how memory interleaving distributes addresses across banks for parallel access. Boost memory bandwidth in DDR5 and GPU systems.

NUMA Architecture: Non-Uniform Memory Access/concepts/memory/numa-architecture

Explore NUMA architecture and memory locality in multi-socket systems. Understand local vs remote memory access latency and optimization strategies.

Understanding NVIDIA Kubernetes GPU Operator/concepts/gpu/kubernetes-operator

Automate NVIDIA GPU management in Kubernetes with the GPU Operator. Deploy drivers, device plugins, and monitoring as DaemonSets.

Understanding CUDA Contexts/concepts/gpu/cuda-context

Explore the concept of CUDA contexts, their role in managing GPU resources, and how they enable parallel execution across multiple CPU threads.

CLS Token in Vision Transformers/concepts/attention/cls-token

Learn how the CLS token acts as a global information aggregator in Vision Transformers, enabling whole-image classification through attention mechanisms.

Hierarchical Attention in Vision Transformers/concepts/attention/hierarchical-attention

Explore how hierarchical attention enables Vision Transformers (ViT) to capture both local detail and global context by computing attention at progressively coarser spatial scales.

Multi-Head Attention in Vision Transformers/concepts/attention/multihead-attention

Explore how multi-head attention enables Vision Transformers (ViT) to attend to multiple representation subspaces in parallel, with each head capturing different relationships between image patches.

Positional Embeddings in Vision Transformers/concepts/attention/positional-embeddings-vit

Explore how positional embeddings enable Vision Transformers (ViT) to process sequential data by encoding relative positions.

Interactive Look: Self-Attention in Vision Transformers/concepts/attention/self-attention-vit

Explore how self-attention enables Vision Transformers (ViT) to understand images by capturing global context, with CNN comparison.

Transparent Huge Pages (THP): Reducing TLB Pressure/concepts/memory/transparent-huge-pages

Learn how Transparent Huge Pages (THP) reduces TLB misses by promoting 4KB to 2MB pages. Understand performance benefits and memory bloat tradeoffs.

ALiBi: Attention with Linear Biases/concepts/attention/alibi

Learn ALiBi, the position encoding method that adds linear biases to attention scores for exceptional length extrapolation in transformers.

MHA vs GQA vs MQA: Choosing the Right Attention/concepts/attention/attention-comparison

Compare Multi-Head, Grouped-Query, and Multi-Query Attention mechanisms to understand their trade-offs and choose the optimal approach for your use case.

Attention Sinks: Stable Streaming LLMs/concepts/attention/attention-sinks

Learn about attention sinks, where LLMs concentrate attention on initial tokens, and how preserving them enables streaming inference.

Cross-Attention: Bridging Different Modalities/concepts/attention/cross-attention

Understand cross-attention, the mechanism that enables transformers to align and fuse information from different sources, sequences, or modalities.

Grouped-Query Attention (GQA)/concepts/attention/grouped-query-attention

Learn how Grouped-Query Attention (GQA) balances Multi-Head quality with Multi-Query efficiency for faster LLM inference.

Linear Attention Approximations/concepts/attention/linear-attention-approximations

Explore linear complexity attention mechanisms including Performer, Linformer, and other efficient transformers that scale to very long sequences.

Masked and Causal Attention/concepts/attention/masked-attention

Learn how masked attention enables autoregressive generation and prevents information leakage in transformers and language models.

Multi-Query Attention (MQA)/concepts/attention/multi-query-attention

Learn Multi-Query Attention (MQA), the optimization that shares keys and values across attention heads for massive memory savings.

Rotary Position Embeddings (RoPE)/concepts/attention/rotary-position-embeddings

Learn Rotary Position Embeddings (RoPE), the elegant position encoding using rotation matrices, powering LLaMA, Mistral, and modern LLMs.

Scaled Dot-Product Attention/concepts/attention/scaled-dot-product

Master scaled dot-product attention, the fundamental transformer building block. Learn why scaling is crucial for stable training.

Sliding Window Attention/concepts/attention/sliding-window-attention

Sliding Window Attention for long sequences: local context windows enable O(n) complexity, used in Mistral and Longformer models.

Sparse Attention Patterns/concepts/attention/sparse-attention-patterns

Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.

SoA vs AoS: Data Layout Optimization/concepts/computer-architecture/soa-vs-aos

Master Structure of Arrays (SoA) vs Array of Structures (AoS) data layouts for optimal cache efficiency, SIMD vectorization, and GPU memory coalescing.

Contrastive Loss for Representation Learning/concepts/deep-learning/contrastive-loss

Understand contrastive loss for representation learning: interactive demos of InfoNCE, triplet loss, and embedding space clustering with temperature tuning.

Dropout Regularization/concepts/deep-learning/dropout

Understand dropout regularization: how randomly silencing neurons prevents overfitting, the inverted dropout trick, and when to use each dropout variant.

Focal Loss: Focusing on Hard Examples/concepts/deep-learning/focal-loss

Learn focal loss for deep learning: down-weight easy examples, focus on hard ones. Interactive demos of gamma, alpha balancing, and RetinaNet.

He/Kaiming Initialization/concepts/deep-learning/he-initialization

Learn He (Kaiming) initialization for ReLU networks: why ReLU needs special weight initialization, variance flow, and dead neurons explained.

KL Divergence in Machine Learning/concepts/deep-learning/kl-divergence

Learn KL divergence for machine learning: measure distribution differences in VAEs, knowledge distillation, and variational inference.

Xavier/Glorot Initialization/concepts/deep-learning/xavier-initialization

Learn Xavier (Glorot) initialization: how it balances forward signals and backward gradients to enable stable deep network training with tanh and sigmoid.

MSE and MAE Loss Functions/concepts/machine-learning/mse-mae

Interactive guide to MSE vs MAE for regression: explore outlier sensitivity, gradient behavior, and Huber loss with visualizations.

Understanding NVIDIA Persistence Daemon/concepts/gpu/nvidia-persistence-daemon

Eliminating GPU initialization latency through nvidia-persistenced - a userspace daemon that maintains GPU driver state for optimal startup performance.

ANN Algorithms Comparison/concepts/embeddings/ann-comparison

Compare all approximate nearest neighbor algorithms side-by-side: HNSW, IVF-PQ, LSH, Annoy, and ScaNN. Find the best approach for your use case.

HNSW: Hierarchical Navigable Small World/concepts/embeddings/hnsw-search

Interactive visualization of HNSW - the graph-based algorithm that powers modern vector search with logarithmic complexity.

Vector Index Structures/concepts/embeddings/index-structures

Explore the fundamental data structures powering vector databases: trees, graphs, hash tables, and hybrid approaches for efficient similarity search.

Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.

LSH: Locality Sensitive Hashing/concepts/embeddings/lsh-search

Explore how LSH uses probabilistic hash functions to find similar vectors in sub-linear time, perfect for streaming and high-dimensional data.

Vector Quantization Techniques/concepts/embeddings/vector-quantization

Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.

Long Polling: The Patient Connection/concepts/networking/long-polling

Learn HTTP long polling - a server-side technique that holds connections open until data arrives. Achieve near real-time updates with standard protocols.

Short Polling: The Impatient Client Pattern/concepts/networking/short-polling

Learn short polling in networking - a simple HTTP pattern for periodic data fetching. See why 70-90% of requests waste bandwidth and when to use alternatives.

Master WebSocket protocol for real-time bidirectional communication over TCP. Learn handshakes, frames, and building low-latency web applications.

Adaptive Tiling: Efficient Visual Token Generation/concepts/deep-learning/adaptive-tiling

Learn adaptive tiling in vision transformers: dynamically partition images based on visual complexity to reduce token counts while preserving detail.

Emergent Abilities in Large Language Models/concepts/deep-learning/emergent-abilities

Explore emergent abilities in large language models: sudden capabilities at scale thresholds, phase transitions, and the mirage debate.

Prompt Engineering for LLMs/concepts/deep-learning/prompt-engineering

Master prompt engineering for large language models: from basic composition to Chain-of-Thought, few-shot, and advanced techniques.

Prompt Influence Flow Through Transformer Layers/concepts/deep-learning/prompt-influence-flow

Deep dive into how different prompt components influence model behavior across transformer layers, from surface patterns to abstract reasoning.

Neural Scaling Laws Explained/concepts/deep-learning/scaling-laws

Explore neural scaling laws in deep learning: power law relationships between model size, data, and compute that predict AI performance.

Visual Complexity Analysis: Smart Image Processing/concepts/deep-learning/visual-complexity-analysis

Learn visual complexity analysis in deep learning - how neural networks measure entropy, edges, and saliency for adaptive image processing.

Cross-Encoder vs Bi-Encoder/concepts/embeddings/cross-encoder-vs-bi-encoder

Understand the fundamental differences between independent and joint encoding architectures for neural retrieval systems.

Dense Embeddings Space Explorer/concepts/embeddings/dense-embeddings

Interactive visualization of high-dimensional vector spaces, word relationships, and semantic arithmetic operations.

Matryoshka Embeddings/concepts/embeddings/matryoshka-embeddings

Matryoshka embeddings: nested representations that enable dimension reduction by simple truncation, with no model retraining, for flexible retrieval.

Multi-Vector Late Interaction/concepts/embeddings/multi-vector-late-interaction

Explore ColBERT and other multi-vector retrieval models that use fine-grained token-level matching for superior search quality.

Quantization Effects Simulator/concepts/embeddings/quantization-effects

Embedding quantization simulator: explore memory-accuracy trade-offs from float32 to int8 and binary representations for retrieval.

Sparse vs Dense Embeddings/concepts/embeddings/sparse-vs-dense

Compare lexical (BM25/TF-IDF) and semantic (BERT) retrieval approaches, understanding their trade-offs and hybrid strategies.

Context Windows: The Memory Limits of LLMs/concepts/llms/context-windows

Interactive visualization of LLM context windows - sliding windows, expanding contexts, and attention patterns that define model memory limits.

Flash Attention: IO-Aware Exact Attention/concepts/llms/flash-attention

Interactive Flash Attention visualization - the IO-aware algorithm achieving memory-efficient exact attention through tiling and kernel fusion.

Interactive KV cache visualization - how key-value caching in LLM transformers enables fast text generation without quadratic recomputation.

Tokenization: Converting Text to Numbers/concepts/llms/tokenization

Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.

The Vision-Language Alignment Problem/concepts/multimodal/alignment-problem

How vision-language models align visual and text representations using contrastive learning, cross-modal attention, and CLIP-style training.

The Modality Gap in Multimodal AI/concepts/multimodal/modality-gap

The modality gap in CLIP and vision-language models: why image and text embeddings occupy separate regions despite contrastive training.

Multimodal Scaling Laws/concepts/multimodal/scaling-laws

Discover how multimodal vision-language models like CLIP, ALIGN, and LLaVA scale with data, parameters, and compute following Chinchilla-style power laws.

Vision-Language Adapters: Efficient Fine-tuning/concepts/multimodal/vision-language-adapters

Master LoRA, bottleneck adapters, and prefix tuning for parameter-efficient fine-tuning of vision-language models like LLaVA with minimal compute and memory.

Client-Server Communication: Polling vs WebSockets/concepts/networking/client-server-communication

Learn client-server communication patterns including short polling, long polling, and WebSockets. Compare HTTP protocols for real-time web applications.

Gradient Flow in Deep Networks/concepts/deep-learning/gradient-flow

Learn how gradients propagate through deep neural networks during backpropagation. Understand vanishing and exploding gradient problems.

C++ AST & Parsing Explained/concepts/cpp/ast-parsing

Explore how C++ code is parsed into an Abstract Syntax Tree (AST). Learn lexical analysis, tokenization, and syntax parsing for systems programming.

C++ Compilation Overview/concepts/cpp/compilation

Understand the complete C++ compilation pipeline from source code to object files. Learn preprocessing, parsing, code generation, and optimization stages.

C++ Dynamic Linking at Runtime/concepts/cpp/dynamic-linking

Deep dive into dynamic linking — GOT/PLT lazy resolution, shared library creation, SONAME versioning, RPATH/RUNPATH, dlopen plugin systems, LD_PRELOAD, and debugging with LD_DEBUG.

C++ Linking Overview/concepts/cpp/linking

How C++ object files are linked into executables. Learn symbol resolution, static vs dynamic linking, and linker optimization.

Memory Management & RAII in C++/concepts/cpp/memory-raii

Learn Resource Acquisition Is Initialization (RAII) - the cornerstone of C++ memory management. Understand automatic resource cleanup and exception safety.

Modern C++ Features (C++11 and Beyond)/concepts/cpp/modern-cpp-features

Explore modern C++ features including auto, lambdas, ranges, and coroutines. Learn how C++11/14/17/20 transformed the language.

Object-Oriented Programming in C++/concepts/cpp/oop-inheritance

Master C++ OOP concepts including inheritance, polymorphism, virtual functions, and modern object-oriented design principles with interactive examples.

C++ Compiler Optimization/concepts/cpp/optimization

C++ compiler optimization deep dive — optimization levels compared with assembly output, auto-vectorization, LTO, PGO, compiler flags reference, and dangerous flags explained.

Pointers & References in C++/concepts/cpp/pointers-references

Master C++ pointers and references through interactive visualizations. Learn memory addressing, dereferencing, smart pointers, and avoid common pitfalls.

C++ Preprocessor Directives/concepts/cpp/preprocessor

C++ preprocessor visualized: macros, header guards, conditional compilation, and #include directives explained interactively.

Smart Pointers in Modern C++/concepts/cpp/smart-pointers

Master C++11 smart pointers through interactive examples. Learn unique_ptr, shared_ptr, and weak_ptr with reference counting visualizations.

Templates & STL in C++/concepts/cpp/templates-stl

Master C++ templates and the Standard Template Library. Learn generic programming, template metaprogramming, and STL containers and algorithms.

NCCL: High-Performance Multi-GPU Communication/concepts/gpu/nccl-communication

Master NVIDIA NCCL for multi-GPU deep learning. Learn AllReduce, ring algorithms, and GPUDirect communication for efficient distributed training on CUDA.

DataParallel vs DistributedDataParallel/concepts/pytorch/data-parallel

Compare PyTorch DataParallel vs DistributedDataParallel for multi-GPU training. Learn GIL limitations, NCCL AllReduce, and DDP best practices.

Understanding num_workers/concepts/pytorch/num-workers

Deep dive into PyTorch DataLoader num_workers parameter: how parallel workers prefetch data, optimal configuration, and common pitfalls.

ASFF: Adaptive Spatial Feature Fusion/concepts/computer-vision/asff

Learning where to fuse multi-scale features with per-pixel, per-level fusion weights. ASFF challenges FPN's uniform fusion assumption.

RoI Pooling, RoI Align & Deformable RoI Pooling/concepts/computer-vision/roi-pooling

Understanding region-based feature extraction for object detection, from quantized pooling to sub-pixel alignment and adaptive sampling.

Anchor-Based vs Anchor-Free Object Detection/concepts/computer-vision/anchor-based-vs-anchor-free

Compare anchor-based vs anchor-free object detection: Faster R-CNN and RetinaNet anchors vs FCOS and CenterNet point-based methods.

Understanding how neural architecture search discovers optimal feature pyramid architectures that outperform hand-designed alternatives.

Modern Object Detection: DETR and Transformers/concepts/computer-vision/modern-object-detection

Understanding end-to-end object detection with transformers, from DETR's object queries to bipartite matching and attention-based localization.

NMS & Soft-NMS: Removing Duplicate Detections/concepts/computer-vision/nms-soft-nms

Understanding Non-Maximum Suppression algorithms for object detection post-processing, from greedy NMS to soft variants.

NAdam: Nesterov-Accelerated Adam/concepts/deep-learning/nadam

Understand the NAdam optimizer, which fuses Adam's adaptive learning rates with Nesterov look-ahead momentum for faster, smoother convergence in deep learning.

Visual Complexity Analysis for Token Allocation/concepts/computer-vision/visual-complexity-analysis

Learn how visual complexity analysis optimizes vision transformer token allocation using edge detection, FFT, and entropy metrics.

NVIDIA Tensor Cores explained: mixed-precision matrix operations delivering 10× speedups for AI training and inference on CUDA GPUs.

Layer Normalization for Transformers/concepts/deep-learning/layer-normalization

Learn layer normalization for transformers and sequence models: how normalizing across features enables batch-independent training.

Internal Covariate Shift/concepts/deep-learning/internal-covariate-shift

Understand internal covariate shift: why layer input distributions change during training, how it slows convergence, and how batch norm fixes it.

Batch Normalization in Deep Learning/concepts/deep-learning/batch-normalization

Learn batch normalization in deep learning: how normalizing layer inputs accelerates training, improves gradient flow, and acts as regularization.

Skip Connections in Neural Networks/concepts/deep-learning/skip-connections

Learn how skip connections and residual learning enable training of very deep neural networks. Understand the ResNet revolution with interactive visualizations.

CPU Performance & Optimization/concepts/computer-architecture/cpu-optimization

CPU performance optimization: memory hierarchy, cache blocking, SIMD vectorization, and profiling tools for modern processors.

C++ Virtual Tables & Inheritance/concepts/cpp/virtual-tables-inheritance

C++ virtual tables (vtables) explained. Learn virtual dispatch, single/multiple inheritance, RTTI, and object memory layout visually.

Graph Attention Networks (GAT)/concepts/graph/graph-attention-networks

Adaptive attention-based aggregation for graph neural networks - multi-head attention, learned weights, and interpretable graph learning.

Graph Centrality & Metrics/concepts/graph/graph-centrality

Understanding node importance through centrality measures, shortest paths, hop distances, clustering coefficients, and fundamental graph metrics.

Graph Convolutional Networks (GCN)/concepts/graph/graph-convolutional-networks

Learn Graph Convolutional Networks (GCN) with spectral theory, message passing, and node classification for geometric deep learning.

Graph Embeddings and Node2Vec/concepts/graph/graph-embeddings

Learning low-dimensional vector representations of graphs through random walks, DeepWalk, Node2Vec, and skip-gram models.

Graph Pooling Methods/concepts/graph/graph-pooling

Hierarchical graph coarsening techniques - TopK, SAGPool, DiffPool, and readout operations for graph-level representations.

Mixture of Experts (MoE)/concepts/llms/mixture-of-experts

Understanding sparse mixture of experts models - architecture, routing mechanisms, load balancing, and efficient scaling strategies for large language models.

GPU Streaming Multiprocessor (SM)/concepts/gpu/shared-multiprocessor

Deep dive into the fundamental processing unit of modern GPUs - the Streaming Multiprocessor architecture, execution model, and memory hierarchy.

Uses/uses

Tools, software, and hardware I use

Resume/resume

My professional experience and qualifications

Bookmarks/bookmarks

A curated collection of articles and resources I find valuable

Consulting/consulting

Services and consulting offerings

Sitemap/sitemap

Visual representation of the site structure
