Sitemap
A visual representation of the site structure to help you navigate through the content.
Site Structure
Main landing page with introduction and recent articles
Learn more about me, my background, and expertise
My talks, presentations, and speaking engagements
Collection of articles I've written on various topics
Article content
Research papers and publications
Paper content
Interactive explanations of machine learning concepts
Learn how initramfs enables Linux boot by loading essential drivers before the root filesystem mounts. Explore early userspace initialization.
Linux kernel architecture explained. Learn syscalls, protection rings, user vs kernel space, and what happens when you run a command.
Explore the inner workings of RAM through beautiful animations and interactive visualizations. Understand memory cells, addressing, and the memory hierarchy.
Explore CPython bytecode compilation from source to .pyc files. Learn the dis module, PVM stack operations, and Python 3.11+ adaptive specialization.
High Bandwidth Memory (HBM) architecture: 3D-stacked DRAM with TSV technology powering NVIDIA GPUs and AI accelerators with TB/s bandwidth.
Master the GPU memory hierarchy from registers to global memory. Understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance.
Compare NVLink vs PCIe bandwidth for multi-GPU training. Learn GPU topologies, NVSwitch, and choose between NCCL, Gloo, and MPI for distributed deep learning.
Explore Linux filesystems through interactive visuals. Learn VFS, compare ext4 vs Btrfs vs ZFS, and understand file operations.
Deep dive into CPython memory management: PyMalloc arenas, object pools, reference counting, and optimization techniques like __slots__ and generators.
NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.
Learn how filesystem journaling prevents data loss during crashes. Explore write-ahead logging and recovery in ext4 and XFS.
Understand Linux inodes - the metadata structures behind every file. Learn about hard links, soft links, and inode limits.
Understand the CPython Global Interpreter Lock (GIL): thread switching, CPU vs I/O workloads, multiprocessing workarounds, and the PEP 703 no-GIL future.
CUDA page migration and fault handling between CPU and GPU memory. Learn TLB management, DMA transfers, and memory optimization.
Understand Copy-on-Write (CoW) in Btrfs and ZFS. Learn how CoW enables instant snapshots, atomic writes, and data integrity.
Learn FUSE (Filesystem in Userspace) for building custom filesystems. Understand how NTFS-3G, SSHFS, and cloud storage work.
Learn how CPython implements PyObject, type objects, and the unified object model. Explore reference counting, memory layout, and Python internals.
Explore ext4, the default Linux filesystem with journaling, extents, and proven reliability. Learn how ext4 protects your data.
How modern filesystems create instant snapshots. Explore Btrfs/ZFS snapshot mechanics, rollback operations, and backup strategies interactively.
Understand CPython garbage collection: reference counting, generational GC for circular references, weak references, and gc module tuning strategies.
Deep dive into CPU pipeline architecture covering 5-stage RISC pipelines, data hazards, control hazards, superscalar execution, and out-of-order processing.
Master Linux mount options like noatime and async for performance tuning and security hardening. Interactive guide to fstab configuration.
Understand how NTFS organizes files through the Master File Table (MFT), including the key distinction between resident and non-resident file storage.
Python performance optimization guide: CPython peephole optimizer, lru_cache, profiling with cProfile, and Python 3.11+ adaptive bytecode specialization.
Master contrastive learning for vector embeddings: how InfoNCE loss and self-supervised techniques train models to create high-quality semantic representations.
Learn Btrfs with built-in snapshots, RAID, and compression. Explore copy-on-write, subvolumes, and self-healing on Linux.
Understand how modern filesystems use checksums to detect silent data corruption that traditional filesystems miss entirely.
Master Python __slots__ for 40-50% memory reduction and faster attribute access. Learn CPython descriptor protocol, inheritance patterns, and best practices.
Learn cross-lingual embedding alignment techniques like VecMap and MUSE for multilingual vector retrieval and zero-shot language transfer in search systems.
Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.
Master ZFS filesystem with pooled storage, RAID-Z, snapshots, and checksums. Learn enterprise-grade data integrity on Linux.
Compare Python green threads vs OS threads. Learn asyncio coroutines, gevent, context switching costs, and when to use each concurrency model.
Domain adaptation for embeddings: transfer learning to fine-tune retrieval models across domains while preventing catastrophic forgetting.
XFS filesystem internals: allocation groups, extent-based allocation, and delayed allocation for high-performance parallel I/O.
Deep dive into Python's asyncio library, understanding event loops, coroutines, tasks, and async/await patterns with interactive visualizations.
Learn how binary embeddings use 1-bit quantization for ultra-compact vector representations, enabling billion-scale similarity search with 32x memory reduction.
Learn FAT32 and exFAT filesystems for cross-platform USB drives and SD cards. Understand file size limits and compatibility.
Master Python multiprocessing.shared_memory for zero-copy IPC. Learn synchronization, NumPy integration, and race condition prevention patterns.
Build hybrid retrieval systems combining BM25 sparse search with dense vector embeddings using reciprocal rank fusion for superior semantic search performance.
RAID storage visualized: RAID 0, 1, 5, 6, and 10 levels explained. Learn how they work, when to use them, and disk failure recovery.
Learn how memory controllers manage CPU-RAM data flow. Interactive demos of channels, ranks, banks, and command scheduling for optimal bandwidth.
Master the BM25 algorithm, the probabilistic ranking function powering Elasticsearch and Lucene for keyword-based document retrieval and search systems.
Master Linux process management through interactive visualizations. Understand process lifecycle, fork/exec operations, zombies, orphans, and CPU scheduling.
GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.
Explore Linux memory management through interactive visualizations. Understand virtual memory, page tables, TLB, swapping, and memory allocation.
Linux system calls visualized: how user programs communicate with the kernel, protection rings, context switching, and syscall performance.
Master the Linux networking stack through interactive visualizations. Understand TCP/IP layers, sockets, iptables, routing, and network namespaces.
Visualize the complete Linux boot sequence from BIOS/UEFI to login. Learn how GRUB, kernel, and systemd work together with interactive visualizations.
Compare Linux init systems through interactive visualizations. Understand the evolution from SysV Init to systemd, service management, and boot orchestration.
Master Linux kernel modules through interactive visualizations. Learn how to load, unload, develop, and debug kernel modules that extend Linux functionality.
Master Linux namespaces for container isolation. Learn PID, network, mount, and user namespaces with interactive demos.
Compare Wayland vs X11 display servers on Linux. Learn about architecture, performance, security, and modern graphics stack.
Master cgroups to limit CPU, memory, and I/O for process groups. Understand cgroups v1 vs v2, the hierarchical structure, and how containers use them.
Discover how containers work by combining namespaces, cgroups, and OverlayFS. Build a mental model of Docker internals through interactive visualizations.
Learn nvidia-modeset for display configuration on Linux. Understand kernel mode-setting, DRM integration, and GPU drivers.
Learn CUDA Multi-Process Service (MPS) for GPU sharing. Enable concurrent kernel execution from multiple processes and maximize GPU utilization.
Explore the TCP/IP protocol stack, packet encapsulation, and how data travels through network layers from application to physical transmission.
Explore Flynn's Classification of computer architectures through interactive visualizations of SISD, SIMD, MISD, and MIMD systems.
Explore CPU pipeline stages, instruction-level parallelism, pipeline hazards, and branch prediction through interactive visualizations.
Master pipeline hazards through interactive visualizations of data dependencies, control hazards, structural conflicts, and advanced detection mechanisms.
Master thread safety concepts through interactive visualizations of race conditions, mutexes, atomic operations, and deadlock scenarios.
Interactive guide to convolution in CNNs: visualize sliding windows, kernels, stride, padding, and feature detection with step-by-step demos.
Understand cross-entropy loss for classification: interactive demos of binary and multi-class CE, the -log(p) curve, softmax gradients, and focal loss.
Understand dilated (atrous) convolutions: how dilation rates expand receptive fields exponentially without extra parameters and how to avoid gridding artifacts.
Learn how Feature Pyramid Networks build multi-scale feature representations through top-down pathways and lateral connections for robust object detection.
Understand receptive fields in CNNs — how convolutional layers expand their field of view, the gap between theoretical and effective receptive fields, and strategies for controlling RF growth.
Explore VAE latent space in deep learning. Learn variational autoencoder encoding, decoding, interpolation, and the reparameterization trick.
Master virtual memory and TLB address translation with interactive demos. Learn page tables, page faults, and memory management optimization.
Learn how CPU cache lines transfer data between memory and cache. Understand spatial locality and optimize memory access patterns for better performance.
Master sequential vs strided memory access patterns. Learn how cache efficiency and hardware prefetching affect application performance.
Discover how memory interleaving distributes addresses across banks for parallel access. Boost memory bandwidth in DDR5 and GPU systems.
Explore NUMA architecture and memory locality in multi-socket systems. Understand local vs remote memory access latency and optimization strategies.
Automate NVIDIA GPU management in Kubernetes with the GPU Operator. Deploy drivers, device plugins, and monitoring as DaemonSets.
Explore the concept of CUDA contexts, their role in managing GPU resources, and how they enable parallel execution across multiple CPU threads.
Learn how the CLS token acts as a global information aggregator in Vision Transformers, enabling whole-image classification through attention mechanisms.
Explore how hierarchical attention enables Vision Transformers (ViT) to build multi-scale image representations by applying attention at progressively coarser resolutions.
Explore how multi-head attention enables Vision Transformers (ViT) to capture different relationships between image patches in parallel attention heads.
Explore how positional embeddings enable Vision Transformers (ViT) to process sequential data by encoding relative positions.
Explore how self-attention enables Vision Transformers (ViT) to understand images by capturing global context, with CNN comparison.
Learn how Transparent Huge Pages (THP) reduce TLB misses by promoting 4KB pages to 2MB pages. Understand performance benefits and memory bloat tradeoffs.
Learn ALiBi, the position encoding method that adds linear biases to attention scores for exceptional length extrapolation in transformers.
Compare Multi-Head, Grouped-Query, and Multi-Query Attention mechanisms to understand their trade-offs and choose the optimal approach for your use case.
Learn about attention sinks, where LLMs concentrate attention on initial tokens, and how preserving them enables streaming inference.
Understand cross-attention, the mechanism that enables transformers to align and fuse information from different sources, sequences, or modalities.
Learn how Grouped-Query Attention (GQA) balances Multi-Head quality with Multi-Query efficiency for faster LLM inference.
Explore linear-complexity attention mechanisms, including Performer, Linformer, and other efficient transformers that scale to very long sequences.
Learn how masked attention enables autoregressive generation and prevents information leakage in transformers and language models.
Learn Multi-Query Attention (MQA), the optimization that shares keys and values across attention heads for massive memory savings.
Learn Rotary Position Embeddings (RoPE), the elegant position encoding using rotation matrices, powering LLaMA, Mistral, and modern LLMs.
Master scaled dot-product attention, the fundamental transformer building block. Learn why scaling is crucial for stable training.
Sliding Window Attention for long sequences: local context windows enable O(n) complexity, used in Mistral and Longformer models.
Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.
Master Structure of Arrays (SoA) vs Array of Structures (AoS) data layouts for optimal cache efficiency, SIMD vectorization, and GPU memory coalescing.
Understand contrastive loss for representation learning: interactive demos of InfoNCE, triplet loss, and embedding space clustering with temperature tuning.
Understand dropout regularization: how randomly silencing neurons prevents overfitting, the inverted dropout trick, and when to use each dropout variant.
Learn focal loss for deep learning: down-weight easy examples, focus on hard ones. Interactive demos of gamma, alpha balancing, and RetinaNet.
Learn He (Kaiming) initialization for ReLU neural networks: understand why ReLU needs special weight initialization, visualize variance flow, and see dead neurons in action.
Learn KL divergence for machine learning: measure distribution differences in VAEs, knowledge distillation, and variational inference with interactive visualizations.
Interactive guide to MSE vs MAE for regression: explore outlier sensitivity, gradient behavior, and Huber loss with visualizations.
Learn Xavier (Glorot) initialization: how it balances forward signals and backward gradients to enable stable deep network training with tanh and sigmoid.
Eliminating GPU initialization latency through nvidia-persistenced - a userspace daemon that maintains GPU driver state for optimal startup performance.
Compare all approximate nearest neighbor algorithms side-by-side: HNSW, IVF-PQ, LSH, Annoy, and ScaNN. Find the best approach for your use case.
Interactive visualization of HNSW - the graph-based algorithm that powers modern vector search with logarithmic complexity.
Explore the fundamental data structures powering vector databases: trees, graphs, hash tables, and hybrid approaches for efficient similarity search.
Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.
Explore how LSH uses probabilistic hash functions to find similar vectors in sub-linear time, perfect for streaming and high-dimensional data.
Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.
Learn HTTP long polling - a server-side technique that holds connections open until data arrives. Achieve near real-time updates with standard protocols.
Learn short polling in networking - a simple HTTP pattern for periodic data fetching. See why 70-90% of requests waste bandwidth and when to use alternatives.
Master WebSocket protocol for real-time bidirectional communication over TCP. Learn handshakes, frames, and building low-latency web applications.
Learn adaptive tiling in vision transformers: dynamically partition images based on visual complexity to reduce token counts by up to 80% while preserving detail where it matters.
Explore emergent abilities in large language models: sudden capabilities that appear at scale thresholds, phase transitions, and the mirage debate, with interactive visualizations.
Master prompt engineering for large language models: from basic composition to Chain-of-Thought, few-shot, and advanced techniques with interactive visualizations.
Deep dive into how different prompt components influence model behavior across transformer layers, from surface patterns to abstract reasoning.
Explore neural scaling laws in deep learning: power law relationships between model size, data, and compute that predict AI performance, with interactive visualizations.
Learn visual complexity analysis in deep learning - how neural networks measure entropy, edges, and saliency for adaptive image processing.
Understand the fundamental differences between independent and joint encoding architectures for neural retrieval systems.
Interactive visualization of high-dimensional vector spaces, word relationships, and semantic arithmetic operations.
Matryoshka embeddings: nested representations enabling dimension reduction by simple truncation without model retraining for flexible retrieval.
Explore ColBERT and other multi-vector retrieval models that use fine-grained token-level matching for superior search quality.
Embedding quantization simulator: explore memory-accuracy trade-offs from float32 to int8 and binary representations for retrieval.
Compare lexical (BM25/TF-IDF) and semantic (BERT) retrieval approaches, understanding their trade-offs and hybrid strategies.
Interactive visualization of LLM context windows - sliding windows, expanding contexts, and attention patterns that define model memory limits.
Interactive Flash Attention visualization - the IO-aware algorithm achieving memory-efficient exact attention through tiling and kernel fusion.
Interactive KV cache visualization - how key-value caching in LLM transformers enables fast text generation without quadratic recomputation.
Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.
How vision-language models align visual and text representations using contrastive learning, cross-modal attention, and CLIP-style training.
The modality gap in CLIP and vision-language models: why image and text embeddings occupy separate regions despite contrastive training.
Discover how multimodal vision-language models like CLIP, ALIGN, and LLaVA scale with data, parameters, and compute following Chinchilla-style power laws.
Master LoRA, bottleneck adapters, and prefix tuning for parameter-efficient fine-tuning of vision-language models like LLaVA with minimal compute and memory.
Learn client-server communication patterns including short polling, long polling, and WebSockets. Compare HTTP protocols for real-time web applications.
Explore how C++ code is parsed into an Abstract Syntax Tree (AST). Learn lexical analysis, tokenization, and syntax parsing for systems programming.
Understand the complete C++ compilation pipeline from source code to object files. Learn preprocessing, parsing, code generation, and optimization stages.
Master C++ dynamic linking and runtime library loading. Learn shared libraries, position-independent code, dlopen, and systems-level library management.
How C++ object files are linked into executables. Learn symbol resolution, static vs dynamic linking, and linker optimization.
Understand how C++ programs are loaded and executed by the operating system. Learn ELF format, process creation, memory mapping, and runtime initialization.
Learn Resource Acquisition Is Initialization (RAII) - the cornerstone of C++ memory management. Understand automatic resource cleanup and exception safety.
Explore modern C++ features including auto, lambdas, ranges, and coroutines. Learn how C++11/14/17/20 transformed the language.
Master C++ OOP concepts including inheritance, polymorphism, virtual functions, and modern object-oriented design principles with interactive examples.
C++ compiler optimization: loop unrolling, inlining, dead code elimination. Learn GCC and Clang optimization flags and techniques.
Master C++ pointers and references through interactive visualizations. Learn memory addressing, dereferencing, smart pointers, and avoid common pitfalls.
C++ preprocessor visualized: macros, header guards, conditional compilation, and #include directives explained interactively.
Master C++11 smart pointers through interactive examples. Learn unique_ptr, shared_ptr, and weak_ptr with reference counting visualizations.
C++ stack vs heap memory allocation visualized. Learn LIFO stack frames, dynamic heap allocation, and memory management patterns.
C++ symbol resolution explained: how linkers fix undefined references, name mangling, weak vs strong symbols, and common linking errors.
Master C++ templates and the Standard Template Library. Learn generic programming, template metaprogramming, and STL containers and algorithms.
Learn how gradients propagate through deep neural networks during backpropagation. Understand vanishing and exploding gradient problems with interactive visualizations.
Master NVIDIA NCCL for multi-GPU deep learning. Learn AllReduce, ring algorithms, and GPU-Direct communication for efficient distributed training on CUDA.
Compare PyTorch DataParallel vs DistributedDataParallel for multi-GPU training. Learn GIL limitations, NCCL AllReduce, and DDP best practices.
Understanding how PyTorch DataLoader moves data from disk through CPU to GPU, including Dataset, Sampler, Workers, and Collate components.
Deep dive into PyTorch DataLoader num_workers parameter: how parallel workers prefetch data, optimal configuration, and common pitfalls.
Understanding PyTorch pin_memory for faster CPU to GPU data transfers using DMA (Direct Memory Access) and page-locked memory.
Learning where to fuse multi-scale features with per-pixel, per-level fusion weights. ASFF challenges FPN's uniform fusion assumption.
Understanding region-based feature extraction for object detection, from quantized pooling to sub-pixel alignment and adaptive sampling
Compare anchor-based vs anchor-free object detection: Faster R-CNN and RetinaNet anchors vs FCOS and CenterNet point-based methods.
Understanding how neural architecture search discovers optimal feature pyramid architectures that outperform hand-designed alternatives
Understanding end-to-end object detection with transformers, from DETR's object queries to bipartite matching and attention-based localization
Understanding Non-Maximum Suppression algorithms for object detection post-processing, from greedy NMS to soft variants
Understand the NAdam optimizer that fuses Adam adaptive learning rates with Nesterov look-ahead momentum for faster, smoother convergence in deep learning.
Learn how visual complexity analysis optimizes vision transformer token allocation using edge detection, FFT, and entropy metrics.
NVIDIA Tensor Cores explained: mixed-precision matrix operations delivering 10x speedups for AI training and inference on CUDA GPUs.
Learn layer normalization for transformers and sequence models: how normalizing across features enables batch-independent training with interactive visualizations.
Understand internal covariate shift in deep learning: why layer input distributions change during training, how it slows convergence, and how batch normalization fixes it.
Learn batch normalization in deep learning: how normalizing layer inputs accelerates training, improves gradient flow, and acts as regularization with interactive visualizations.
Learn how skip connections and residual learning enable training of very deep neural networks. Understand the ResNet revolution with interactive visualizations.
CPU performance optimization: memory hierarchy, cache blocking, SIMD vectorization, and profiling tools for modern processors.
C++ virtual tables (vtables) explained. Learn virtual dispatch, single/multiple inheritance, RTTI, and object memory layout visually.
Adaptive attention-based aggregation for graph neural networks - multi-head attention, learned weights, and interpretable graph learning
Understanding node importance through centrality measures, shortest paths, hop distances, clustering coefficients, and fundamental graph metrics
Learn Graph Convolutional Networks (GCN) with spectral theory, message passing, and node classification for geometric deep learning.
Learning low-dimensional vector representations of graphs through random walks, DeepWalk, Node2Vec, and skip-gram models
Hierarchical graph coarsening techniques - TopK, SAGPool, DiffPool, and readout operations for graph-level representations
Understanding sparse mixture of experts models - architecture, routing mechanisms, load balancing, and efficient scaling strategies for large language models
Deep dive into the fundamental processing unit of modern GPUs - the Streaming Multiprocessor architecture, execution model, and memory hierarchy
Tools, software, and hardware I use
My professional experience and qualifications
A curated collection of articles and resources I find valuable
Services and consulting offerings
Confirmation page after form submissions
Visual representation of the site structure
