Sitemap
A visual representation of the site structure to help you navigate through the content.
Site Structure
Main landing page with introduction and recent articles
Learn more about me, my background, and expertise
My talks, presentations, and speaking engagements
Collection of articles I've written on various topics
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Article content
Research papers and publications
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Paper content
Interactive explanations of machine learning concepts
How the Calinski-Harabasz index evaluates clustering quality by measuring the ratio of between-cluster to within-cluster variance — fast, intuitive, and ideal for k-selection with convex clusters.
How dense embeddings turn meaning into geometry: word2vec, GloVe, and contextual models, vector arithmetic, cosine similarity, and where the field is heading.
Complete guide to Slurm — architecture, core commands, job lifecycle, job scripts, array jobs, dependencies, monitoring with squeue/sacct, and troubleshooting failed jobs on HPC clusters.
Explore CPython bytecode compilation from source to .pyc files. Learn the dis module, PVM stack operations, and Python 3.11+ adaptive specialization.
PyTorch DataLoader deep dive — Dataset, Sampler, Workers, Collate internals, num_workers throughput profiling, memory analysis, serialization costs, production patterns (LMDB, WebDataset), and bottleneck diagnosis.
Deep dive into C++ memory allocation — stack frame internals, heap allocator mechanics, fragmentation, performance benchmarks, custom allocators, RAII, and debugging with AddressSanitizer and Valgrind.
Deep dive into CPU cache lines — interactive cache simulator with configurable associativity and replacement policies, false sharing MESI protocol visualization, access pattern benchmarks, and optimization techniques.
Explore the inner workings of RAM through beautiful animations and interactive visualizations. Understand memory cells, addressing, and the memory hierarchy.
Learn how initramfs enables Linux boot by loading essential drivers before the root filesystem mounts. Explore early userspace initialization.
Linux kernel architecture explained. Learn syscalls, protection rings, user vs kernel space, and what happens when you run a command.
How the silhouette score measures clustering quality for every individual point — comparing intra-cluster cohesion to nearest-cluster separation, with per-point diagnostics that work for arbitrary cluster shapes.
How sparse retrieval (BM25/TF-IDF), dense retrieval (BERT-style embeddings), and hybrid systems that combine both compare on recall, semantic understanding, computational cost, and operational complexity for modern search.
How HBM works: 3D-stacked DRAM, TSVs, and silicon interposers explained with interactive visualizations — from the memory wall to HBM4 and the roofline model.
Master GPU memory hierarchy from registers to global memory, understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance
How GPUs talk: the bandwidth cliff from HBM to Ethernet, NVLink 5 and GB200 NVL72 topologies, ring AllReduce step by step, and choosing between NCCL, Gloo, and MPI.
Complete guide to GPU allocation on Slurm — --gres flags, CUDA_VISIBLE_DEVICES remapping, GPU topology and NVLink binding, MIG partitioning, production job scripts, and debugging common GPU errors.
A practical mental model for CPython memory management: names and references, object headers, PyMalloc arenas, reference counting, reuse paths, and memory profiling.
Explore Linux filesystems through interactive visuals. Learn VFS, compare ext4 vs Btrfs vs ZFS, and understand file operations.
Master virtual memory and TLB address translation with interactive demos. Learn page tables, page faults, and memory management optimization.
How the Davies-Bouldin index evaluates clustering quality by finding each cluster's most similar neighbor — a pessimistic, worst-case metric that catches overlapping cluster pairs.
Master the BM25 algorithm, the probabilistic ranking function powering Elasticsearch and Lucene for keyword-based document retrieval and search systems.
How Slurm decides which jobs run first — priority factors, fair-share scheduling, backfill, and monitoring commands (squeue, sinfo, sacct).
NVIDIA Unified Virtual Memory (UVM): on-demand page migration, memory oversubscription, and simplified CPU-GPU memory management.
Learn the CPython Global Interpreter Lock (GIL) from first principles: why it exists, how threads take turns, why I/O still works well, and when to use multiprocessing, asyncio, or native extensions.
Complete guide to PyTorch pin_memory — how DMA transfers work, when pinning helps vs hurts, NUMA effects, profiling with torch.profiler, num_workers interaction, and debugging slow data loading.
Complete guide to C++ symbol resolution — how linkers match references to definitions, name mangling, strong vs weak symbols, ODR, template instantiation, linking order, and debugging undefined reference errors.
Learn how filesystem journaling prevents data loss during crashes. Explore write-ahead logging and recovery in ext4 and XFS.
Understand Linux inodes - the metadata structures behind every file. Learn about hard links, soft links, and inode limits.
Master sequential vs strided memory access patterns. Learn how cache efficiency and hardware prefetching affect application performance.
How a transformer’s per-token outputs become one embedding: CLS, mean, max, last-token, and attention pooling — what each does and when to use it.
Flynn's Classification explained — SISD, SIMD, MISD, MIMD with interactive architecture explorer, SIMD evolution from MMX to AMX, branch divergence visualization, and workload-architecture throughput comparison.
Complete MPI guide — point-to-point and collective communication with real C and mpi4py code, deadlock simulation, performance benchmarking, communicator splitting, and debugging on HPC clusters.
OpenMP parallel programming: fork-join model, scheduling, data races, false sharing, NUMA thread affinity, and GPU offloading.
CUDA page migration and fault handling between CPU and GPU memory. Learn TLB management, DMA transfers, and memory optimization.
How Slurm tracks resource consumption through account hierarchies, TRES billing, and resource limits — sacctmgr, sreport, and the association model explained.
How C++ programs are loaded — ELF segments, the _start to main() chain, dynamic linking with PLT/GOT, ASLR, real readelf/strace/proc maps output, and startup debugging.
Learn how CPython implements PyObject, type objects, and the unified object model. Explore reference counting, memory layout, and Python internals.
Understand Copy-on-Write (CoW) in Btrfs and ZFS. Learn how CoW enables instant snapshots, atomic writes, and data integrity.
Learn FUSE (Filesystem in Userspace) for building custom filesystems. Understand how NTFS-3G, SSHFS, and cloud storage work.
Master contrastive learning for vector embeddings: how InfoNCE loss and self-supervised techniques train models to create high-quality semantic representations.
Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.
Mastering HPC performance — Amdahl's Law, Gustafson's Law, strong vs weak scaling, roofline model, communication-computation overlap, load balancing, and profiling with Nsight and VTune.
How sched/backfill works — the algorithm that lets small jobs run in gaps while large jobs wait, why accurate time limits matter, and the key tuning parameters (bf_interval, bf_window, bf_max_job_test).
Understand CPython garbage collection: reference counting, generational GC for circular references, weak references, and gc module tuning strategies.
Explore ext4, the default Linux filesystem with journaling, extents, and proven reliability. Learn how ext4 protects your data.
Matryoshka embeddings: nested representations enabling dimension reduction by simple truncation without model retraining for flexible retrieval.
Learn a profiler-first Python optimization workflow: measure bottlenecks, choose the right lever, and verify performance changes.
Deep dive into CPU pipeline architecture covering 5-stage RISC pipelines, data hazards, control hazards, superscalar execution, and out-of-order processing.
Explore CPU pipeline stages, instruction-level parallelism, pipeline hazards, and branch prediction through interactive visualizations.
Master pipeline hazards through interactive visualizations of data dependencies, control hazards, structural conflicts, and advanced detection mechanisms.
Master Linux mount options like noatime and async for performance tuning and security hardening. Interactive guide to fstab configuration.
NTFS internals from the Master File Table outward: 1 KB attribute records, resident vs non-resident $DATA, run lists, alternate data streams, the $LogFile journal, and why dual-boot Linux distros prefer ntfs3 over ntfs-3g.
Domain adaptation for embeddings: transfer learning to fine-tune retrieval models across domains while preventing catastrophic forgetting.
Learn when Python __slots__ reduces memory, how slot storage differs from __dict__, and the caveats for dataclasses and inheritance.
Learn the Btrfs filesystem with built-in snapshots, RAID, and compression. Explore copy-on-write, subvolumes, and self-healing on Linux.
Understand how modern filesystems use Merkle-tree checksums and mirrored pools to detect, repair, and proactively scrub silent data corruption that ext4 and XFS miss entirely.
Learn cross-lingual embedding alignment techniques like VecMap and MUSE for multilingual vector retrieval and zero-shot language transfer in search systems.
Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.
Complete guide to Python concurrency — OS threads, green threads (asyncio), the GIL, event loop internals, Python 3.13 free-threading, and production patterns.
Master ZFS filesystem with pooled storage, RAID-Z, snapshots, and checksums. Learn enterprise-grade data integrity on Linux.
Understand the fundamental differences between independent and joint encoding architectures for neural retrieval systems.
Deep dive into Python's asyncio library, understanding event loops, coroutines, tasks, and async/await patterns with interactive visualizations.
XFS internals end-to-end: allocation groups for lock-free parallel metadata, B+ trees instead of bitmaps, extent-based allocation that scales to terabytes, and delayed allocation that turns scattered writes into contiguous extents.
Explore ColBERT and other multi-vector retrieval models that use fine-grained token-level matching for superior search quality.
Master Python multiprocessing.shared_memory for zero-copy IPC. Learn synchronization, NumPy integration, and race condition prevention patterns.
Learn FAT32 and exFAT filesystems for cross-platform USB drives and SD cards. Understand file size limits and compatibility.
Build hybrid retrieval systems combining BM25 sparse search with dense vector embeddings using reciprocal rank fusion for superior semantic search performance.
Learn how memory controllers manage CPU-RAM data flow. Interactive demos of channels, ranks, banks, and command scheduling for optimal bandwidth.
Discover how memory interleaving distributes addresses across banks for parallel access. Boost memory bandwidth in DDR5 and GPU systems.
Explore NUMA architecture and memory locality in multi-socket systems. Understand local vs remote memory access latency and optimization strategies.
RAID storage visualized: RAID 0, 1, 5, 6, and 10 levels explained. Learn how they work, when to use them, and disk failure recovery.
Master Structure of Arrays (SoA) vs Array of Structures (AoS) data layouts for optimal cache efficiency, SIMD vectorization, and GPU memory coalescing.
Embedding quantization simulator: explore memory-accuracy trade-offs from float32 to int8 and binary representations for retrieval.
Master Linux process management through interactive visualizations. Understand process lifecycle, fork/exec operations, zombies, orphans, and CPU scheduling.
Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.
GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.
Explore Linux memory management through interactive visualizations. Understand virtual memory, page tables, TLB, swapping, and memory allocation.
Learn how Transparent Huge Pages (THP) reduces TLB misses by promoting 4KB to 2MB pages. Understand performance benefits and memory bloat tradeoffs.
Learn how binary embeddings use 1-bit quantization for ultra-compact vector representations, enabling billion-scale similarity search with 32x memory reduction.
Linux system calls visualized: how user programs communicate with the kernel, protection rings, context switching, and syscall performance.
Explore the fundamental data structures powering vector databases: trees, graphs, hash tables, and hybrid approaches for efficient similarity search.
Learn client-server communication patterns including short polling, long polling, and WebSockets. Compare HTTP protocols for real-time web applications.
Learn HTTP long polling - a server-side technique that holds connections open until data arrives. Achieve near real-time updates with standard protocols.
Master the Linux networking stack through interactive visualizations. Understand TCP/IP layers, sockets, iptables, routing, and network namespaces.
Learn short polling in networking - a simple HTTP pattern for periodic data fetching. See why 70-90% of requests waste bandwidth and when to use alternatives.
Explore the TCP/IP protocol stack, packet encapsulation, and how data travels through network layers from application to physical transmission.
Master WebSocket protocol for real-time bidirectional communication over TCP. Learn handshakes, frames, and building low-latency web applications.
How HNSW, IVF-PQ, and LSH compare for approximate nearest neighbor (ANN) search — recall, latency, memory, build cost, and update characteristics — with Annoy, ScaNN, and DiskANN included for completeness.
Visualize the complete Linux boot sequence from BIOS/UEFI to login. Learn how GRUB, kernel, and systemd work together with interactive visualizations.
Explore how LSH uses probabilistic hash functions to find similar vectors in sub-linear time, perfect for streaming and high-dimensional data.
Compare Linux init systems through interactive visualizations. Understand the evolution from SysV Init to systemd, service management, and boot orchestration.
Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.
Master Linux kernel modules through interactive visualizations. Learn how to load, unload, develop, and debug kernel modules that extend Linux functionality.
How HNSW navigates a layered proximity graph to find nearest neighbors in logarithmic time — the default in-memory index of modern vector databases.
Master Linux namespaces — the kernel mechanism that makes containers possible. Learn how mount, PID, network, and user namespaces create isolated environments, with interactive demos.
Compare Wayland vs X11 display servers on Linux. Learn about architecture, performance, security, and modern graphics stack.
Master Linux cgroups to limit CPU, memory, and I/O for process groups. Understand cgroups v1 vs v2, the hierarchical structure, and how containers use them.
Discover how containers work by combining namespaces, cgroups, and OverlayFS. Build a mental model of Docker internals through interactive visualizations.
Understand how containerized processes access GPU hardware through device files, bind mounts, and the NVIDIA container runtime. Learn the kernel driver vs user-space library distinction.
Understanding complete, dimensional, and cluster collapse — the failure modes that every self-supervised method must prevent. Learn why collapse happens and how contrastive, asymmetric, regularization, and masking approaches solve it.
Learn nvidia-modeset for display configuration on Linux. Understand kernel mode-setting, DRM integration, and GPU drivers.
BatchNorm normalizes over the batch and spatial axes; LayerNorm normalizes over the channel and spatial axes for each sample. The choice changes whether your model trains stably with batch=1, depends on batch composition at inference, and behaves consistently across train and eval.
How CUDA contexts, streams, and MPS compare: a context is a per-process container of GPU state, a stream is an in-order queue inside a context, and MPS lets multiple processes share a single GPU concurrently. Three layers, three different problems.
A CUDA stream is a queue of GPU operations that execute in order. Understanding streams is the difference between a GPU at 30% utilization and one running flat out — they are how kernels and memory copies overlap on real hardware.
Learn how Feature Pyramid Networks build multi-scale feature representations through top-down pathways and lateral connections for robust object detection.
Interactive guide to convolution in CNNs: visualize sliding windows, kernels, stride, padding, and feature detection with step-by-step demos.
Understand cross-entropy loss for classification: interactive demos of binary and multi-class CE, the -log(p) curve, softmax gradients, and focal loss.
Understand dilated (atrous) convolutions: how dilation rates expand receptive fields exponentially without extra parameters and how to avoid gridding artifacts.
Understand receptive fields in CNNs: how convolutional layers expand their field of view and the gap between theoretical and effective receptive fields.
Explore VAE latent space in deep learning. Learn variational autoencoder encoding, decoding, interpolation, and the reparameterization trick.
Complete C++ thread safety guide — race conditions with step-through simulation, mutexes, atomics, condition variables, deadlock detection, memory ordering, and Thread Sanitizer walkthrough.
Automate NVIDIA GPU management in Kubernetes with the GPU Operator. Deploy drivers, device plugins, and monitoring as DaemonSets.
Explore the concept of CUDA contexts, their role in managing GPU resources, and how they enable parallel execution across multiple CPU threads.
Learn how the CLS token acts as a global information aggregator in Vision Transformers, enabling whole-image classification through attention mechanisms.
Explore how hierarchical attention enables Vision Transformers (ViT) to process sequential data by encoding relative positions.
Explore how multi-head attention enables Vision Transformers (ViT) to process sequential data by encoding relative positions.
Explore how positional embeddings enable Vision Transformers (ViT) to process sequential data by encoding relative positions.
Explore how self-attention enables Vision Transformers (ViT) to understand images by capturing global context, with CNN comparison.
Understand contrastive loss for representation learning: interactive demos of InfoNCE, triplet loss, and embedding space clustering with temperature tuning.
Understand dropout regularization: how randomly silencing neurons prevents overfitting, the inverted dropout trick, and when to use each dropout variant.
Learn focal loss for deep learning: down-weight easy examples, focus on hard ones. Interactive demos of gamma, alpha balancing, and RetinaNet.
Learn He (Kaiming) initialization for ReLU networks: why ReLU needs special weight initialization, variance flow, and dead neurons explained.
Learn KL divergence for machine learning: measure distribution differences in VAEs, knowledge distillation, and variational inference.
Interactive guide to MSE vs MAE for regression: explore outlier sensitivity, gradient behavior, and Huber loss with visualizations.
Learn Xavier (Glorot) initialization: how it balances forward signals and backward gradients to enable stable deep network training with tanh and sigmoid.
Learn ALiBi, the position encoding method that adds linear biases to attention scores for exceptional length extrapolation in transformers.
How Flash Attention, Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA) compare — algorithm vs architecture, KV-cache memory, quality trade-offs, and how to choose for production transformer inference.
Learn about attention sinks, where LLMs concentrate attention on initial tokens, and how preserving them enables streaming inference.
Understand cross-attention, the mechanism that enables transformers to align and fuse information from different sources, sequences, or modalities.
Learn how Grouped-Query Attention (GQA) balances Multi-Head quality with Multi-Query efficiency for faster LLM inference.
Explore linear complexity attention mechanisms including Performer, Linformer, and other efficient transformers that scale to very long sequences.
Learn how masked attention enables autoregressive generation and prevents information leakage in transformers and language models.
Learn Multi-Query Attention (MQA), the optimization that shares keys and values across attention heads for massive memory savings.
Learn Rotary Position Embeddings (RoPE), the elegant position encoding using rotation matrices, powering LLaMA, Mistral, and modern LLMs.
Master scaled dot-product attention, the fundamental transformer building block. Learn why scaling is crucial for stable training.
Sliding Window Attention for long sequences: local context windows enable O(n) complexity, used in Mistral and Longformer models.
Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.
Eliminating GPU initialization latency through nvidia-persistenced - a userspace daemon that maintains GPU driver state for optimal startup performance.
Learn adaptive tiling in vision transformers: dynamically partition images based on visual complexity to reduce token counts while preserving detail.
Explore emergent abilities in large language models: sudden capabilities at scale thresholds, phase transitions, and the mirage debate.
Master prompt engineering for large language models: from basic composition to Chain-of-Thought, few-shot, and advanced techniques.
Deep dive into how different prompt components influence model behavior across transformer layers, from surface patterns to abstract reasoning.
Explore neural scaling laws in deep learning: power law relationships between model size, data, and compute that predict AI performance.
Learn visual complexity analysis in deep learning - how neural networks measure entropy, edges, and saliency for adaptive image processing.
How vision-language models align visual and text representations using contrastive learning, cross-modal attention, and CLIP-style training.
Interactive visualization of LLM context windows - sliding windows, expanding contexts, and attention patterns that define model memory limits.
Interactive Flash Attention visualization - the IO-aware algorithm achieving memory-efficient exact attention through tiling and kernel fusion.
Interactive KV cache visualization - how key-value caching in LLM transformers enables fast text generation without quadratic recomputation.
The modality gap in CLIP and vision-language models: why image and text embeddings occupy separate regions despite contrastive training.
Discover how multimodal vision-language models like CLIP, ALIGN, and LLaVA scale with data, parameters, and compute following Chinchilla-style power laws.
Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.
Master LoRA, bottleneck adapters, and prefix tuning for parameter-efficient fine-tuning of vision-language models like LLaVA with minimal compute and memory.
Learn how gradients propagate through deep neural networks during backpropagation. Understand vanishing and exploding gradient problems.
A deep dive into NCCL internals: communicators and channels, how it picks ring/tree/NVLS algorithms and LL/LL128/Simple protocols, reading NCCL_DEBUG logs, and tuning and debugging distributed training.
Explore how C++ code is parsed into an Abstract Syntax Tree (AST). Learn lexical analysis, tokenization, and syntax parsing for systems programming.
Understand the complete C++ compilation pipeline from source code to object files. Learn preprocessing, parsing, code generation, and optimization stages.
Deep dive into dynamic linking — GOT/PLT lazy resolution, shared library creation, SONAME versioning, RPATH/RUNPATH, dlopen plugin systems, LD_PRELOAD, and debugging with LD_DEBUG.
How C++ object files are linked into executables. Learn symbol resolution, static vs dynamic linking, and linker optimization.
Learn Resource Acquisition Is Initialization (RAII) - the cornerstone of C++ memory management. Understand automatic resource cleanup and exception safety.
Explore modern C++ features including auto, lambdas, ranges, and coroutines. Learn how C++11/14/17/20 transformed the language.
Master C++ OOP concepts including inheritance, polymorphism, virtual functions, and modern object-oriented design principles with interactive examples.
C++ compiler optimization lab notebook — compare optimization levels, inspect compiler rewrites, diagnose auto-vectorization, and build production verification commands.
Master C++ pointers and references through interactive visualizations. Learn memory addressing, dereferencing, smart pointers, and avoid common pitfalls.
C++ preprocessor visualized: macros, header guards, conditional compilation, and #include directives explained interactively.
Master C++11 smart pointers through interactive examples. Learn unique_ptr, shared_ptr, and weak_ptr with reference counting visualizations.
Master C++ templates and the Standard Template Library. Learn generic programming, template metaprogramming, and STL containers and algorithms.
Compare PyTorch DataParallel vs DistributedDataParallel for multi-GPU training. Learn GIL limitations, NCCL AllReduce, and DDP best practices.
Deep dive into PyTorch DataLoader num_workers parameter: how parallel workers prefetch data, optimal configuration, and common pitfalls.
Learning where to fuse multi-scale features with per-pixel, per-level fusion weights. ASFF challenges FPN's uniform fusion assumption.
Understanding region-based feature extraction for object detection, from quantized pooling to sub-pixel alignment and adaptive sampling
Compare anchor-based vs anchor-free object detection: Faster R-CNN and RetinaNet anchors vs FCOS and CenterNet point-based methods.
Understanding how neural architecture search discovers optimal feature pyramid architectures that outperform hand-designed alternatives
Understanding end-to-end object detection with transformers, from DETR's object queries to bipartite matching and attention-based localization
Understanding Non-Maximum Suppression algorithms for object detection post-processing, from greedy NMS to soft variants
Understand the NAdam optimizer that fuses Adam adaptive learning rates with Nesterov look-ahead momentum for faster, smoother convergence in deep learning.
Learn how visual complexity analysis optimizes vision transformer token allocation using edge detection, FFT, and entropy metrics.
NVIDIA Tensor Cores explained: architecture-, precision-, and workload-dependent matrix acceleration for AI training and inference on CUDA GPUs.
Learn layer normalization for transformers and sequence models: how normalizing across features enables batch-independent training.
Understand internal covariate shift: why layer input distributions change during training, how it slows convergence, and how batch norm fixes it.
Learn batch normalization in deep learning: how normalizing layer inputs accelerates training, improves gradient flow, and acts as regularization.
Learn how skip connections and residual learning enable training of very deep neural networks. Understand the ResNet revolution with interactive visualizations.
C++ virtual tables (vtables) explained. Learn virtual dispatch, single/multiple inheritance, RTTI, and object memory layout visually.
Adaptive attention-based aggregation for graph neural networks - multi-head attention, learned weights, and interpretable graph learning
Understanding node importance through centrality measures, shortest paths, hop distances, clustering coefficients, and fundamental graph metrics
Learn Graph Convolutional Networks (GCN) with spectral theory, message passing, and node classification for geometric deep learning.
Learning low-dimensional vector representations of graphs through random walks, DeepWalk, Node2Vec, and skip-gram models
Hierarchical graph coarsening techniques - TopK, SAGPool, DiffPool, and readout operations for graph-level representations
Understanding sparse mixture of experts models - architecture, routing mechanisms, load balancing, and efficient scaling strategies for large language models
Deep dive into the fundamental processing unit of modern GPUs - the Streaming Multiprocessor architecture, execution model, and memory hierarchy
Tools, software, and hardware I use
My professional experience and qualifications
A curated collection of articles and resources I find valuable
Services and consulting offerings
Visual representation of the site structure
