Writing
GPUs, RAG, embeddings, and the rabbit holes in between — published here first, sometimes mirrored to Medium.
Tensor Parallelism Explained
How big models get sliced across GPUs: column-parallel vs row-parallel splits, why the order is column-then-row, and the two all-reduces that hold a transformer layer together.
Read →
Triton Is Not CUDA in Python — It's a Tiling DSL
Triton basics: tiles vs threads, program_id, tl.arange, masks, and autotuning for fast GPU kernels.
Read →
Why Computers Waste Memory on Purpose (And Why That Makes GPUs Fast)
From bits to quantization: why fixed-width memory representations look wasteful but enable massive parallel speed.
Read →
How I Managed to Build an iOS App from Windows Using Codemagic
A practical workflow for building, signing, and shipping Flutter iOS apps from Windows using Codemagic CI/CD.
Read →
How I built my AI clone that talks like me
A local-first AI clone built with Ollama and ChromaDB for real-time conversations shaped by my writing, projects, and personal style.
Read →
How I Fine-Tuned GPT-2 for Cricket Commentary
400k commentary rows, TensorFlow tricks, and plenty of slog-sweeps. This is how I trained a model to generate match-style commentary.
Read →