Writing — GPUs, RAG & GenAI

2026-06-23 · 9 min read

Why Tree All-Reduce Is 2·log₂(N), Not 2(N-1)

The ring's 2(N-1) step count grows linearly with core count: 510 steps at 256 cores. Tree all-reduce computes the identical sum but folds the cores up a binary tree and unfolds the result back down, costing 2·log₂(N) steps, or 16 at 256 cores. This post traces the same four-core matmul to show why the tree wins on latency for small tensors and loses on bandwidth for large ones.
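As a quick check on those step counts, here is a minimal sketch. The function names are illustrative, not taken from the post, and N is assumed to be a power of two so the tree is balanced.

```python
def ring_steps(n: int) -> int:
    """Ring all-reduce: (n - 1) reduce-scatter steps plus (n - 1) all-gather steps."""
    return 2 * (n - 1)


def tree_steps(n: int) -> int:
    """Tree all-reduce: log2(n) steps up the tree plus log2(n) steps back down.

    Assumption: n is a power of two, so the binary tree is perfectly balanced.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    # For a power of two, bit_length() - 1 is exactly log2(n).
    return 2 * (n.bit_length() - 1)


for n in (4, 256):
    print(f"N={n}: ring={ring_steps(n)} steps, tree={tree_steps(n)} steps")
```

These counts capture latency only. Each ring step moves just 1/N of the tensor, which is why the ring can still come out ahead on bandwidth for large tensors.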

Read →

2026-06-22 · 11 min read

Why All-Reduce Is Reduce-Scatter + All-Gather

All-reduce is two collectives glued together. This post traces real numbers through four cores to show why the cost is 2(N-1) steps, not N-1: one pass of N-1 steps to collapse the partials, and a second to send the sum back out to every core.
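The two phases can be sketched as a small single-process simulation. This is an illustrative sketch, not the post's own code: `ring_all_reduce` is a hypothetical helper, the cores are plain Python lists, and the input numbers are made up for the example.

```python
def ring_all_reduce(partials):
    """Simulate ring all-reduce on n cores.

    partials[i] is core i's partial result, split into n chunks (one number each).
    Returns (per-core data after the all-reduce, number of communication steps).
    """
    n = len(partials)
    data = [list(p) for p in partials]
    steps = 0

    # Phase 1, reduce-scatter: at step s, core i sends chunk (i - s) % n to its
    # right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [((i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] += value
        steps += 1
    # Now core i holds the complete sum for chunk (i + 1) % n only.

    # Phase 2, all-gather: at step s, core i forwards chunk (i + 1 - s) % n,
    # which is already complete, and the neighbour overwrites its stale copy.
    for s in range(n - 1):
        sends = [((i + 1 - s) % n, data[i][(i + 1 - s) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] = value
        steps += 1

    return data, steps


# Made-up partials for four cores, four chunks each.
partials = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
]
result, steps = ring_all_reduce(partials)
print(result[0], steps)
```

With four cores, every core ends up with the same summed vector after 3 + 3 = 6 steps, matching 2(N-1).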

Read →

2026-06-18 · 11 min read

Tensor Parallelism Explained

How big models get sliced across GPUs: column-parallel vs row-parallel splits, why the order is column-then-row, and the two all-reduces that hold a transformer layer together.

Read →

2026-02-27 · 14 min read

Triton Is Not CUDA in Python — It's a Tiling DSL

Triton basics: tiles vs threads, program_id, tl.arange, masks, and autotuning for fast GPU kernels.

Read →

2025-11-02 · 9 min read

Why Computers Waste Memory on Purpose (And Why That Makes GPUs Fast)

From bits to quantization: why fixed-width memory representations look wasteful but enable massive parallel speed.

Read →

2025-09-11 · 6 min read

How I Managed to Build an iOS App from Windows Using Codemagic

A practical workflow for building, signing, and shipping Flutter iOS apps from Windows using Codemagic CI/CD.

Read →

2025-08-04 · 7 min read

How I built my AI clone that talks like me

A local-first AI clone built with Ollama and ChromaDB for real-time conversations shaped by my writing, projects, and personal style.

Read →

2025-06-18 · 8 min read

How I Fine-Tuned GPT-2 for Cricket Commentary

400k commentary rows, TensorFlow tricks, and plenty of slog-sweeps. This is how I trained a model to generate match-style commentary.

Read →