2026-02-27 · 14 min read

Triton Is Not CUDA in Python — It's a Tiling DSL

Triton basics: tiles vs threads, program_id, tl.arange, masks, and autotuning for fast GPU kernels.

TritonGPU ProgrammingCUDAKernel OptimizationDeep Learning Infrastructure

The wrong mental model

I was halfway through my first Triton kernel when I realized I was thinking about it wrong. I was treating Triton like CUDA with Python syntax: one program instance per element, manual index math, the whole single-thread mental model. The kernel was 30x slower than it should have been.

Triton is not CUDA in Python. It is a completely different programming model that happens to compile down to the same GPU hardware. The difference is fundamental enough that your CUDA intuitions will actively mislead you.

CUDA: one thread, one element

In CUDA, you write code from the perspective of a single thread. One thread, one element. You launch thousands of threads organized into blocks, and each thread figures out which element it owns using threadIdx and blockIdx.

But then you must ensure threads within a warp access consecutive memory for coalescing, manually manage shared memory and synchronization barriers, and handle warp divergence. Getting all of this right is the difference between 10% and 90% hardware utilization.

Memory coalescing requires consecutive thread-to-address mapping
Shared memory (__shared__) and __syncthreads() must be managed manually
Warp divergence from if/else branches kills throughput

Triton: one program instance, one tile

Instead of writing code for one thread processing one element, you write code for one program instance processing an entire tile of elements. Not one value — a whole block of them, loaded, computed, and stored as a unit.

There is no threadIdx in Triton. You never touch shared memory. You never call a synchronization barrier. The compiler handles all of that based on the tile-level operations you describe.

The core inversion: CUDA gives you scalar programs with blocked threads. Triton gives you blocked programs with scalar threads handled by the compiler.

program_id, tl.arange, and tile ownership

When you launch a Triton kernel, you launch a grid of program instances, each identified by a unique program_id. tl.program_id(axis=0) maps directly to CUDA blockIdx.x under the hood, but you never interact with the thread level.

tl.arange(0, BLOCK_SIZE) generates a vector of consecutive integers — the local offset within a tile. pid * BLOCK_SIZE shifts those offsets to the correct global position. Every variable in a Triton kernel is a tile, not a scalar. This is the mental model shift that makes everything work.

offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) partitions data with no overlap and full coverage
Forgetting to multiply by pid makes every instance process the same first tile — a silent correctness bug
tl.load fetches an entire tile simultaneously, returning a tensor of values

Masks: protecting boundary tiles

When data size is not perfectly divisible by block size (which is almost always), the last tile reaches out of bounds. On a GPU there is no bounds checking — tl.load reads whatever bytes sit at that address, and tl.store can corrupt other tensors or crash the process.

Masks are boolean tensors that guard every load and store. When mask[i] is False, tl.load returns the other value instead of touching memory, and tl.store simply skips that position.

other=0.0 works for addition and sum reductions
other=-float("inf") is required for max operations like softmax — zeros would become spurious maxima for all-negative rows
Every operation has its own correct identity element: 0 for sum, inf for min, 1 for multiplication

Scaling to matrix multiplication

For matmul, you tile the output matrix C into rectangular chunks and launch a 2D grid of program instances. Each instance identifies itself with pid_m and pid_n, owning one output tile.

To compute a 128x128 output tile from 2048x2048 inputs, each program instance iterates over the K dimension in chunks. Each iteration loads a 128x128 slice of A and B, multiplies them with tl.dot (mapped to tensor core instructions), and accumulates the partial result. After K/128 iterations, the output tile is complete.

2D tile offsets use tl.arange with broadcasting: row_offsets[:, None] + col_offsets[None, :]
256 program instances run in parallel, each iterating through 16 K-dimension chunks sequentially
Masks protect boundary tiles the same way as in the 1D case

Autotuning and performance

The single most impactful knob is BLOCK_SIZE. Benchmarks show BLOCK_SIZE=4 versus an autotuned size can produce a 95x performance gap. With tiny blocks, the GPU cannot hide memory latency — there is nothing to overlap while threads wait for HBM data.

Triton provides @triton.autotune to search across configurations. The key parameter triggers re-search when problem size changes. num_warps controls parallelism within a tile — more warps mean more parallelism but also more register pressure.

For matmul, the order in which program instances map to output tiles matters enormously for L2 cache reuse. Grouped ordering — processing tiles in clusters so nearby instances share input data — jumps performance from roughly 220 to 245 TFLOPS on an A100.

Algorithm-level fusion delivers the largest gains: a fused Triton softmax does 2 HBM accesses instead of 8, roughly 4x less memory traffic
Triton matmul achieves over 90% of peak device throughput, matching cuBLAS
Production Triton kernels typically hit 80-95% of expert CUDA performance with a fraction of the code

What the compiler does for you

The compilation pipeline has six stages: Python to Triton IR, to Triton GPU IR, to LLVM IR, to PTX, to cubin. The magic happens in the Triton IR to GPU IR stage, where the compiler assigns thread-to-element layouts for memory coalescing, manages shared memory and synchronization barriers, lowers tl.dot to tensor core MMA instructions, and inserts software pipelining to overlap compute with memory fetches.

What you still control: tile sizes, grid dimensions, which data to load and when, the algorithm itself. You make the architectural decisions. The compiler handles the thousands of micro-decisions that make GPU code fast.