2026-06-18 · 11 min read

Tensor Parallelism Explained

How big models get sliced across GPUs: column-parallel vs row-parallel splits, why the order is column-then-row, and the two all-reduces that hold a transformer layer together.

Tensor Parallelism · Distributed Training · GPU Infrastructure · Transformers · LLM Inference

Why split a model across GPUs at all

When a model gets large enough — think 70B parameters — a single GPU simply cannot hold it. The weights do not fit in memory, and even if they did, pushing every token through one device would be painfully slow. Tensor parallelism solves both problems at once: it slices the individual weight matrices across several GPUs so that each device stores only a fraction of the parameters and does only a fraction of the math.

The key word is individual. This is not pipeline parallelism, where different GPUs own different layers. In tensor parallelism, every GPU works on the same layer at the same time, each holding a different shard of that layer's weight matrices. It is a multi-device strategy by definition — on a single GPU there is nothing to parallelize.
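To put rough numbers on the memory argument, here is a back-of-the-envelope sketch. The 16-bit weight format and the 80 GB device capacity are assumptions chosen for illustration, not figures from this article, and the even split ignores the small replicated pieces.

```python
# Back-of-the-envelope weight memory. All constants are illustrative assumptions.
PARAMS = 70e9          # 70B parameters, the size mentioned in the text
BYTES_PER_PARAM = 2    # assumes 16-bit weights (fp16 or bf16)
GPU_MEMORY_GB = 80     # assumed capacity of one accelerator


def weight_gb_per_gpu(params, bytes_per_param, tp_degree):
    """Weight memory per GPU in GB if the weights split evenly over tp_degree GPUs."""
    return params * bytes_per_param / tp_degree / 1e9


for tp_degree in (1, 2, 4, 8):
    per_gpu = weight_gb_per_gpu(PARAMS, BYTES_PER_PARAM, tp_degree)
    verdict = "fits" if per_gpu < GPU_MEMORY_GB else "does not fit"
    print(f"TP={tp_degree}: {per_gpu:.1f} GB of weights per GPU "
          f"({verdict} in {GPU_MEMORY_GB} GB)")
```

This counts weights only. Activations, the KV cache, and (in training) gradients and optimizer state all need room on top of it.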

Where the splitting actually happens

A transformer layer is mostly two big blocks: attention and the MLP. Those two blocks hold the overwhelming majority of the parameters and the FLOPs, so that is exactly where tensor parallelism focuses:

Attention: the QKV projection and the output projection
MLP: the up-projection and the down-projection

Everything else — LayerNorm, residual adds, the activation function — is cheap. Those stay replicated: every GPU keeps a full copy and runs them identically on identical data. There is no point sharding an operation that costs almost nothing, and replicating it avoids communication.

Column-parallel: split the output dimension

The first way to cut a weight matrix is along its output (column) dimension. Each GPU keeps the full input X (which is replicated across the group) and only a vertical slice of the weight W.

GPU 0 computes X @ W₀, GPU 1 computes X @ W₁, and so on. The outputs are slices of the final result sitting side by side:

Y = [ X @ W₀ | X @ W₁ ]

No communication is needed to produce these slices — every GPU already has the full X it needs. The result just comes out sharded along the feature dimension.
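A minimal sketch of the column split, using plain Python lists so it runs without dependencies. The two "GPUs" are just two local variables, and the helper names (`matmul`, `split_columns`, `concat_columns`) are made up for this illustration.

```python
def matmul(a, b):
    """Plain-Python matrix multiply on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Cut a weight matrix into `parts` equal vertical slices."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def concat_columns(blocks):
    """Place per-GPU output slices side by side."""
    return [sum((blk[r] for blk in blocks), []) for r in range(len(blocks[0]))]


X = [[1, 2, 3], [4, 5, 6]]                          # 2 tokens, 3 features, replicated
W = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1]]      # 3 inputs -> 4 outputs

W0, W1 = split_columns(W, parts=2)     # each "GPU" owns 2 of the 4 output columns
Y0, Y1 = matmul(X, W0), matmul(X, W1)  # computed independently, nothing exchanged
Y = concat_columns([Y0, Y1])           # only needed to compare with the full result

assert Y == matmul(X, W)
```

The concatenation is there only to check the result. In a real layer the slices stay where they are and feed the next operation directly.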

In attention, this is the natural place to split by heads. Each GPU is handed a subset of attention heads and runs the entire softmax for those heads locally. Heads are independent of one another, so there is nothing to coordinate — the attention math runs with zero inter-GPU communication.
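The head split can be sketched the same way. The example below is a simplified, unmasked attention with made-up weights and two heads of width 2. It assumes the Q, K, and V projection columns are laid out head by head, so that a column slice at a head boundary hands a rank complete heads.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cols(m, start, stop):
    """Take columns [start, stop) of a matrix."""
    return [row[start:stop] for row in m]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_heads(x, wq, wk, wv, head_dim):
    """Unmasked softmax attention for every head packed into wq / wk / wv."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    outputs = []
    for start in range(0, len(wq[0]), head_dim):
        qh, kh, vh = (cols(t, start, start + head_dim) for t in (q, k, v))
        scores = matmul(qh, [list(c) for c in zip(*kh)])        # q @ k^T
        probs = [softmax([s / math.sqrt(head_dim) for s in row]) for row in scores]
        outputs.append(matmul(probs, vh))
    return [sum((blk[r] for blk in outputs), []) for r in range(len(x))]


X = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]                        # 3 tokens, replicated
Wq = [[1.0, 0.5, -1.0, 0.0], [0.0, 1.0, 0.5, 2.0]]
Wk = [[0.5, -0.5, 1.0, 1.0], [1.0, 0.0, -1.0, 0.5]]
Wv = [[1.0, 2.0, 0.0, -1.0], [0.5, 0.0, 1.0, 1.0]]

full = attention_heads(X, Wq, Wk, Wv, head_dim=2)

# Each rank gets the columns of one head and never sees the other head.
per_rank = [
    attention_heads(X, cols(Wq, s, s + 2), cols(Wk, s, s + 2), cols(Wv, s, s + 2), head_dim=2)
    for s in (0, 2)
]
gathered = [r0 + r1 for r0, r1 in zip(*per_rank)]

assert all(abs(a - b) < 1e-12 for ra, rb in zip(gathered, full) for a, b in zip(ra, rb))
```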

Row-parallel: split the input dimension

The second way is to cut the weight along its input (row) dimension — the contraction dimension. To make the matrix multiply line up, the input has to be split along its columns to match.

Now each GPU computes a full-sized partial sum over its slice of the contraction dimension: GPU 0 produces P₀, GPU 1 produces P₁. Neither is the real answer on its own. The true output is their sum:

Y = P₀ + P₁

That summation lives on different devices, so producing the final result requires an all-reduce — the collective that adds every GPU's partial and hands the total back to all of them. This is the one unavoidable communication point of a row-parallel layer.
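The row split and its all-reduce can be sketched as follows. `all_reduce_sum` is a local stand-in written for this example, not a call into any communication library.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def all_reduce_sum(partials):
    """Simulated all-reduce: sum elementwise, then give every rank its own copy."""
    total = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    return [[row[:] for row in total] for _ in partials]


X = [[1, 2, 3, 4], [5, 6, 7, 8]]                    # 2 tokens, 4 features
W = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 3, 1]]    # 4 inputs -> 3 outputs

# Split the contraction dimension: columns of X, rows of W.
X0, X1 = [row[:2] for row in X], [row[2:] for row in X]
W0, W1 = W[:2], W[2:]

P0, P1 = matmul(X0, W0), matmul(X1, W1)   # full-sized partial sums, one per "GPU"
Y_per_rank = all_reduce_sum([P0, P1])     # every rank now holds P0 + P1

assert P0 != matmul(X, W) and P1 != matmul(X, W)
assert all(Y == matmul(X, W) for Y in Y_per_rank)
```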

Why the order matters: column, then row

The reason tensor parallelism is fast in practice is that these two splits are designed to be chained: column-parallel feeding row-parallel.

A column-parallel layer emits output that is already sharded along the feature dimension — which is precisely the sharded input a row-parallel layer expects. So when a column layer feeds a row layer, no communication is needed in between. The shards flow straight through, and the all-reduce happens only once, at the end of the row layer.

The activation function forces the same choice. GeLU is non-linear, and crucially:

GeLU(a + b) ≠ GeLU(a) + GeLU(b)

If the up-projection were row-parallel instead, each GPU would hold only a partial sum (a, not a + b) at the moment the activation runs, and applying GeLU to a partial value gives the wrong answer. Getting it right would take an extra all-reduce before the activation. Column-parallel avoids this entirely: it keeps each hidden unit whole on a single device, so GeLU runs locally and is exact. Up-projection goes column-parallel, GeLU runs elementwise with no communication, and only then does the row-parallel down-projection trigger its all-reduce.
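Both halves of this argument can be checked numerically. The sketch below uses small made-up weights and the exact erf-based GeLU. It first runs the MLP column-then-row on two simulated ranks, then shows what goes wrong if GeLU is applied to partial sums.

```python
import math


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply(fn, m):
    return [[fn(v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


X = [[1.0, -2.0], [0.5, 3.0]]                                  # replicated input
W_up = [[1.0, -1.0, 0.5, 2.0], [0.5, 1.0, -1.5, 0.0]]          # 2 -> 4 hidden units
W_down = [[1.0, 0.0], [-1.0, 2.0], [0.5, 1.0], [2.0, -0.5]]    # 4 -> 2

reference = matmul(apply(gelu, matmul(X, W_up)), W_down)

# Column-parallel up, row-parallel down: rank i owns hidden units 2i and 2i+1.
up_shards = [[row[0:2] for row in W_up], [row[2:4] for row in W_up]]
down_shards = [W_down[0:2], W_down[2:4]]
partials = [matmul(apply(gelu, matmul(X, up)), down)
            for up, down in zip(up_shards, down_shards)]
Y = add(partials[0], partials[1])        # the single all-reduce, simulated

# Counter-example: a row-parallel up-projection leaves partial sums before GeLU.
hidden_partials = [matmul([row[i:i + 1] for row in X], W_up[i:i + 1]) for i in range(2)]
wrong_hidden = add(apply(gelu, hidden_partials[0]), apply(gelu, hidden_partials[1]))
right_hidden = apply(gelu, add(hidden_partials[0], hidden_partials[1]))

assert max_abs_diff(Y, reference) < 1e-9               # column-then-row is exact
assert max_abs_diff(wrong_hidden, right_hidden) > 0.1  # GeLU on partials is not
```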

The whole layer, end to end

Put it together and a single tensor-parallel decoder layer looks like this — two matmul pairs, two activation/attention steps that run locally, and exactly two communication points:

input tokens ──► embedding lookup            [replicated, or split along vocab*]
   │
   ▼  x is REPLICATED (identical on every GPU in the TP group)
┌──────────────────────────────────────────────────────────────────┐
│  DECODER LAYER  (repeated N times)                                 │
│                                                                    │
│   LayerNorm ............................. REPLICATED (cheap, no split)
│      │
│      ▼
│   QKV projection ........................ COLUMN-parallel  ← split by heads
│      │                                    (each GPU gets a subset of heads)
│      ▼
│   Attention (softmax over each head) .... runs LOCALLY, no comm
│      │                                    (heads are independent)
│      ▼
│   Output projection ..................... ROW-parallel
│      │
│      ▼
│   ★ ALL-REDUCE #1 ....................... the one comm point for attention
│      │
│   + residual add ........................ REPLICATED
│
│   ── output is REPLICATED again ──
│
│   LayerNorm ............................. REPLICATED
│      │
│      ▼
│   Up-projection (W_up) .................. COLUMN-parallel  ← keeps hidden units whole
│      │
│      ▼
│   GeLU .................................. runs LOCALLY, exact (elementwise)
│      │
│      ▼
│   Down-projection (W_down) .............. ROW-parallel
│      │
│      ▼
│   ★ ALL-REDUCE #2 ....................... the one comm point for the MLP
│      │
│   + residual add ........................ REPLICATED
│                                                                    │
└──────────────────────────────────────────────────────────────────┘
   │  output REPLICATED ──► becomes next layer's replicated input
   ▼
final LayerNorm .......................... REPLICATED
   │
   ▼
LM head / unembedding .................... COLUMN-parallel (split along vocab)
   │                                       → all-gather logits if needed
   ▼
softmax ──► next-token probability

A few things are worth reading off the diagram directly:

The activations x enter and leave every layer replicated — identical on every GPU. The sharding is internal to the layer; the seams between layers carry full tensors.
Within a decoder layer's forward pass, the only places GPUs talk to each other are the two ★ ALL-REDUCE points, both sitting right after a row-parallel projection.
The LM head is column-parallel along the vocabulary, so each GPU produces logits for a slice of tokens. If you need the full distribution on one device, you all-gather the logit slices before the final softmax.
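The last point, gathering vocabulary slices before the softmax, can be sketched with a toy six-token vocabulary split over two simulated ranks. The weights are made up, and the all-gather is just a list concatenation here.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [[0.5, -1.0, 2.0]]                  # one token, replicated on every rank
W_vocab = [
    [1.0, 0.0, -1.0, 0.5, 2.0, 0.0],
    [0.5, 1.0, 0.0, -0.5, 1.0, 2.0],
    [0.0, -1.0, 1.0, 1.0, 0.5, -0.5],
]                                            # 3 features -> 6 vocabulary entries

# Column-parallel along the vocabulary: rank 0 owns tokens 0-2, rank 1 owns 3-5.
shards = [[row[0:3] for row in W_vocab], [row[3:6] for row in W_vocab]]
logit_slices = [matmul(hidden, shard) for shard in shards]

# Simulated all-gather: concatenate the slices along the vocabulary dimension.
logits = [a + b for a, b in zip(*logit_slices)]
probs = softmax(logits[0])

expected = softmax(matmul(hidden, W_vocab)[0])
assert all(abs(p - q) < 1e-12 for p, q in zip(probs, expected))
```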

Counting the communication

This is the number that decides whether tensor parallelism is worth it on your hardware, because all-reduce is bounded by the interconnect (NVLink, or worse, PCIe).

Inference: two all-reduces per layer — one after attention's output projection, one after the MLP's down-projection.
Training: double it. The backward pass mirrors the forward, adding two more all-reduces per layer, this time on the activation gradients flowing back into each column-parallel projection, for four per layer total. On top of that, the weight gradients and optimizer state are themselves sharded across the devices, which is part of why tensor parallelism pairs so naturally with the rest of a distributed training stack.
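For a rough feel of what those counts cost, the sketch below turns them into bytes sent per GPU. It assumes a ring all-reduce, where each GPU sends 2 × (p − 1) / p times the message size. Real collective libraries may pick other algorithms. The model shape in the example (hidden size 8192, 80 layers, 16-bit activations) is hypothetical.

```python
def all_reduces_per_layer(training):
    """Two in the forward pass; training adds two more in the backward pass."""
    return 4 if training else 2


def ring_all_reduce_bytes_per_gpu(message_bytes, tp_degree):
    """Bytes each GPU sends in a ring all-reduce (assumed algorithm)."""
    return 2 * (tp_degree - 1) / tp_degree * message_bytes


def comm_bytes_per_pass(tokens, hidden, layers, tp_degree,
                        bytes_per_elem=2, training=False):
    """Total bytes one GPU sends across all layers for one batch of tokens."""
    message = tokens * hidden * bytes_per_elem   # one full activation tensor
    per_layer = (all_reduces_per_layer(training)
                 * ring_all_reduce_bytes_per_gpu(message, tp_degree))
    return layers * per_layer


# Hypothetical shape: decoding a single token on 8 GPUs.
inference = comm_bytes_per_pass(tokens=1, hidden=8192, layers=80, tp_degree=8)
training = comm_bytes_per_pass(tokens=1, hidden=8192, layers=80, tp_degree=8,
                               training=True)
print(f"inference: {inference / 1e6:.2f} MB per token per GPU")
print(f"training:  {training / 1e6:.2f} MB per token per GPU")
```

The byte count is only half the story: during decoding each all-reduce is small, so per-call latency on the interconnect tends to matter as much as bandwidth.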

That is the whole trick. Cut the big matrices the right way, chain column into row so the shards line up, keep the non-linearities on whole values, and pay for it with two all-reduces per layer. Everything else in the layer just rides along, replicated and communication-free.

*The embedding table can be replicated or split along the vocabulary dimension, the same way the LM head is — the two often share weights.