2026-06-22 · 11 min read

Why All-Reduce Is Reduce-Scatter + All-Gather

All-reduce is two collectives glued together. Tracing real numbers through four cores to show why the cost is 2(N-1), not (N-1) — one N-1 to collapse the partials, a second to broadcast the sum back out.

All-ReduceCollective CommunicationDistributed TrainingGPU InfrastructureRing All-Reduce

Most explanations of all-reduce hand you a diagram of arrows going around a ring and a formula that says the cost is 2(N-1) steps. Then they move on. Nobody tells you why it is 2(N-1) and not (N-1). If you stare at the ring picture you can convince yourself a single lap around the ring is N-1 hops, so where does the second factor of 2 come from? Are we just going around twice for no reason?

The factor of 2 is not arbitrary and it is not a doubling for safety. It falls directly out of a structural fact: all-reduce is two different collectives glued together. The first is reduce-scatter, which costs N-1 steps. The second is all-gather, which also costs N-1 steps. Add them and you get 2(N-1). The whole blog is going to make that decomposition concrete by tracing real numbers through both halves, so that by the end the 2 is something you can point at rather than something you memorized.

I am going to use four cores the entire way through, with real partial sums and distinct values, because zeros and ones are impossible to track when the whole point is watching numbers move.

What all-reduce has to produce, and where the partial sums actually come from

Most all-reduce posts start by asserting some partial vectors out of thin air. That hides the one thing worth seeing: the partials are not given, they are computed, and they come out of a row-parallel matmul. So in this blog we start one step earlier, from the matmul itself.

Take a single linear layer, Y = X @ W, for one token. X is 1 x 4 and W is 4 x 4:

X = [ 3  2  1  3 ]        W = [ 5  4  3  2 ]   row 0
                              [ 1  4  3  2 ]   row 1
                              [ 1  3  4  2 ]   row 2
                              [ 4  2  1  5 ]   row 3

The true answer, computed normally, is one 1 x 4 vector:

Y = X @ W = [ 30  29  22  27 ]

Hold onto [30, 29, 22, 27]. That is the target every core must end up with.

Now split this across four cores along the contraction dimension. This is a row-parallel split: W is cut along its rows, and X is cut along its matching columns. Core i gets column i of X (a single scalar) and row i of W (a length-4 vector):

Core 0:  X_0 = 3    W_0 = [ 5  4  3  2 ]
Core 1:  X_1 = 2    W_1 = [ 1  4  3  2 ]
Core 2:  X_2 = 1    W_2 = [ 1  3  4  2 ]
Core 3:  X_3 = 3    W_3 = [ 4  2  1  5 ]

Each core computes X_i @ W_i, which is just the scalar times the row. The result on every core is a full 1 x 4 vector, the same shape as the final answer, but each one is only a partial sum:

Core 0:  3 * [ 5  4  3  2 ]  =  [ 15  12   9   6 ]
Core 1:  2 * [ 1  4  3  2 ]  =  [  2   8   6   4 ]
Core 2:  1 * [ 1  3  4  2 ]  =  [  1   3   4   2 ]
Core 3:  3 * [ 4  2  1  5 ]  =  [ 12   6   3  15 ]

These four vectors are where every later number in this post comes from. They are computed, not asserted. Verify the target by summing them down the columns:

position 0:  15 +  2 +  1 + 12 = 30
position 1:  12 +  8 +  3 +  6 = 29
position 2:   9 +  6 +  4 +  3 = 22
position 3:   6 +  4 +  2 + 15 = 27

Target: [30, 29, 22, 27]    # accurate

That matches X @ W exactly, which confirms the split is correct. The partials are right, none of them is the answer alone, and now all-reduce has a precise job: take these four partial vectors, add them, and leave every core holding the identical [30, 29, 22, 27].

That word "every" is the whole reason the cost is 2(N-1) and not N-1. Keep watching for it.

The reasoning that makes (N-1) tempting and wrong

Here is the trap. You might say: there are four partial vectors, addition over four items is three pairwise adds, three is N-1, so the cost is N-1. That counts the additions correctly and ignores the part that actually costs anything.

The additions are free. The expensive resource is the link between cores. The real question is never how many adds, it is how many times the data crosses a wire. And the postcondition forces data across wires in two separate directions.

Direction one: the partials have to come together to be summed. Data flows inward and collapses. That is a reduction.

Direction two: the finished total has to come back out to every core, because the postcondition demands every core hold it. Data flows outward and replicates. That is a broadcast.

You cannot skip direction two. Even with a magic free reduction that left the full sum on one core instantly, you would still owe N-1 steps to get that result back out to the other N-1 cores. Reduction inward, broadcast outward. Two directions, two costs. That is the seed of the 2.

Ring all-reduce is the bottleneck-free version of exactly these two movements. It calls direction one reduce-scatter and direction two all-gather. Trace both.

Phase one: reduce-scatter, and where its N-1 comes from

Arrange the four cores in a logical ring, 0 to 1 to 2 to 3 to 0. Every core sends only to its right neighbor and receives only from its left neighbor.

Chop each core's length-4 vector into 4 chunks of one value each. Here is the starting grid. Rows are cores, columns are chunks, and every cell is one of the partial values we just computed above:

          chunk0  chunk1  chunk2  chunk3
Core 0  [  15      12       9       6  ]
Core 1  [   2       8       6       4  ]
Core 2  [   1       3       4       2  ]
Core 3  [  12       6       3      15  ]

The goal of reduce-scatter is narrow. Not to give everyone the full sum, but to leave each core holding exactly one chunk that is fully summed across all four cores, with the four finished chunks scattered one per core. That weaker goal is what makes it cheap.

The send schedule is the part people get wrong, so here it is exactly. On step t, core i sends the chunk with index (i - t) mod 4 to its right neighbor, and adds the chunk arriving from its left into its own matching slot. The chunk index walks backward as t grows. That is deliberate. It makes each chunk accumulate contributions from consecutive cores as it travels forward around the ring.

Step 1, t = 0. Core i sends chunk (i - 0) mod 4 = chunk i.

Core 0 sends chunk0 (=15) to Core 1
Core 1 sends chunk1 (=8)  to Core 2
Core 2 sends chunk2 (=4)  to Core 3
Core 3 sends chunk3 (=15) to Core 0

Each receiver adds the arriving chunk into its own same-index slot:

Core 1 chunk0:  2 + 15 = 17
Core 2 chunk1:  3 +  8 = 11
Core 3 chunk2:  3 +  4 =  7
Core 0 chunk3:  6 + 15 = 21

Grid after step 1, changed cells starred (*):

          chunk0   chunk1   chunk2   chunk3
Core 0  [  15       12        9      *21  ]
Core 1  [ *17        8        6        4  ]
Core 2  [   1      *11        4        2  ]
Core 3  [  12        6       *7       15  ]

Step 2, t = 1. Core i sends chunk (i - 1) mod 4. So core 0 sends chunk3, core 1 sends chunk0, core 2 sends chunk1, core 3 sends chunk2. Each core forwards the chunk it just accumulated, so the running partial keeps moving.

Core 0 sends chunk3 (=21) to Core 1
Core 1 sends chunk0 (=17) to Core 2
Core 2 sends chunk1 (=11) to Core 3
Core 3 sends chunk2 (=7)  to Core 0

Receivers add:

Core 1 chunk3:  4 + 21 = 25
Core 2 chunk0:  1 + 17 = 18
Core 3 chunk1:  6 + 11 = 17
Core 0 chunk2:  9 +  7 = 16

Grid after step 2:

          chunk0   chunk1   chunk2   chunk3
Core 0  [  15       12      *16       21  ]
Core 1  [  17        8        6      *25  ]
Core 2  [ *18       11        4        2  ]
Core 3  [  12      *17        7       15  ]

Step 3, t = 2. Core i sends chunk (i - 2) mod 4. Core 0 sends chunk2, core 1 sends chunk3, core 2 sends chunk0, core 3 sends chunk1.

Core 0 sends chunk2 (=16) to Core 1
Core 1 sends chunk3 (=25) to Core 2
Core 2 sends chunk0 (=18) to Core 3
Core 3 sends chunk1 (=17) to Core 0

Receivers add:

Core 1 chunk2:  6 + 16 = 22
Core 2 chunk3:  2 + 25 = 27
Core 3 chunk0: 12 + 18 = 30
Core 0 chunk1: 12 + 17 = 29

Grid after step 3:

          chunk0   chunk1   chunk2   chunk3
Core 0  [  15      *29       16       21  ]
Core 1  [  17        8      *22       25  ]
Core 2  [  18       11        4      *27  ]
Core 3  [ *30       17        7       15  ]

Read the starred cells. Each one is a fully summed value from the target [30, 29, 22, 27]:

Core 3 owns chunk0 = 30   (correct)
Core 0 owns chunk1 = 29   (correct)
Core 1 owns chunk2 = 22   (correct)
Core 2 owns chunk3 = 27   (correct)

That took exactly 3 steps, which is N-1 for N=4. The complete answer now exists in the system but it is shredded: each core holds one quarter of it, and three quarters of every core's vector is still stale partial data. No core can answer the query yet. This is the precise meaning of reduce-scatter. The reduction is done, the result is scattered, and the operation is only half finished.

Why N-1 and not N? A chunk must visit N cores to collect N contributions, and it starts already sitting on one of them. It needs N-1 hops to reach the other N-1. One free starting position, N-1 paid moves. That is the first N-1.

Phase two: all-gather, and the second N-1

Now the postcondition bites. Every core needs the full [30, 29, 22, 27], but each core holds only one correct chunk. Those correct chunks have to be copied out to everybody. This is all-gather: same ring, same neighbor-only sends, but now there is no addition. Each core forwards a finished chunk, the receiver stores it in the right slot, and the finished chunks circulate until everyone has all four.

Starting state, only the finished chunks shown, dots are slots not yet holding final values:

          chunk0   chunk1   chunk2   chunk3
Core 0  [   .        29        .        .  ]
Core 1  [   .         .       22        .  ]
Core 2  [   .         .        .       27  ]
Core 3  [  30         .        .        .  ]

The rule mirrors phase one. On each step, core i sends the finished chunk it currently owns to its right neighbor, store on arrival, no add. The owned chunk index walks forward by one each step.

Step 1. Each core sends its owned correct chunk right.

Core 3 sends chunk0 (=30) to Core 0
Core 0 sends chunk1 (=29) to Core 1
Core 1 sends chunk2 (=22) to Core 2
Core 2 sends chunk3 (=27) to Core 3

Grid after step 1:

          chunk0   chunk1   chunk2   chunk3
Core 0  [  30        29        .        .  ]
Core 1  [   .        29       22        .  ]
Core 2  [   .         .       22       27  ]
Core 3  [  30         .        .       27  ]

Step 2. Forward the chunk each core just received.

Core 0 sends chunk0 (=30) to Core 1
Core 1 sends chunk1 (=29) to Core 2
Core 2 sends chunk2 (=22) to Core 3
Core 3 sends chunk3 (=27) to Core 0

Grid after step 2:

          chunk0   chunk1   chunk2   chunk3
Core 0  [  30        29        .       27  ]
Core 1  [  30        29       22        .  ]
Core 2  [   .        29       22       27  ]
Core 3  [  30         .       22       27  ]

Step 3. Forward once more.

Core 0 sends chunk3 (=27) to Core 1
Core 1 sends chunk0 (=30) to Core 2
Core 2 sends chunk1 (=29) to Core 3
Core 3 sends chunk2 (=22) to Core 0

Grid after step 3:

          chunk0   chunk1   chunk2   chunk3
Core 0  [  30        29       22       27  ]
Core 1  [  30        29       22       27  ]
Core 2  [  30        29       22       27  ]
Core 3  [  30        29       22       27  ]

Every core now holds [30, 29, 22, 27]. The postcondition is satisfied. That took exactly 3 steps, again N-1, for the same reason as before: each finished chunk lives on one core and has to reach the other N-1, so N-1 forwarding hops.

Adding the two halves

The derivation is now arithmetic on the two phase costs:

reduce-scatter:  N-1 steps   (reduction inward, sum gets scattered)
all-gather:      N-1 steps   (broadcast outward, chunks get gathered)

all-reduce:      (N-1) + (N-1) = 2(N-1) steps

For N=4 that is 3 + 3 = 6 steps, exactly what we traced. The 2 is not a fudge factor and not a redundant second lap. It is one N-1 for collapsing the partials into a scattered sum, and a second N-1 for spreading that sum back out to satisfy the every-core postcondition. Two structurally distinct jobs, two costs, summed.

If anyone sells you N-1 as the cost of all-reduce, they have silently dropped the broadcast. N-1 is the honest cost of reduce-scatter alone, or of all-gather alone, or of a one-directional reduce that leaves the answer on a single core. The instant you require every core to hold the result, you have bought the second N-1, and the cost is 2(N-1).

The same split explains the bandwidth story

The decomposition explains why ring all-reduce is loved for large tensors, not just the step count. Count bytes per link instead of steps.

Let the full tensor be size S. In reduce-scatter, each core sends one chunk of size S/N per step for N-1 steps, so it pushes (N-1) S/N bytes across its link. All-gather is identical traffic, another (N-1) S/N bytes. Total bytes any single link carries:

2 * (N-1) * S/N  =  2S * (N-1)/N

As N grows, (N-1)/N approaches 1, so per-link traffic flattens at about 2S no matter how many cores you add. The 2 here is the same 2 as in the step count, for the same reason: one S of traffic to reduce-scatter, one S to all-gather. The decomposition that explains the latency cost also explains the bandwidth cost. That is not a coincidence, it is the same two-phase structure read through two different meters.

The one sentence to keep

All-reduce equals reduce-scatter plus all-gather. Reduce-scatter costs N-1, all-gather costs N-1, and the postcondition that every core must hold the result is what forces both, so the cost is 2(N-1) and can never be N-1. Memorize the decomposition, not the formula, and the formula rebuilds itself every time.