Vectorizing Reductions: from JDK9 to JDK26 and beyond
There are many examples of reductions: computing the sum, product, or min over some list of numbers,
the dot product (sum of element-wise products),
or even the polynomial hash over a string.
I have heard that the term “reduction” comes from “dimensionality reduction”: crunching down a higher-dimensional collection of numbers to a lower-dimensional one. We may, for example, reduce a 2-dimensional matrix to a 1-dimensional vector. In this post, we are interested in reducing 1-dimensional collections (list, vector, array) to a 0-dimensional scalar.
Examples
We may want to take the sum of the elements in an array:
int sum = 0;
for (int i = 0; i < a.length; i++) {
    sum += a[i];
}
Similarly, we can compute the dot-product of two arrays:
int sum = 0;
for (int i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
}
Parallel Reductions
Computing such a reduction looks like a very sequential task: we can only compute the next addition if all the previous additions are complete. We get a very long sequential chain of reduction operations, and this chain completely determines the latency of our computation. However, some reductions can be computed in parallel.
If we have N threads, we can chop our array into N chunks, and each thread computes the sum over its chunk.
Now, we only need to sum up the N partial sums to get the total sum.
This can speed up our computation by a factor of up to N, since the partial sums from the chunks can be computed in parallel,
and we only have some (hopefully negligible) coordination effort at the beginning (chunking) and end (reduce partial sums to total sum).
Here are some examples using Java Parallel Streams that do some interesting numerical work.
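As an illustration, here is a minimal sketch of the chunking idea (the class and method names are mine, and the chunking is deliberately simple; Arrays.stream(a).parallel().sum() would achieve much the same with less code):

import java.util.stream.IntStream;

public class ParallelSum {
    // Sum over N chunks: each chunk is summed independently (potentially on a
    // different thread), and the partial sums are then combined into the total.
    static long parallelSum(int[] a, int chunks) {
        int chunkSize = (a.length + chunks - 1) / chunks;
        return IntStream.range(0, chunks).parallel()
            .mapToLong(c -> {
                int from = c * chunkSize;
                int to = Math.min(from + chunkSize, a.length);
                long partial = 0;
                for (int i = from; i < to; i++) {
                    partial += a[i]; // partial sum over this chunk
                }
                return partial;
            })
            .sum(); // reduce the N partial sums to the total sum
    }
}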
However, there is a necessary condition for parallelizing reductions: the reduction operator must be associative, and as we will see, it also needs to be commutative for some optimizations. At least, that is, if we want to get exactly the same results whether we compute sequentially or in parallel. As a reminder:
- Associative: (x + y) + z = x + (y + z)
- Commutative: x + y = y + x
Or simply put: we must be able to re-order the reduction. In our example with N=2 threads, we have done the following transformation:
- Sequential: ((x0 + x1) + x2) + x3
- Parallel: (x0 + x1) + (x2 + x3)
There are a lot of operators that satisfy these properties, for example integer addition/multiplication, min/max, and logical and/or/xor. However, there are some that do not, for example floating-point addition and multiplication. The reason is rounding error:
Take a = 1e30f, b = -1e30f, and c = 1e-30f:
- a + b + c = 1e-30f: a + b = 0 without any rounding, and 0 + c = 1e-30f without any rounding.
- a + c + b = 0: a + c gets rounded back to a because c is too small relative to a, and then a + b = 0.
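A minimal sketch to reproduce this on your own machine (the class name is mine):

public class FloatAssociativity {
    public static void main(String[] args) {
        float a = 1e30f, b = -1e30f, c = 1e-30f;
        // (a + b) + c: a and b cancel exactly, and adding c is exact.
        System.out.println((a + b) + c); // prints 1.0E-30
        // (a + c) + b: c is rounded away because it is too small relative to a,
        // so a + c == a, and adding b then cancels down to zero.
        System.out.println((a + c) + b); // prints 0.0
    }
}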
That does not stop us from pretending that the operation is associative and commutative; we may just have to live with the rounding errors (see DoubleStream.sum or clang -ffast-math).
Vectorizing Reductions
While we can get speedups for reductions using multiple threads, we can also use our CPU’s SIMD vector registers, which open parallelism within a single thread.
Let’s look at the sum over an array of ints. Below, I show the
C2 intermediate representation of the loop. We initialize our sum variable with 0, and in each iteration,
we add a new value v (the i’th array element) to our sum.
But this is a completely sequential implementation of our sum, and the execution time would be dominated
by the latency of the sequential reduction chain.
In an attempt to optimize, we could unroll the loop (e.g. by a factor of 4):
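In Java terms, the unrolled loop would look roughly like this sketch (the method name, and the assumption that a.length is a multiple of 4, are mine):

static int sumUnrolled4(int[] a) {
    int sum = 0;
    for (int i = 0; i < a.length; i += 4) {
        // The four loads v0..v3 are independent of each other ...
        int v0 = a[i];
        int v1 = a[i + 1];
        int v2 = a[i + 2];
        int v3 = a[i + 3];
        // ... but the additions into sum still form one sequential chain.
        sum = sum + v0 + v1 + v2 + v3;
    }
    return sum;
}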
However, so far we still have a sequential reduction chain.
We can probably vectorize the values [v0, v1, v2, v3], since they can be computed in parallel (element-wise).
But what can we do about the additions of the reduction?
They are decidedly not parallel, and clearly their operations would cross lanes!
A naive implementation would be to load our v values with a vector load operation LoadV,
and then sequentially extract every element from the vector and add up the values in their original order.
This has one benefit: we did not have to reorder the reduction, so this approach can be used with any
operator (including floating-point addition and multiplication). But the downsides are quite severe:
rather than decreasing the number of instructions, we have increased it (n additions vs n additions plus n extracts).
And we still have the same latency issue with the sequential reduction chain.
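For illustration, here is roughly what this naive strategy looks like when spelled out by hand with the Vector API (discussed later in this post; run with --add-modules jdk.incubator.vector). This is only a sketch of the idea, not the IR that C2 actually generates, and the class and method names are mine:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class NaiveExtractSum {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Naive strategy: one vector load (LoadV), then extract each lane and add
    // it to sum in the original order. No reordering of the reduction, but we
    // pay for a lane extract per element on top of the additions.
    static float sum(float[] a) {
        float sum = 0f;
        int i = 0;
        for (; i <= a.length - SPECIES.length(); i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, a, i);
            for (int lane = 0; lane < SPECIES.length(); lane++) {
                sum += v.lane(lane); // sequential extract and add
            }
        }
        for (; i < a.length; i++) {
            sum += a[i]; // scalar tail
        }
        return sum;
    }
}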
A more performant approach is the recursive folding reduction: we load our values into a single vector register,
and then split into two halves and add the two halves element-wise. This gives us a new register with
only half the number of elements. We repeat the split-and-element-wise-add approach, reducing the number
of elements in every step, until we have a single element left.
This allows us to reduce the n-element vector to its partial sum in log(n) steps.
Our scalar loop variable sum now only needs a single addition for each partial sum.
This helps us with our latency problem: we now only have 1/n‘th as many addition operations
on the critical latency path of the sum variable.
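A sketch of the folding idea in plain Java, modeling an 8-element vector register as an array (the method name and the fixed width of 8 are mine):

static int foldReduce8(int[] v) {
    // Recursive folding: repeatedly add the two halves element-wise, halving
    // the number of partial sums each step, so 8 elements are reduced to one
    // in log2(8) = 3 steps.
    int[] cur = v.clone(); // assumes v.length == 8
    for (int half = cur.length / 2; half >= 1; half /= 2) {
        int[] next = new int[half];
        for (int i = 0; i < half; i++) {
            next[i] = cur[i] + cur[i + half]; // element-wise add of the two halves
        }
        cur = next;
    }
    return cur[0]; // the total over all 8 input elements
}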
In JDK9, reductions were first auto vectorized (JDK-8074981). For the operations that permit it, recursive folding was used. For the other operations (float/double add/mul), the sequential reduction was used, where all elements have to be extracted from the vector. But once the patch was integrated, it was discovered that there were some speedups, but sadly also some performance regressions:
In the plot above, we can see the performance numbers for add and mul reductions on int, long, float and double arrays.
Any value below 1 indicates a speedup from vectorization, and any value above 1 indicates a slowdown
(scalar would have been faster).
Still in JDK9, it was therefore determined that “simple” reductions and 2-element vector reductions are not always profitable,
and their vectorization was disabled (JDK-8078563).
Any further analysis was postponed at this time, and not taken up for a long time.
Further Optimizing Reduction Vectorization
In JDK21, I was able to optimize some vectorized reductions, by moving the expensive cross-lane
reduction after the loop, and using lane-wise vector accumulators inside the loop
(JDK-8302652).
The idea is to use a vector-accumulator instead of a scalar sum variable.
We accumulate partial sums in the elements, and only after the loop do we reduce the partial
sums of the elements down to the total sum.
This means that the expensive cross-lane reduction operation can be moved outside the loop
and is only executed once, and inside the loop we can simply use element-wise operations.
Below, we see how this works on an add-reduction. The left illustrates the vectorized reduction
with the AddReductionV (reduction across lanes using recursive folding) inside the loop.
The AddReductionV is implemented with many assembly instructions, which makes it expensive/slow.
The right illustrates the optimized vectorized reduction. Instead of a scalar sum variable,
we have a vectorized accV register that holds the partial sums. We initialize the vector
to all zeros, and element-wise add (AddV) the vector v of new values to our accV.
Once the loop has completed, the accV needs to be reduced down to a single value
using AddReductionV only once. This makes the execution of the loop much faster,
because instead of executing many instructions per iteration for an AddReductionV,
we only have a single assembly instruction for AddV.
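The same idea, expressed by hand with the Vector API purely for illustration (C2 applies this transformation automatically to plain scalar loops; the class and method names are mine, and the code needs --add-modules jdk.incubator.vector):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class AccumulatorSum {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    static int sum(int[] a) {
        IntVector accV = IntVector.zero(SPECIES); // vector accumulator, all zeros
        int i = 0;
        for (; i <= a.length - SPECIES.length(); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            accV = accV.add(v); // element-wise AddV: per-lane partial sums
        }
        // The expensive cross-lane reduction happens only once, after the loop.
        int sum = accV.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            sum += a[i]; // scalar tail
        }
        return sum;
    }
}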
JDK26: Finally Vectorizing Simple Reductions
With JDK26, C2 is finally auto vectorizing simple reductions by default, using a cost-model that ensures we only
vectorize reductions when it is profitable (JDK-8340093).
The cost-model adds up the cost of every IR node inside the loop. In the cases where we were able
to move the AddReductionV outside the loop, we do not have to count its cost (it would be high!),
and instead only count the cost of the much cheaper AddV.
As a consequence, we can now expect really good performance for reductions with the primitive types
int, long, float and double, and the operators min, max, and, or and xor.
And for add and mul, we get fast performance for int and long.
But for float and double we cannot make add and mul faster than the scalar performance,
because we cannot reorder the reduction due to rounding issues.
There is still a small caveat: the input values to the reduction must also be auto vectorizable,
and that currently only works if we can perform consecutive loads from an array or MemorySegment.
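For example, a simple min reduction over an int array, fed by consecutive loads, is the kind of loop that should now be auto vectorized (the method name is mine):

static int min(int[] a) {
    int min = Integer.MAX_VALUE;
    for (int i = 0; i < a.length; i++) {
        // min is associative and commutative, so the reduction can be
        // reordered and vectorized.
        min = Math.min(min, a[i]);
    }
    return min;
}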
TODO: performance results
Vectorized Reductions in the Vector API
Since JDK16, the Vector API has been available as an incubator module.
The reduceLanes
operation allows us to reduce a vector to a scalar.
For most operations, the recursive folding approach is used, even for floating-point reductions
(see reduceLanes).
The implication for add and mul reductions over float and double is that the order of reduction
is explicitly not specified, and the result may be computed with a different order each time,
delivering different rounding results for different invocations.
For example, if executed with the interpreter, the reduction order may be sequential, but when compiled
the order could be a recursive folding order.
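A minimal sketch of a float add-reduction using reduceLanes (the class and method names are mine; run with --add-modules jdk.incubator.vector):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class ReduceLanesSum {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float sum(float[] a) {
        float sum = 0f;
        int i = 0;
        for (; i <= a.length - SPECIES.length(); i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, a, i);
            // For ADD on float, the reduction order inside reduceLanes is not
            // specified, so rounding may differ between invocations.
            sum += v.reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) {
            sum += a[i]; // scalar tail
        }
        return sum;
    }
}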
TODO: code and performance results.
Links
In JDK9, reductions were first auto vectorized (JDK-8074981). But immediately, some performance regressions were discovered, and so auto vectorization for simple reductions and 2-element-vector reductions were disabled (JDK-8078563).
In JDK21, I was able to optimize some vectorized reductions, by moving the expensive cross-lane reduction after the loop, and using lane-wise vector accumulators inside the loop (JDK-8302652).
With JDK26, C2 is finally vectorizing simple reductions by default, using a cost-model that ensures we only vectorize reductions when it is profitable (JDK-8340093).
YouTube video that covers the same material as this blog post: Section on Reductions in JVMLS 2025 Talk