In this blog post, we will look at some very simple algorithms like fill, copy, map and iota, and see how we can vectorize them to get speedups. I will assume that you already have some basic understanding of automatic vectorization, vectorized intrinsics and the Vector API, otherwise please read this blog post first.

For each of the examples, we will follow these steps:

  • Reference Implementation: the simplest implementation, just a plain loop. It gives us a ground truth to test the other implementations against, as well as a performance baseline. We should also check whether this simple implementation is already auto vectorized - it may give us really good performance for free.
  • Core library method: see if we can find one with a vectorized intrinsic.
  • Vector API: only now do we invest the additional time it takes to rewrite the algorithm using the Vector API. In these very simple cases, we will probably not get any speedup, because the reference implementation is already auto vectorized, or the vectorized intrinsic is at least as performant.
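
As a minimal sketch of that workflow for the fill example (the class and method names here are mine, not from the OpenJDK benchmarks), the reference implementation doubles as a correctness oracle for the alternatives:

```java
import java.util.Arrays;

class FillCheck {
    // Reference implementation: ground truth and performance baseline.
    static int[] fillReference(int n, int v) {
        int[] r = new int[n];
        for (int i = 0; i < r.length; i++) {
            r[i] = v;
        }
        return r;
    }

    // Core library alternative, backed by a vectorized intrinsic.
    static int[] fillLibrary(int n, int v) {
        int[] r = new int[n];
        Arrays.fill(r, v);
        return r;
    }

    public static void main(String[] args) {
        // Every alternative must produce exactly the reference result.
        if (!Arrays.equals(fillReference(1000, 42), fillLibrary(1000, 42))) {
            throw new AssertionError("implementations disagree");
        }
        System.out.println("ok");
    }
}
```

The same comparison then gets repeated for the Vector API variant, and for the actual performance measurements we of course use JMH benchmarks rather than a main method.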

Note: all of the benchmarks are integrated in the OpenJDK GitHub repository. Below, I slightly simplify some of the code to make it easier to read.

It is also important to note that I present performance numbers for two different machines. The results differ significantly between them, and may differ again on your machine. I ran all of these benchmarks on JDK 26.

Algorithm 1: Fill

We fill every element in the int array r with the value 42.

Reference implementation (special case: detected as fill loop, and replaced with call to vectorized fill intrinsic):

for (int i = 0; i < r.length; i++) {
    r[i] = 42;
}

Core library method (backed by vectorized intrinsic):

Arrays.fill(r, 42);

Vector API implementation:

var v = IntVector.broadcast(SPECIES_I, 42);
int i = 0;
for (; i < SPECIES_I.loopBound(r.length); i += SPECIES_I.length()) {
    v.intoArray(r, i);
}
for (; i < r.length; i++) {
    r[i] = 42;
}

We broadcast the number 42 to all elements of the vector, and can then fill multiple elements of the array with a single vector store (intoArray). We have to use two loops: the length of the array r may not be evenly divisible by the length of the vector v. We use loopBound to determine up to where we can use vectors, and handle the remaining iterations in the scalar cleanup loop.

Writing two loops is cumbersome. You could instead use VectorMask.indexInRange to generate a mask that is always on except when you are about to step over the loop bound. But always running with masked operations is expensive. For now, you get the best performance with two loops, though in the future we might find ways to optimize VectorMask.indexInRange for such loops.
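
For illustration, such a single-loop masked variant might look like this (my own sketch, not from the benchmarks; it needs --add-modules jdk.incubator.vector, and as discussed above it will generally be slower than the two-loop version):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

class MaskedFill {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    static void fill(int[] r, int value) {
        IntVector v = IntVector.broadcast(SPECIES, value);
        for (int i = 0; i < r.length; i += SPECIES.length()) {
            // All lanes are on, except in the last iteration, where the
            // lanes that would step past r.length are switched off.
            VectorMask<Integer> m = SPECIES.indexInRange(i, r.length);
            v.intoArray(r, i, m);
        }
    }
}
```

One loop instead of two, but every store now pays the cost of the masked operation.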

Running the benchmark on an array with 10000 elements on my x64 AVX512 laptop, and an aarch64 NEON OCI machine:

fill performance results 10000 elements

Comment: the compiler detects that the reference implementation loop is an array fill operation, and automatically replaces it with a call to the intrinsic! This is a bit of a special case - usually the JVM only uses intrinsics for core library methods, not for hand-written Java loops. We can disable the intrinsic for both the loop and Arrays.fill with the VM flag -XX:-OptimizeFill; with the intrinsic disabled, the auto vectorizer kicks in instead. We can additionally disable the auto vectorizer with the VM flag -XX:-UseSuperWord, which leaves us with pure scalar performance.

Observations:

  • Vectorization always beats scalar performance, but the exact speedup varies.
  • For fill and this input size, it does not seem to make a difference if we use the intrinsic or auto vectorize.
  • The Vector API implementation is sensitive to alignment on AVX512: we get a bimodal performance distribution - if the first element of the array is cacheline aligned we get the best performance, but if it is not aligned we get significantly worse performance. Array alignment is essentially random, and so sometimes you get good performance and sometimes not. Auto vectorization and vectorized intrinsics are not sensitive to alignment, because they have an automatic alignment feature. Read more about alignment here. Recommendation for benchmarking: make sure you run the benchmark multiple times, with new allocations of the arrays, so you get the average performance.

Running the same experiment with a smaller array with only 300 elements:

fill performance results 300

The general trends are the same, but with a smaller number of loop iterations the speedup is a little less strong. On AVX512, the intrinsic seems to perform a bit better than the auto vectorizer.

Algorithm 2: Copy

We copy data from array a to array r.

We assume that the two arrays are distinct. Read this blog post about aliasing, which discusses what we have to do if the arrays might be the same and the memory regions we copy between might overlap. There have been some JDK 26 improvements in the auto vectorizer to enable the optimization of such possibly aliasing cases.
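
A tiny example (mine, not from the post's benchmarks) of why overlap matters: a naive forward copy within the same array reads elements it has already overwritten, while System.arraycopy behaves as if it copied through a temporary buffer:

```java
class OverlapDemo {
    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        // Naive forward copy of a[0..3] into a[1..4], within the same array:
        for (int i = 1; i < a.length; i++) {
            a[i] = a[i - 1];
        }
        // a is now {1, 1, 1, 1, 1}: each element was overwritten before it was read.

        int[] b = {1, 2, 3, 4, 5};
        // System.arraycopy handles the overlap correctly:
        System.arraycopy(b, 0, b, 1, 4);
        // b is now {1, 1, 2, 3, 4}

        System.out.println(java.util.Arrays.toString(a));
        System.out.println(java.util.Arrays.toString(b));
    }
}
```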

Reference implementation (should be auto vectorized):

for (int i = 0; i < r.length; i++) {
    r[i] = a[i];
}

Core library method (backed by vectorized intrinsic):

System.arraycopy(a, 0, r, 0, a.length);

Vector API implementation:

int i = 0;
for (; i < SPECIES_I.loopBound(r.length); i += SPECIES_I.length()) {
    IntVector v = IntVector.fromArray(SPECIES_I, a, i);
    v.intoArray(r, i);
}
for (; i < r.length; i++) {
    r[i] = a[i];
}

Running the benchmark on an array with 10000 elements on my x64 AVX512 laptop, and an aarch64 NEON OCI machine:

copy performance 10000

Running the same experiment with a smaller array with only 300 elements:

copy performance 300

The gains from vectorization are significant. The difference between the vectorized alternatives is only marginal - at least for the sizes of arrays we chose here.

Algorithm 3: Map

This is a generalization of copy: we apply some element-wise operation to every element of the input array a, and store the results in the output array r. Here, we multiply by 42, but there is a large range of operators and expressions that can be vectorized. If you find a shape that is not auto vectorized, please report it to us!
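
The general shape can be sketched with a functional interface (a hypothetical helper of mine, not from the benchmarks; whether the loop body then auto vectorizes depends on the JIT fully inlining the operator):

```java
import java.util.function.IntUnaryOperator;

class MapDemo {
    // Element-wise map: r[i] = f(a[i]) for all i.
    static void map(int[] a, int[] r, IntUnaryOperator f) {
        for (int i = 0; i < r.length; i++) {
            r[i] = f.applyAsInt(a[i]);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] r = new int[3];
        map(a, r, x -> x * 42);  // r becomes {42, 84, 126}
        System.out.println(java.util.Arrays.toString(r));
    }
}
```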

Reference implementation (should be auto vectorized):

for (int i = 0; i < r.length; i++) {
    r[i] = a[i] * 42;
}

Vector API implementation:

int i = 0;
for (; i < SPECIES_I.loopBound(r.length); i += SPECIES_I.length()) {
    IntVector v = IntVector.fromArray(SPECIES_I, a, i);
    v = v.mul(42);
    v.intoArray(r, i);
}
for (; i < r.length; i++) {
    r[i] = a[i] * 42;
}

Running the benchmark on an array with 300 elements on my x64 AVX512 laptop, and an aarch64 NEON OCI machine:

map performance 300

Algorithm 4: Iota (Fill with Index Values)

The iota method fills an array with index values 0, 1, 2, ....

Reference implementation (should be auto vectorized):

for (int i = 0; i < r.length; i++) {
    r[i] = i;
}
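
Following the pattern of the earlier examples, a core-library way to express iota is Arrays.setAll. To my knowledge it is not backed by a dedicated vectorized intrinsic, so its performance hinges on the JIT inlining the lambda:

```java
import java.util.Arrays;

class IotaSetAll {
    public static void main(String[] args) {
        int[] r = new int[10];
        // Arrays.setAll passes each index to the generator function.
        Arrays.setAll(r, i -> i);
        System.out.println(Arrays.toString(r));  // indices 0 through 9
    }
}
```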

Vector API implementation:

var iota = IntVector.broadcast(SPECIES_I, 0).addIndex(1);
int i = 0;
for (; i < SPECIES_I.loopBound(r.length); i += SPECIES_I.length()) {
    iota.intoArray(r, i);
    iota = iota.add(SPECIES_I.length());
}
for (; i < r.length; i++) {
    r[i] = i;
}

Running the benchmark on an array with 10000 elements on my x64 AVX512 laptop, and an aarch64 NEON OCI machine:

iota performance 10000

On AVX512, we get a clear speedup from auto vectorization. But on NEON, the auto vectorizer seems to fail, and we get only scalar performance. The Vector API implementation seems to generate slightly worse code than the best alternative. I inspected the assembly code generated on AVX512: there is just a small difference in quality between the Vector API and the auto vectorized code - with some extra time, one could probably tweak the Vector API implementation to get the same performance. NEON does not seem to support a vectorized index-generation instruction, which at least partly explains why there is no good vectorized performance there.

Conclusion

Vectorization brings clear speedups. The exact speedups depend on the implementation, the hardware (e.g. vector lengths), and some other factors (such as alignment).

These examples were deliberately simple, which means auto vectorization is more likely to succeed, and writing an implementation with the Vector API is quite simple. Here are some other examples that are a bit more complex:

Please leave a comment below

To edit/delete a comment: click on the date above your comment, e.g. just now or 5 minutes ago. This takes you to the GitHub issue page associated with this blog post. Find your comment, and edit/delete it by clicking the three dots ... on the top right.