Vectorizing Loops with Control-Flow using the Vector API
Many algorithms require control-flow. As of JDK26, the HotSpot JVM does not auto-vectorize loops with control-flow (other than the loop exit check). That will hopefully change in the future, but auto vectorization will always be limited: it is a pattern-matching engine and can never cover all cases. That’s where the Vector API (JEP) comes to the rescue: it allows us to express vectorized computations directly in Java, which compile down to fast SIMD vector instructions. In this blog post, we will look at some relatively simple but powerful and often-used algorithms that contain control-flow in the loop iterations.
For this blog post, I assume you are familiar with the basics of vectorization in Java (auto vectorization, vectorized intrinsics and the Vector API). Otherwise, please read this and this.
The benchmarks for all the algorithms can be found in the OpenJDK repository.
For each algorithm, we first consider the scalar reference implementation,
and then look at one or more optimized implementations that use the Vector API.
All of the benchmarks use arrays with 10000 elements.
Note that all the performance numbers are generated with JDK26, and a later JDK version may produce different (hopefully better!) results.
Algorithm 1: lowerCase
This first example is inspired by
String::toLowerCase:
it converts all characters to lower case.
The input is a byte[] array a with ASCII characters, and the goal is to convert all upper case
characters to lower case characters, and leave all others unchanged. The result is stored in
a separate byte[] array r.
We determine if a character is an upper case character by checking if it is within the range
of upper case characters, with a lower bound of 'A' and an upper bound of 'Z'.
The transformation to lower case character happens with an addition of 32,
which is the difference between the characters 'A' and 'a'.
Reference implementation:
for (int i = 0; i < a.length; i++) {
    byte c = a[i];
    if (c >= 'A' && c <= 'Z') {
        c += ('a' - 'A'); // c += 32
    }
    r[i] = c;
}
Vector API implementation (v1), check lower and upper bound separately:
for (i = 0; i < SPECIES_B.loopBound(a.length); i += SPECIES_B.length()) {
    var vc = ByteVector.fromArray(SPECIES_B, a, i);
    var maskA = vc.compare(VectorOperators.GE, (byte)'A');
    var maskZ = vc.compare(VectorOperators.LE, (byte)'Z');
    var mask = maskA.and(maskZ);
    vc = vc.add((byte)32, mask);
    vc.intoArray(r, i);
}
// omitting scalar cleanup
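The omitted scalar cleanup processes the tail elements after the last full vector. A minimal sketch, factored into a hypothetical helper (the `upperBound` parameter stands for the value of `SPECIES_B.loopBound(a.length)` where the vector loop stopped; this helper is mine, not from the benchmarks):

```java
// Hypothetical sketch of the omitted scalar cleanup: handle the tail
// elements that do not fill a complete vector.
static void cleanupTail(byte[] a, byte[] r, int upperBound) {
    for (int i = upperBound; i < a.length; i++) {
        byte c = a[i];
        if (c >= 'A' && c <= 'Z') {
            c += ('a' - 'A'); // c += 32
        }
        r[i] = c;
    }
}
```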
Vector API implementation (v2), transform input so we only need one bound check:
for (i = 0; i < SPECIES_B.loopBound(a.length); i += SPECIES_B.length()) {
    var vc = ByteVector.fromArray(SPECIES_B, a, i);
    // We can convert the range 65..90 (represents ASCII A..Z) into a range 0..25.
    // This allows us to only use a single unsigned comparison.
    var vt = vc.add((byte)-'A');
    var mask = vt.compare(VectorOperators.ULE, (byte)25);
    vc = vc.add((byte)32, mask);
    vc.intoArray(r, i);
}
// omitting scalar cleanup
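The same range-shift trick can also be written in scalar Java. A hedged sketch (the helper name `toLowerAscii` is mine, not from the benchmarks): masking with `0xFF` gives the unsigned byte view of `c - 'A'`, so a single comparison covers both bounds.

```java
// Scalar sketch of the v2 trick: shift 'A'..'Z' down to 0..25 so that a
// single unsigned comparison replaces the two signed bound checks.
static byte toLowerAscii(byte c) {
    int t = (c - 'A') & 0xFF; // unsigned view of c - 'A'; 0..25 for 'A'..'Z'
    return (t <= 25) ? (byte)(c + 32) : c;
}
```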
Here is a diagram that visualises the scalar branching computation as well as the vector computation with masks:
Running on an x64 AVX512 and an aarch64 NEON machine, using arrays with 10000 elements:
Focusing on only the Vector API implementations:
We see that the Vector API implementations are clearly faster than the scalar implementation. On AVX512 the speedup is 16x - 28x. On NEON the speedup is 12x - 26x. Consider that on AVX512 a vector holds 64 byte elements, and on NEON a vector holds 16 byte elements.
While the x-axis shows the time, the y-axis shows the branch probability. We generate the input randomly, but according to the branch probability:
- If the branch probability is high, we mostly have upper case characters (if-branch).
- If the branch probability is low, we mostly have lower case characters (else-branch).
- If the branch probability is in the middle (0.5), we have a random mix of upper and lower case characters.
Let’s first focus on the performance characteristic of the scalar implementation. We see that the performance depends on the branch probability:
- Low (lower case): fastest. The branch predictor works very well, and we don’t have to do any additions.
- High (upper case): fast. The branch predictor works very well, but we have to do additions which has some extra cost.
- Middle (mixed): slow. The branch predictor fails 50% of the time, and we suffer the branch misprediction penalty.
The branch predictor is amazing: it allows the CPU to speculate whether a branch is taken, based on its history. The CPU speculatively assumes one side of the branch is taken, and already executes down that path. Only later is the check actually performed. If it goes as speculated: great, we win! Not having to wait for the check to complete saves CPU cycles, because it cuts the latency link between the check and the computations of the branch. If the speculation was wrong, the CPU needs to throw away all the results after the wrong branch (pipeline flush), and resume computation at the right branch. This means we just wasted some CPU cycles on the wrong branch, which can be costly.
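For comparison, the misprediction penalty can also be avoided in scalar code with a branchless formulation, which is essentially what the masked vector operations do one lane at a time. A hypothetical sketch for the lowerCase kernel (not part of the benchmarks above):

```java
// Branchless scalar variant: derive a 0/1 flag from the sign bit, so there
// is no branch for the predictor to mispredict.
static byte toLowerBranchless(byte c) {
    int t = (c - 'A') & 0xFF;          // unsigned view of c - 'A'; 0..25 for 'A'..'Z'
    int isUpper = (t - 26) >>> 31;     // sign bit of t - 26: 1 if t <= 25, else 0
    return (byte)(c + (isUpper << 5)); // add 32 only for upper case
}
```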
Now let’s look at the Vector API performance:
Both implementations are roughly equally fast. On NEON v2 is slightly faster than v1.
The branch probability has no impact on the performance
because all vectorized instructions are always executed.
The control-flow is simulated by masked operations, but there is no performance
impact if the mask entries are true or false.
On AVX512 we have some individual down-spikes.
I suspect this might be due to (mis)alignment causing multimodal performance.
What we learn from this example: Generally, vectorization leads to speedups because of parallelization - but with a theoretical maximum of the vector length. On NEON where we have 16 byte elements in a vector this would mean we could only gain at most a factor of 16. But using masked vector operations, we can sometimes gain more performance: especially if we can avoid the branch misprediction penalty that the scalar implementation suffers from for some branch probabilities.
Algorithm 2: pieceWise
This example is a bit more contrived, but it shows a very interesting effect.
We want to apply a function f(x) to every input in the array a,
and store the results in the array r.
But the function f is piece-wise: for small input values we compute some multiplications, for high input values we compute some square roots.
Here is a plot of the function:
It is important to say: computing square roots is very expensive compared to multiplications.
Reference implementation:
for (int i = 0; i < a.length; i++) {
    float ai = a[i];
    if (ai < 1f) {
        float a2 = ai * ai;
        float a4 = a2 * a2;
        float a8 = a4 * a4;
        r[i] = a8;
    } else {
        float s2 = (float)Math.sqrt(ai);
        float s4 = (float)Math.sqrt(s2);
        float s8 = (float)Math.sqrt(s4);
        r[i] = s8;
    }
}
Vector API implementation (v1), unconditionally compute both branches for all lanes:
for (i = 0; i < SPECIES_F.loopBound(a.length); i += SPECIES_F.length()) {
    var ai = FloatVector.fromArray(SPECIES_F, a, i);
    var mask = ai.compare(VectorOperators.LT, 1f);
    var a2 = ai.lanewise(VectorOperators.MUL, ai);
    var a4 = a2.lanewise(VectorOperators.MUL, a2);
    var a8 = a4.lanewise(VectorOperators.MUL, a4);
    var s2 = ai.lanewise(VectorOperators.SQRT);
    var s4 = s2.lanewise(VectorOperators.SQRT);
    var s8 = s4.lanewise(VectorOperators.SQRT);
    var v = s8.blend(a8, mask);
    v.intoArray(r, i);
}
// omitting scalar cleanup
Vector API implementation (v2), unconditionally compute the multiplications for all lanes, but only compute the square roots if at least one lane needs it:
for (i = 0; i < SPECIES_F.loopBound(a.length); i += SPECIES_F.length()) {
    var ai = FloatVector.fromArray(SPECIES_F, a, i);
    var mask = ai.compare(VectorOperators.LT, 1f);
    var a2 = ai.lanewise(VectorOperators.MUL, ai);
    var a4 = a2.lanewise(VectorOperators.MUL, a2);
    var a8 = a4.lanewise(VectorOperators.MUL, a4);
    var v = a8;
    // SQRT is expensive, so only call it if it is necessary
    if (!mask.allTrue()) {
        var s2 = ai.lanewise(VectorOperators.SQRT);
        var s4 = s2.lanewise(VectorOperators.SQRT);
        var s8 = s4.lanewise(VectorOperators.SQRT);
        v = s8.blend(a8, mask);
    }
    v.intoArray(r, i);
}
// omitting scalar cleanup
Running on an x64 AVX512 and an aarch64 NEON machine, using arrays with 10000 elements:
The scalar implementation has branches, so it is sensitive to the branch probability:
- Low (mostly sqrt): slow, because sqrt is slow.
- High (mostly mul): fast, because mul is fast.
- Middle (mixed): there is the classic branch misprediction penalty bump for NEON, though on AVX512 it is very faint.
The non-branching Vector API implementation (v1) is not sensitive to the branching probability.
It is faster than the scalar implementation for some but not all branching probabilities.
The branching Vector API implementation (v2) is the fastest implementation in almost all cases,
because it combines the benefits of vectorization with the benefits of branching to
avoid square root computation when possible.
But on NEON the scalar implementation seems to be slightly faster for branch probabilities around 0.125.
What we can learn from this example: We should not apply vectorization blindly. Using masked vector operations sometimes means we have to execute both branches. If one branch is much more expensive than the other, this can mean that scalar branch prediction outperforms vectorized implementations, at least for some branch probabilities.
Algorithm 3: find
We want to find the first occurrence of some element e in an array a.
This is inspired by methods like
String::indexOf
and
List::indexOf.
The previous two algorithms were both element-wise (i.e. lane-wise). But here we have an algorithm with an early exit condition.
Reference implementation:
for (int i = 0; i < a.length; i++) {
    if (a[i] == e) {
        return i;
    }
}
return -1;
Below we have an example with the string “voxxeddays”, and we are looking for the first occurrence of the character “d”.
As long as we have no match we do nothing and keep going. As soon as we find the character “d” we return the current index i.
Vector API implementation:
int i = 0;
for (; i < SPECIES_I.loopBound(a.length); i += SPECIES_I.length()) {
    IntVector v = IntVector.fromArray(SPECIES_I, a, i);
    var mask = v.compare(VectorOperators.EQ, e);
    if (mask.anyTrue()) {
        return i + mask.firstTrue();
    }
}
// omitting scalar cleanup
return -1;
In our Vector API implementation, we now load a vector of input values,
possibly going beyond the first occurrence of “d”.
We get a mask that marks all occurrences of “d” (compare),
then check if we marked any (anyTrue).
If we marked none, we do nothing. If we marked some, we find the first occurrence (firstTrue).
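The mask queries used here can be described in scalar terms. A sketch with hypothetical helpers, simulating a mask as a boolean array (note that VectorMask::firstTrue returns the vector length when no lane is set):

```java
// Scalar sketches of the mask queries used above (hypothetical helpers).
static boolean anyTrue(boolean[] mask) {
    for (boolean m : mask) {
        if (m) return true; // at least one lane matched
    }
    return false;
}

static int firstTrue(boolean[] mask) {
    for (int l = 0; l < mask.length; l++) {
        if (mask[l]) return l; // index of the first matching lane
    }
    return mask.length; // mirrors VectorMask::firstTrue when no lane is set
}
```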
Running on an x64 AVX512 and an aarch64 NEON machine, using arrays with 10000 elements:
And using arrays with 300 elements:
We see that vectorization is clearly profitable, though the exact factor depends on the CPU and the length of the arrays.
Algorithm 4: mismatch
We want to find the first index at which two byte arrays a and b differ, or -1 if they have no difference.
This is inspired by
Arrays::mismatch,
and related methods like Arrays::equals and Arrays::compare, which can be implemented using Arrays::mismatch.
All of these are backed by vectorized intrinsics.
This benchmark is a newer addition to the OpenJDK repository.
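To illustrate that relationship, here is a hedged sketch of how equals and compare can be expressed via Arrays::mismatch (the helper names are mine; this is not the actual JDK implementation):

```java
import java.util.Arrays;

// Sketch: equals and compare expressed via mismatch (not the JDK's code).
static boolean equalsViaMismatch(byte[] a, byte[] b) {
    // equal iff same length and no mismatching index
    return a.length == b.length && Arrays.mismatch(a, b) < 0;
}

static int compareViaMismatch(byte[] a, byte[] b) {
    int m = Arrays.mismatch(a, b);
    if (m >= 0 && m < Math.min(a.length, b.length)) {
        return Byte.compare(a[m], b[m]); // arrays differ at index m
    }
    return a.length - b.length; // one is a prefix of the other (or equal)
}
```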
Reference implementation:
for (int i = 0; i < a.length; i++) {
    if (a[i] != b[i]) {
        return i;
    }
}
return -1;
Arrays::mismatch implementation:
return Arrays.mismatch(a, b);
MemorySegment::mismatch implementation:
var aMS = MemorySegment.ofArray(a);
var bMS = MemorySegment.ofArray(b);
return (int) aMS.mismatch(bMS);
Vector API implementation:
int i = 0;
for (; i < SPECIES_B.loopBound(a.length); i += SPECIES_B.length()) {
    ByteVector va = ByteVector.fromArray(SPECIES_B, a, i);
    ByteVector vb = ByteVector.fromArray(SPECIES_B, b, i);
    var mask = va.compare(VectorOperators.NE, vb);
    if (mask.anyTrue()) {
        return i + mask.firstTrue();
    }
}
// omitting scalar cleanup
return -1;
The Vector API implementation is very similar to the one for find: instead of comparing to the search
element e, we compare to the value in the other array.
Running on an x64 AVX512 and an aarch64 NEON machine, using arrays with 10000 elements:
And using arrays with 300 elements:
Observations:
- The reference implementation is not vectorized, but all others are.
- The Arrays::mismatch and MemorySegment::mismatch implementations seem to be equally performant.
- On AVX512, the Vector API implementation uses 512-bit (zmm) registers, but the Arrays::mismatch and MemorySegment::mismatch implementations only seem to use 256-bit (ymm) registers. Accordingly, the Vector API implementation is about 2x as fast. I suspect that the vectorized intrinsics have not yet been adjusted for AVX512.
- On my NEON machine, the Vector API implementation seems to be slightly slower than the Arrays and MemorySegment implementations; I have not yet investigated why.
Algorithm 5: filter
The previous algorithms were either element-wise, where all lanes were independent,
or they had an early exit but no side-effects.
The filter example takes an input array a and keeps all elements that are at or above
the threshold, rejecting all other elements.
The elements that are kept are stored contiguously into an output array r.
The current position in the output array is tracked with the secondary
index variable j.
Reference implementation:
int j = 0;
for (int i = 0; i < a.length; i++) {
    int ai = a[i];
    if (ai >= threshold) {
        r[j++] = ai;
    }
}
In the example below, we have a threshold = 0, so we are keeping all positive values (including zero).
Note that the values are copied “down”, we always have j <= i.
The secondary index variable j is only sometimes incremented, while the primary index variable i is always incremented.
The variable j poses a challenge for parallelization, because it counts how many of the previous
iterations had a true-condition.
This is a so-called control-dependent loop-carried dependency: it depends on the result of
the control-flow of previous iterations.
Luckily, some CPUs have hardware instructions that perform such “filtering”
on a per-vector level, and they are available in the Vector API as compress operations.
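In scalar terms, a compress packs the lanes selected by the mask to the front of the vector. A hypothetical sketch of that semantics, simulating a small vector as an array (not the actual Vector API implementation):

```java
// Scalar simulation of a per-vector compress: keep the masked elements,
// packed to the front; the remaining lanes are left as zero.
static int[] compress(int[] lanes, boolean[] mask) {
    int[] out = new int[lanes.length];
    int k = 0;
    for (int l = 0; l < lanes.length; l++) {
        if (mask[l]) {
            out[k++] = lanes[l]; // append kept element at the next free slot
        }
    }
    return out;
}
```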
Vector API implementation (v1), using compress:
int j = 0;
int i = 0;
for (; i < SPECIES_I.loopBound(a.length); i += SPECIES_I.length()) {
    IntVector v = IntVector.fromArray(SPECIES_I, a, i);
    var mask = v.compare(VectorOperators.GE, threshold);
    v = v.compress(mask);
    var prefixMask = mask.compress();
    v.intoArray(r, j, prefixMask);
    j += mask.trueCount();
}
// omitting scalar cleanup
In the example below, we load a vector of 4 elements, and create a mask that marks which elements we want to keep.
We compress both the elements we want to keep, as well as the mask.
Now the elements we keep and the mask are “compressed” to the beginning of the vector,
and they are all adjacent.
Now that we have managed to make our elements adjacent, we can use
a masked store operation to only store those elements to the output.
Finally, we count how many elements we stored, so we can update our secondary index variable j accordingly.
Note: while we always consume a full vector of inputs (here 4 elements),
we may store an arbitrary number of elements (here 0-4).
Sadly, not all CPUs have those “filtering” assembly instructions,
and so compress is not (yet) handled well by the Vector API,
and we resort to a very slow fallback implementation.
Also, some CPUs don’t even have masked store instructions.
Therefore, we also investigate an alternative approach below that does not require
any compress and also no masked store operation.
Vector API implementation (v2_l2), using a 2-element vector and only vectorizing the all-true case:
int j = 0;
int i = 0;
for (; i < SPECIES_I64.loopBound(a.length); i += SPECIES_I64.length()) {
    IntVector v = IntVector.fromArray(SPECIES_I64, a, i);
    var mask = v.compare(VectorOperators.GE, threshold);
    if (mask.allTrue()) {
        v.intoArray(r, j);
        j += 2;
    } else if (mask.anyTrue()) {
        if (mask.laneIsSet(0)) { r[j++] = v.lane(0); }
        if (mask.laneIsSet(1)) { r[j++] = v.lane(1); }
    } else {
        // nothing
    }
}
Note: we can generalize the v2 implementation to more elements (4-element: v2_l4, 8-element: v2_l8).
In each case, we have an all-true branch that can do a store without mask,
and an all-false branch that does nothing.
We call those two branches the uniform branches, because the mask is either uniform-true or uniform-false.
The mixed branch we call divergent: we have a scalar implementation where every
lane is simulated with individual control flow. Some will take the respective true-branch,
others the respective false-branch (i.e. do nothing).
Running on an x64 AVX512 and an aarch64 NEON machine:
The results above show that some Vector API implementations are significantly slower than all other implementations. Observations on the slow cases:
- On x86 machines, we unfortunately do not yet allow 2-element masks, and so the v2_l2 implementation resorts to a slow scalar Java fallback in the Vector API. We hope to allow 2-element masks in the future.
- On aarch64 NEON, the compress operation is not directly supported in hardware, and we resort to a slow Java fallback implementation in the Vector API. Also the v2_l8 implementation resorts to the slow fallback, because we requested a vector length of 256 bits but only 128-bit vectors are available. We are currently considering solutions to these issues.
- For all of the slow cases, we can see that performance is sensitive to the branch probability (i.e. how many elements we keep). This is because the fallback implementation in the Vector API simulates all the vector operations with scalar operations, and uses control-flow to do so (e.g. for compress and masked store operations).
Below, we focus on only the fast cases:
Observations:
- The scalar performance has the classic branch-prediction performance characteristic. If most elements are rejected, it is fastest. If most elements are taken, it is slightly slower because we need to store the elements. The worst performance happens at branch probability 0.5, because of the branch misprediction penalty.
- The Vector API implementation (v1) that uses compress is the only implementation that is not sensitive to the branch probability, as all operations are executed unconditionally. It seems to generally be the most performant implementation, except for v2_l8 with very extreme branch probabilities.
- The v2 implementations share the same general performance characteristics, which are also similar to the scalar performance characteristic. This is because v2 has branches, and thus is sensitive to branch prediction behavior. The longer the vector length, the better the performance in the extreme (high and low) branch probability region, because more elements can be taken or rejected at a time. The 8-element vector v2_l8 implementation can thus even outperform the compress implementation, because compress and masked store operations are more expensive than non-masked store operations.
- On aarch64 NEON, it seems the scalar performance is the best. I have not yet investigated why, but suspect it might be that the vectorized allTrue and anyTrue checks contribute to the overhead.
Conclusion
We showed that the Vector API is capable of vectorizing some basic yet very powerful algorithms
with control-flow in the loop.
Understanding branch prediction is very important in some of these cases:
the performance depends on the inputs, especially when there are branch prediction penalties.
We saw some cases where the control-flow is element-wise (lowerCase, pieceWise),
some with an early exit (find, mismatch), and even a case with control-dependent
loop-carried dependency (filter).
We have seen that different vectorized implementations have different performance characteristics,
and in some cases the reference implementation remains the fastest implementation we considered
for a specific case and branch probability.
We have also seen that sometimes the Vector API fails its goal of “graceful degradation”:
we resort to a Java fallback that is horribly slow.
There are plans to address that in the future.
But for now, the recommendation is to implement both a reference implementation and a vectorized
implementation, to benchmark them on each relevant platform, and then run the fastest
implementation on the respective platform.