JDK26: Auto Vectorization and Aliasing
Aliasing is one of the biggest obstacles to vectorization and auto vectorization in particular. As we will see further below, it is also relevant for uses of the Vector API.
Definition Aliasing: two memory accesses are said to alias if their accessed memory regions overlap.
As we will see, auto vectorization requires the reordering of loads and stores. Reordering a load and a store, or reordering two stores is only safe if these two memory accesses do not alias (overlap).
Let’s take it step by step and start with an example:
for (int i = init; i < limit; i++) {
    b[i] = a[i];
}
To auto vectorize the loop above, we first unroll the loop, and then reorder the loads and stores, and pack them into vectorized loads and stores:
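To make this concrete, here is a plain-Java sketch of what the unrolled-and-packed loop effectively does, assuming a hypothetical 4-lane vector width (the class and method names are made up for illustration):

```java
public class UnrollSketch {
    // 4x unrolled copy with all loads hoisted above all stores, mirroring
    // how a 4-lane vector load precedes the corresponding vector store.
    static void copyUnrolled4(int[] a, int[] b, int init, int limit) {
        int i = init;
        for (; i + 4 <= limit; i += 4) {
            // four loads first (what a packed vector load does)...
            int v0 = a[i], v1 = a[i + 1], v2 = a[i + 2], v3 = a[i + 3];
            // ...then four stores (what a packed vector store does)
            b[i] = v0; b[i + 1] = v1; b[i + 2] = v2; b[i + 3] = v3;
        }
        for (; i < limit; i++) { b[i] = a[i]; } // scalar tail
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        int[] b = new int[7];
        copyUnrolled4(a, b, 0, 7);
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

The reordering is only observable when a and b alias; for disjoint arrays this computes exactly what the scalar loop computes.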
But how can we prove that the reordering is ok, i.e. that we can move all the loads to the top and the
stores to the bottom?
If a and b refer to different arrays, we know that the loads are not affected by the stores,
and thus it is ok to move the loads to the top.
If a = b, then we need to look at the indices of the array accesses.
We can see that the element i is only accessed in iteration i, and the element i+1 only in iteration i+1.
This means that operations of different iterations cannot interfere with each other, and thus we can
move all the loads to the top.
When the Compiler cannot prove that there is no Aliasing
Now let’s look at another example where reordering could be problematic:
for (int i = init; i < limit; i++) {
    b[i] = a[i + offset];
}
If the two references a and b point to different arrays, we can vectorize without any concerns.
But the compiler may not be able to prove that a and b do not alias, i.e. point to the same array.
For example, we may have the following scenario at runtime, where a = b and offset = -1,
in which case each iteration loads the value from the previous element and copies it to the current one:
As visualized below, this means that the first element is copied forward in every iteration, so there is a dependency
between the iterations. The implication is that the iterations are decidedly not parallelizable, and so naive vectorization
will lead to wrong results, as we can see in the lower half of the graphic below. On the left, we depict the values
in the array before the execution of the loop, on the right the result. A correct execution copies the 0th element
with value 0 forward in each iteration, filling the whole array with 0. A wrongly vectorized execution copies
4 elements at a time, which leads to wrong results.
So even if it is very rare that a = b and the offset is such that vectorization leads to wrong results,
the compiler cannot just ignore that (faint) possibility - the generated code needs to be always correct!
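The effect described above can be reproduced in plain Java. The sketch below runs the scalar loop (with a = b and offset = -1) next to a naive 4-wide version that hoists the loads above the stores; the 4-lane width is an assumption for illustration:

```java
import java.util.Arrays;

public class AliasDemo {
    // Correct scalar execution of: for i in [1, n): x[i] = x[i - 1]
    // (i.e. b[i] = a[i + offset] with a == b and offset = -1).
    // Each iteration depends on the previous one, so the first element
    // is propagated through the whole array.
    static void scalar(int[] x) {
        for (int i = 1; i < x.length; i++) x[i] = x[i - 1];
    }

    // Naive 4-wide "vectorization": load 4 elements, then store 4 elements.
    // This reorders the loads above the stores and ignores the dependency.
    static void naiveVectorized(int[] x) {
        int i = 1;
        for (; i + 4 <= x.length; i += 4) {
            int v0 = x[i - 1], v1 = x[i], v2 = x[i + 1], v3 = x[i + 2];
            x[i] = v0; x[i + 1] = v1; x[i + 2] = v2; x[i + 3] = v3;
        }
        for (; i < x.length; i++) x[i] = x[i - 1]; // scalar tail
    }

    public static void main(String[] args) {
        int[] s = {0, 1, 2, 3, 4, 5, 6, 7};
        int[] v = s.clone();
        scalar(s);          // -> all zeros: [0, 0, 0, 0, 0, 0, 0, 0]
        naiveVectorized(v); // -> wrong:     [0, 0, 1, 2, 3, 3, 3, 3]
        System.out.println(Arrays.toString(s));
        System.out.println(Arrays.toString(v));
    }
}
```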
For arrays, many loop shapes are “nice” where every iteration only accesses the i’th element,
and does not have such an arbitrary offset that makes compile-time aliasing analysis fail.
But for MemorySegments,
the problem is much worse: we cannot prove that there is no aliasing for even “nice” looking examples like these:
for (long i = 0; i < a.byteSize(); i++) {
    byte v = a.get(ValueLayout.JAVA_BYTE, i);
    b.set(ValueLayout.JAVA_BYTE, i, v);
}
With arrays, a and b either refer to different arrays with no overlap at all, or they are identical.
But for MemorySegments, the starting address is essentially arbitrary, and so a and b could also partially overlap:
This means that the problem of aliasing is even more pressing for MemorySegments than for arrays.
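A minimal demonstration of such a partial overlap, using two slices of the same backing array (this requires JDK 22+ for the finalized java.lang.foreign API; the slice offsets are chosen arbitrarily for illustration):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OverlapDemo {
    // Copies a into b element-wise, where a and b are partially
    // overlapping views (slices) of the same backing array.
    static byte[] run() {
        byte[] backing = new byte[12];
        for (int i = 0; i < backing.length; i++) backing[i] = (byte) i;
        MemorySegment whole = MemorySegment.ofArray(backing);
        MemorySegment a = whole.asSlice(0, 8); // bytes [0, 8)
        MemorySegment b = whole.asSlice(4, 8); // bytes [4, 12): overlaps a in [4, 8)
        for (long i = 0; i < a.byteSize(); i++) {
            byte v = a.get(ValueLayout.JAVA_BYTE, i);
            b.set(ValueLayout.JAVA_BYTE, i, v);
        }
        return backing;
    }

    public static void main(String[] args) {
        // The partial overlap creates a cross-iteration dependency:
        // the first 4 bytes get replicated through the array.
        System.out.println(java.util.Arrays.toString(run()));
        // prints [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
    }
}
```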
Aliasing Runtime Checks to the Rescue! (JDK26)
In JDK26, I was able to implement aliasing runtime checks (JDK-8324751).
While we cannot prove non-aliasing at compile-time for many cases, we can check for non-aliasing at runtime (for a small overhead). We check if there is aliasing/overlap, and only if the check passes do we enter the vectorized loop.
But how do we compute if there is aliasing? For each pair of memory accesses that we need to reorder, but for which we cannot prove at compile time that there is no aliasing, we generate a formula that checks that there is no overlap. So far, we have only vectorized memory accesses with a linear address form. Of course Java does not expose addresses directly, but the compiler needs to convert accesses to arrays, MemorySegments, etc. into addresses. Some examples:
- byte array: a[i] -> a.base + con_offset_to_first_element + i
- int array: a[i + 10] -> a.base + con_offset_to_first_element + 4L * (i + 10)
- MemorySegment with off-heap memory: ms.get(ValueLayout.JAVA_DOUBLE_UNALIGNED, 8L * i) -> ms.address + 8L * i
In the examples above, we assume that i is the loop counter, and we can see that the address progresses
linearly as we increment i over the loop iterations. Let us visualize some examples, where each iteration has
two memory accesses that could possibly overlap:
We essentially have pairs of lines, or rather bands, as each memory access has a pointer where it starts as well as a size. A little simplified, we get the following formulation:
- start1(i) = a1 * i + b1, and some size1, where the access in the i'th iteration goes to [start1(i), start1(i) + size1).
- start2(i) = a2 * i + b2, and some size2, where the access in the i'th iteration goes to [start2(i), start2(i) + size2).
- For all i in the loop range: start1(i) >= start2(i) + size2 or start1(i) + size1 <= start2(i)
This condition checks that either the first access always happens above the second, or the first access always happens below the second,
which implies that there cannot be any overlap.
Note that this condition goes over all iterations i, so we will generate a check that can be executed once before the loop.
If the check passes, we know that there is no aliasing for any loop iteration.
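A sketch of such a check in plain Java. Note this version is a conservative simplification: it checks that the total memory footprints of the two accesses over the whole loop range are disjoint, which is sufficient for safe reordering but slightly stricter than the per-iteration condition above. Because the address forms are linear, the extreme addresses occur at the loop endpoints:

```java
public class AliasCheck {
    // One-time runtime check for two accesses with linear address forms
    // start1(i) = a1*i + b1 (size1 bytes) and start2(i) = a2*i + b2
    // (size2 bytes), over loop iterations i in [init, limit).
    static boolean noAlias(long a1, long b1, long size1,
                           long a2, long b2, long size2,
                           long init, long limit) {
        // Linear address forms take their extreme values at the endpoints,
        // so the total footprint of each access is a single range.
        long lo1 = Math.min(a1 * init + b1, a1 * (limit - 1) + b1);
        long hi1 = Math.max(a1 * init + b1, a1 * (limit - 1) + b1) + size1;
        long lo2 = Math.min(a2 * init + b2, a2 * (limit - 1) + b2);
        long hi2 = Math.max(a2 * init + b2, a2 * (limit - 1) + b2) + size2;
        // Disjoint total footprints -> no pair of iterations can overlap.
        return hi1 <= lo2 || hi2 <= lo1;
    }

    public static void main(String[] args) {
        // byte accesses base+i vs base+100+i over i in [0, 10): no overlap.
        System.out.println(noAlias(1, 0, 1, 1, 100, 1, 0, 10)); // true
        // byte accesses base+i vs base+5+i over i in [0, 10): overlap.
        System.out.println(noAlias(1, 0, 1, 1, 5, 1, 0, 10));   // false
    }
}
```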
Note that we may need multiple such checks if there are multiple pairs of possibly aliasing memory references. Some examples:
- Fill pattern: only has one store per iteration, trivially no aliasing.
- Copy pattern: a load and a store, requires one check.
- Addition of two vectors: 2 loads and 1 store, two checks, one per load-store pair (load-load pair is always safe to reorder).
- 4 stores: need check for each store-store pair, a total of 6 checks.
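The counts above follow from a simple formula, assuming every load-store and every store-store pair needs one check, while load-load pairs never do:

```java
public class CheckCount {
    // Number of aliasing runtime checks for a loop with the given number
    // of (possibly aliasing) loads and stores per iteration:
    // one check per load-store pair, plus one per store-store pair.
    static int checks(int loads, int stores) {
        return loads * stores + stores * (stores - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(checks(0, 1)); // fill: 0
        System.out.println(checks(1, 1)); // copy: 1
        System.out.println(checks(2, 1)); // vector addition: 2
        System.out.println(checks(0, 4)); // 4 stores: 6
    }
}
```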
It is reasonable to assume that the aliasing check will only rarely fail. But what should we do if it does fail? Many compilers resort to compiling two loops: one that is vectorized and one that stays scalar (non-vectorized). I call this the multiversioning approach: we have multiple versions of the loop, which allows us to make assumptions for one loop (no aliasing) and optimize it more aggressively (vectorize), while the other loop makes no assumptions and therefore gets fewer optimizations. The downside of multiversioning is that we generate more code, which means we spend more time compiling and use more code space, which can hurt instruction cache locality.
But since C2 is a JIT compiler, we have a powerful tool available: deoptimization! We can speculate that the aliasing runtime check never fails. If the check passes, we go to the vectorized loop. But if the check fails, we deoptimize, and go back to the interpreter. The benefit is that we spend less time compiling, and use up less code space and get better instruction cache locality. The downside of only using speculation and never multiversioning is that if our speculation is ever wrong, if we ever do encounter an aliasing case, we pay a high penalty: we get the very slow performance of the interpreter, which is much slower than the scalar (non-vectorized) loop we would have entered with multiversioning.
Both solutions on their own are not super satisfactory. That is why we implement a hybrid approach that combines them both. At first, we use the speculation/deoptimization approach, in the hope that the aliasing assumption never fails. For most loops, this should always hold. We get great performance, faster compilation and good code cache locality. In the few cases where the assumption fails, we have to pay a penalty, but would like to eventually reach great peak performance for both the aliasing and non-aliasing case. So we deoptimize, and eventually recompile with multiversioning. This way, we get the great vectorized performance for non-aliasing cases, and still very good performance of the compiled scalar (non-vectorized) loop.
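The hybrid strategy can be sketched as a plain-Java simulation. This is of course not how C2 structures its compiled code; here a static flag stands in for "this method was deoptimized and recompiled with multiversioning":

```java
public class HybridSketch {
    // Stands in for the JIT state "deoptimized once, now recompiled
    // with multiversioning".
    static boolean recompiledWithMultiversioning = false;
    static int deoptCount = 0;

    static void copy(int[] a, int[] b) {
        if (!recompiledWithMultiversioning) {
            // First compilation: speculate that the aliasing check passes.
            if (a != b) { vectorized(a, b); return; }
            // Speculation failed: "deoptimize" and recompile with
            // multiversioning (in reality: fall back to the interpreter first).
            deoptCount++;
            recompiledWithMultiversioning = true;
        }
        // Multiversioned compilation: both loop versions are present.
        if (a != b) { vectorized(a, b); } else { scalar(a, b); }
    }

    static void vectorized(int[] a, int[] b) { // stand-in for the vector loop
        System.arraycopy(a, 0, b, 0, a.length);
    }

    static void scalar(int[] a, int[] b) {     // stand-in for the scalar loop
        for (int i = 0; i < a.length; i++) b[i] = a[i];
    }
}
```

After the first aliasing invocation, all later invocations take one of the two compiled versions; we never pay the deoptimization penalty again.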
As we can see in the benchmarks of JDK-8324751, this leads to great speedups for a lot of loop shapes, especially copy patterns for arrays and even more loop shapes that use multiple MemorySegment accesses. However, there are some 10-20% regressions for cases where we alias all the time (JDK-8366274). I suspect that at least part of the explanation is due to lower instruction cache locality, but I have not yet been able to do a thorough performance investigation. In my opinion it is an acceptable regression for such an edge case: as soon as we have any non-aliasing invocations of the loops mixed in, the speedups from vectorization for those cases completely outweigh the small performance regression.
I ran the VectorBulkOperationsMemorySegment.copy_byte_loop benchmark
on my AVX512 laptop. We can see that with the new aliasing runtime check, JDK26 has significant speedups, especially
with byte array sizes above about 64 bytes. In the range around 10 bytes, there are some slight regressions,
which is a known issue with our auto vectorization approach, and under active investigation.
Aliasing and Vector API
If you are using the Vector API to speed up your code, you may have to consider the edge case of aliasing and implement solutions for it yourself.
import java.util.Arrays;
import jdk.incubator.vector.*;

public class Test {
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    public static void main(String[] args) {
        System.out.println("test1:");
        int[] x = generate();
        print(x);
        test1(x, x, 2, 20);
        print(x);

        System.out.println("test2:");
        int[] y = generate();
        print(y);
        test2(y, y, 2, 20);
        print(y);
    }

    public static int[] generate() {
        int[] a = new int[22];
        Arrays.setAll(a, i -> 100 + i);
        return a;
    }

    public static void print(int[] a) {
        System.out.print("[");
        for (int i = 0; i < a.length; i++) {
            System.out.print(" " + a[i]);
        }
        System.out.println("]");
    }

    public static void test1(int[] a, int[] b, int offset, int length) {
        for (int i = 0; i < length; i++) {
            b[i + offset] = a[i];
        }
    }

    public static void test2(int[] a, int[] b, int offset, int length) {
        for (int i = 0; i < SPECIES.loopBound(length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            v.intoArray(b, i + offset);
        }
    }
}
Above, test2 looks like the straightforward vectorization of test1, using the Vector API.
But the results are not the same, because of aliasing:
java Test.java
test1:
[ 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121]
[ 100 101 100 101 100 101 100 101 100 101 100 101 100 101 100 101 100 101 100 101 100 101]
test2:
[ 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121]
[ 100 101 100 101 102 103 102 103 106 107 106 107 110 111 110 111 114 115 114 115 118 119]
Because offset = 2, test1 produces a 2-element repetitive pattern.
But for test2, we implicitly reordered the loads and stores, and since there is aliasing,
this leads to different results.
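If you want test2 to keep the semantics of test1 even under aliasing, you can add your own runtime check and fall back to the scalar loop. The sketch below simulates the 4-lane vector with a manual chunked loop so it runs without the incubator module; a real version would use the IntVector calls from test2 instead. The names and the conservative check are my own, and unlike test2 this variant also handles the tail:

```java
public class SafeCopy {
    static final int LANES = 4; // stand-in for SPECIES.length()

    // Reference semantics: the scalar loop from test1.
    static void scalarCopy(int[] a, int[] b, int offset, int length) {
        for (int i = 0; i < length; i++) b[i + offset] = a[i];
    }

    // Aliasing-aware variant: a conservative runtime check falls back to
    // the scalar loop whenever a and b are the same array (unless the copy
    // is a trivial self-copy at offset 0, which is always safe).
    static void safeCopy(int[] a, int[] b, int offset, int length) {
        if (a == b && offset != 0) {
            scalarCopy(a, b, offset, length); // possible aliasing: stay scalar
            return;
        }
        int i = 0;
        for (; i + LANES <= length; i += LANES) {
            // stand-in for IntVector.fromArray(SPECIES, a, i) ...
            int v0 = a[i], v1 = a[i + 1], v2 = a[i + 2], v3 = a[i + 3];
            // ... and v.intoArray(b, i + offset)
            b[i + offset] = v0;     b[i + 1 + offset] = v1;
            b[i + 2 + offset] = v2; b[i + 3 + offset] = v3;
        }
        for (; i < length; i++) b[i + offset] = a[i]; // scalar tail
    }
}
```

This mirrors what the auto vectorizer now does for you: check for aliasing once before the loop, and take the vectorized path only when the check passes.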
Links
YouTube video that covers pointers and aliasing: Section on Pointers and Aliasing in JVMLS 2025 Talk