Aliasing is one of the biggest obstacles to vectorization, and to auto-vectorization in particular. As we will see further below, it is also relevant for uses of the Vector API.

Aliasing is about overlap of memory regions, which can lead to problems when we try to reorder memory accesses for vectorization. But let’s take it step by step and start with an example:

for (int i = init; i < limit; i++) {
    b[i] = a[i];
}

To vectorize the loop above, we first unroll it, and then reorder the loads and stores and pack them into vector loads and stores:

[Figure: the loop is unrolled, the loads are moved to the top and the stores to the bottom, and they are packed into vector loads and stores]

But how can we prove that the reordering is ok, i.e. that we can move all the loads to the top and the stores to the bottom? If a and b refer to different arrays, we know that the loads are not affected by the stores, and thus it is ok to move the loads to the top. If a = b, then we need to look at the indices of the array accesses. We can see that the accesses of iteration i go only to element i, and accesses of iteration i+1 go only to element i+1. This means that operations of different iterations cannot interfere with each other, and thus we can move all the loads to the top.
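In (hypothetical) source form, and assuming the trip count is a multiple of 4, the transformation looks roughly like this, with all loads moved to the top and all stores moved to the bottom of the unrolled loop body:

for (int i = init; i < limit; i += 4) {
    // All loads first (these get packed into one vector load):
    int v0 = a[i];
    int v1 = a[i + 1];
    int v2 = a[i + 2];
    int v3 = a[i + 3];
    // All stores last (these get packed into one vector store):
    b[i]     = v0;
    b[i + 1] = v1;
    b[i + 2] = v2;
    b[i + 3] = v3;
}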

When the Compiler cannot prove that there is no Aliasing

Now let’s look at another example where reordering would be problematic:

for (int i = init; i < limit; i++) {
    b[i] = a[i + offset];
}

If the two references a and b point to different arrays, we can vectorize without any concerns. But the compiler may not be able to prove that a and b do not alias, i.e. point to the same array. For example, we may have the following scenario at runtime, where a = b and offset = -1, in which case each iteration loads the value from the previous element and copies it into the current one:

[Figure: a = b with offset = -1, so every iteration copies the previous element into the current one]

As visualized below, this means that the first element is copied forward in every iteration, so there is a dependency between the iterations. The implication is that the iterations are decidedly not parallel, and so naive vectorization will lead to wrong results, as we can see in the lower half of the graphic below:

[Figure: scalar execution propagates the first element forward (top); naive vectorization produces wrong results (bottom)]
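We can reproduce this dependency with a small scalar example (hypothetical values):

int[] a = {5, 6, 7, 8};
int[] b = a;  // aliasing: a and b refer to the same array
int offset = -1;
for (int i = 1; i < 4; i++) {
    b[i] = a[i + offset];
}
// a is now {5, 5, 5, 5}: the first element was propagated forward.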

So even if it is very rare that a = b and the offset is such that vectorization leads to wrong results, the compiler cannot just ignore that (faint) possibility - the generated code must always be correct!

For arrays, many loop shapes are “nice”: every iteration accesses only the i’th element, without an arbitrary offset, so compile-time aliasing analysis succeeds. But for MemorySegments, the problem is much worse: we cannot prove the absence of aliasing even for “nice” looking examples like this one:

for (long i = 0; i < a.byteSize(); i++) {
    byte v = a.get(ValueLayout.JAVA_BYTE, i);
    b.set(ValueLayout.JAVA_BYTE, i, v);
}

With arrays, a and b either refer to different arrays with no overlap, or are identical. But with MemorySegments, the starting address is essentially arbitrary, so a and b could also partially overlap:

[Figure: two MemorySegments a and b whose memory regions partially overlap]

This means that the problem of aliasing is even more acute for MemorySegments than for arrays.
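For example, it is easy to construct partially overlapping segments by slicing a shared backing segment (a hypothetical example):

import java.lang.foreign.*;

try (Arena arena = Arena.ofConfined()) {
    MemorySegment backing = arena.allocate(16);
    // Two views of the same memory, shifted by 4 bytes:
    MemorySegment a = backing.asSlice(0, 12);  // bytes [0, 12)
    MemorySegment b = backing.asSlice(4, 12);  // bytes [4, 16)
    // a and b overlap in bytes [4, 12) of the backing segment, so the
    // copy loop above would have loop-carried dependencies.
}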

Aliasing Runtime Checks to the Rescue! (JDK 26)

In JDK 26, I was able to implement aliasing runtime checks (JDK-8324751).

While in many cases we cannot prove non-aliasing at compile time, we can check for it at runtime (for a small overhead): we test whether the memory regions overlap, and only if the check passes, i.e. there is no overlap, do we enter the vectorized loop.

But how do we compute whether there is aliasing? For each pair of memory accesses that we need to reorder but whose non-aliasing we cannot prove at compile time, we emit a check that verifies there is no overlap. So far, all relevant memory accesses follow a linear form, so we can easily visualize some cases:

[Figure: memory access addresses as linear functions of the iteration variable i, forming bands that must not intersect]

We essentially have pairs of lines, or rather bands, since each memory access has both a starting pointer and a size. A little simplified, we get the following formulation:

  • start1(i) = a1 * i + b1, with some size1, where the access in the i’th iteration goes to [start1(i), start1(i) + size1).
  • start2(i) = a2 * i + b2, with some size2, where the access in the i’th iteration goes to [start2(i), start2(i) + size2).
  • For all i in the loop range:
    • start1(i) >= start2(i) + size2, or
    • start1(i) + size1 <= start2(i)

This condition checks that either the first access always happens above the second, or the first access always happens below the second, which implies that there cannot be any overlap.
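In Java-like pseudocode, this simplified condition could be evaluated as follows (just a sketch, not C2's actual implementation). Because start1(i) - start2(i) is linear and hence monotonic in i, checking the first and last iteration covers the whole range:

// Checks the simplified non-overlap condition for all i in [lo, hi).
static boolean neverOverlap(long a1, long b1, long size1,
                            long a2, long b2, long size2,
                            long lo, long hi) {
    long first1 = a1 * lo + b1, last1 = a1 * (hi - 1) + b1;
    long first2 = a2 * lo + b2, last2 = a2 * (hi - 1) + b2;
    // Access 1 always above access 2, or always below it:
    boolean alwaysAbove = first1 >= first2 + size2 && last1 >= last2 + size2;
    boolean alwaysBelow = first1 + size1 <= first2 && last1 + size1 <= last2;
    return alwaysAbove || alwaysBelow;
}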

It is reasonable to assume that this check will fail only rarely. But what should we do if it does fail? Many compilers resort to compiling two loops: one that is vectorized and one that stays scalar. I call this the multiversioning approach: we have multiple versions of the loop, which allows us to make assumptions for one loop (no aliasing) and optimize it more aggressively (vectorize), while the other loop makes no assumptions and therefore gets fewer optimizations. The downside of multiversioning is that we generate more code, which means we spend more time compiling and use more code space, which can hurt instruction cache locality.
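Conceptually, the multiversioned code has this shape (a sketch, reusing the hypothetical neverOverlap check from above):

if (neverOverlap(/* parameters of the two access patterns */)) {
    // fast loop: proven no aliasing, vectorized
} else {
    // slow loop: may alias, stays scalar
}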

But since C2 is a JIT compiler, we have a powerful tool available: deoptimization! We can speculate that the aliasing runtime check never fails. If the check passes, we enter the vectorized loop. But if the check fails, we deoptimize and go back to the interpreter. The benefit is that we spend less time compiling, use less code space, and get better instruction cache locality. The downside is that if our speculation is ever wrong, i.e. if we ever do encounter an aliasing case, we pay a high penalty: the very slow performance of the interpreter, which is much slower than the scalar loop we would have entered with multiversioning.

Neither solution on its own is fully satisfactory. That is why we implement a hybrid approach that combines both. At first, we use the speculation/deoptimization approach, in the hope that the aliasing assumption never fails; for most loops it never does. We get great performance, faster compilation, and good code cache locality. In the few cases where the assumption does fail, we pay a one-time penalty: we deoptimize, and eventually recompile with multiversioning. This way, we get the great vectorized performance for non-aliasing cases, and still very good performance from the compiled scalar loop in aliasing cases.

As we can see in the benchmarks of JDK-8324751, this leads to great speedups for a lot of loop shapes, especially copy patterns for arrays, and even more loop shapes that use multiple MemorySegment accesses. The speedup depends on the number of elements we can pack into the vectors. However, there are some 10-20% regressions for cases where we alias all the time (JDK-8366274). I suspect that at least part of the explanation is lower instruction cache locality, but I have not yet been able to do a thorough performance investigation. In my opinion this is an acceptable regression for such an edge case: as soon as any non-aliasing invocations of the loop are mixed in, the speedups from vectorization for those cases completely outweigh the small performance regression.

TODO: performance results

Aliasing and Vector API

If you are using the Vector API to speed up your code, you may have to consider the edge case of aliasing and implement solutions for it yourself.

import java.util.Arrays;
import jdk.incubator.vector.*;

public class Test {
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    public static void main(String[] args) {
        System.out.println("test1:");
        int[] x = generate();
        print(x);
        test1(x, x, 2, 20);
        print(x);
        System.out.println("test2:");
        int[] y = generate();
        print(y);
        test2(y, y, 2, 20);
        print(y);
    }

    public static int[] generate() {
        int[] a = new int[22];
        Arrays.setAll(a, i -> 100 + i);
        return a;
    }

    public static void print(int[] a) {
        System.out.print("[");
        for (int i = 0; i < a.length; i++) {
            System.out.print(" " + a[i]);
        }
        System.out.println("]");
    }

    public static void test1(int[] a, int[] b, int offset, int length) {
        // Scalar loop: every iteration loads one element and immediately
        // stores it, before the next iteration's load.
        for (int i = 0; i < length; i++) {
            b[i + offset] = a[i];
        }
    }

    public static void test2(int[] a, int[] b, int offset, int length) {
        // Vectorized loop: a whole vector is loaded before it is stored,
        // i.e. the loads are implicitly reordered before the stores.
        for (int i = 0; i < SPECIES.loopBound(length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            v.intoArray(b, i + offset);
        }
    }
}

Above, test2 looks like the straightforward vectorization of test1 using the Vector API. But the results are not the same, because of aliasing:

java --add-modules jdk.incubator.vector Test.java
test1:
[ 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121]
[ 100 101 100 101 100 101 100 101 100 101 100 101 100 101 100 101 100 101 100 101 100 101]
test2:
[ 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121]
[ 100 101 100 101 102 103 102 103 106 107 106 107 110 111 110 111 114 115 114 115 118 119]

Because offset = 2 and the arrays alias (x is passed as both a and b), test1 produces a repeating 2-element pattern. But in test2 we implicitly reordered the loads and stores, and since there is aliasing, this leads to different results.
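If we want the Vector API version to be safe, we have to add such a runtime check ourselves. Below is a minimal sketch: the bound SPECIES.length() is a conservative choice that keeps the stores at least one full vector away from the loads (and test2 still ignores the scalar tail, as in the example above):

public static void copy(int[] a, int[] b, int offset, int length) {
    // Vectorize only if the arrays are different, or the stores stay at
    // least one full vector length away from the loads.
    if (a != b || offset >= SPECIES.length() || offset <= -SPECIES.length()) {
        test2(a, b, offset, length);  // vectorized
    } else {
        test1(a, b, offset, length);  // scalar, always correct
    }
}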

Links

YouTube video that covers pointers and aliasing: Section on Pointers and Aliasing in JVMLS 2025 Talk
