I assume that you have already looked at Part 0, Part 1, Part 2, and Part 3.

In Part 4, we look at:

  • PhaseIdealLoop / loop optimizations.

Skip forward to Part 5

Introduction to Loop Optimizations

In Part 3, I already presented a similar overview.

PhaseIdealLoop analyses the loop structures and reshapes them. We do this work in multiple loop opts phases: in each phase we analyze the loops and transform them, then let IGVN clean up the graph, and then attempt another loop opts phase.

Here are some example optimizations:

  • Detection of loops, canonicalization to CountedLoop (loop trip-count phi does not overflow).
  • Turning long counted-loops into int counted-loops, and long RangeChecks into int RangeChecks if possible.
  • Remove empty loops.
  • Peeling: make a single-iteration copy of the loop, which is executed as straight-line code before the loop. That allows some invariant checks to be executed only in the peeled iteration and removed from the loop body.
  • Loop predication: move some checks (e.g. null-checks, RangeChecks) before the loop.
  • Unswitching: move a loop-invariant check before the loop and duplicate the loop: one copy keeps only the true-path, the other only the false-path (see the sketch after this list).
  • Pre-Main-Post-loop:
    • Unrolling of main loop.
    • Pre loop used for alignment (on architectures where strict alignment for vectorization is required).
    • Post loop used to handle left-over iterations.
    • RangeCheck elimination: main loop handles iterations where the RangeCheck is known to pass, pre and post loop handle the iterations before and after.
  • Auto vectorization (main loop).
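
To make the unswitching bullet above more concrete, here is a source-level sketch. C2 of course performs this transformation on its IR, not on Java source; the method and variable names are made up for illustration.

static void beforeUnswitching(int[] a, int[] b, boolean flag) {
    for (int i = 0; i < a.length; i++) {
        // "flag" is loop-invariant, but is tested in every iteration.
        if (flag) { a[i] = b[i] + 1; } else { a[i] = b[i] - 1; }
    }
}

static void afterUnswitching(int[] a, int[] b, boolean flag) {
    // The invariant check is hoisted before the loop, and the loop is duplicated:
    // one copy keeps only the true-path, the other only the false-path.
    if (flag) {
        for (int i = 0; i < a.length; i++) { a[i] = b[i] + 1; }
    } else {
        for (int i = 0; i < a.length; i++) { a[i] = b[i] - 1; }
    }
}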

Additionally, there are some “splitting” optimizations: we split operations (e.g. an add) through phis, i.e. we clone the operation into both merged branches, if our heuristic says that this is profitable (e.g. it can be constant folded on one path). We also split ifs through phis. We do these “splitting” operations during loop opts because here we have dominator information available, which we do not have during IGVN.
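
Here is a similarly hypothetical source-level sketch of splitting an add through a phi; the ternary plays the role of the phi merging two branches, and the names are made up:

static int beforeSplit(boolean cond, int x, int y) {
    int v = cond ? 0 : x;  // v corresponds to a phi(0, x)
    return v + y;          // add(phi(0, x), y)
}

static int afterSplit(boolean cond, int x, int y) {
    // The add has been cloned into both inputs of the phi;
    // the branch with the constant folds, since 0 + y == y.
    return cond ? y : (x + y);  // phi(y, add(x, y))
}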

Example with TraceLoopOpts

Let us look at the example from Part 3, and enable loop opts tracing.

public class Test {
    public static void main(String[] args) {
        // Especially with a debug build, the JVM startup can take a while,
        // so it can take a while until our code is executed.
        System.out.println("Run");

        // Repeatedly call the test method, so that it can become hot and
        // get JIT compiled.
        int[] array = new int[10_000];
        for (int i = 0; i < 10_000; i++) {
            test(array);
        }
        System.out.println("Done");
    }
    
    public static void test(int[] array) {
        // Add 42 to every element in the array: load, add, store
        for (int i = 0; i < array.length; i++) {
            array[i] += 42;
        }
    }
}

We run it with -XX:+TraceLoopOpts:

java -XX:CompileCommand=printcompilation,Test::* -XX:CompileCommand=compileonly,Test::test -Xbatch -XX:-UseOnStackReplacement -XX:+TraceLoopOpts Test.java

Run
5101   85    b  3       Test::test (23 bytes)
5137   86    b  4       Test::test (23 bytes)
Counted          Loop: N180/N156  counted [0,int),+1 (-1 iters) 
Loop: N0/N0  has_sfpt
  Loop: N179/N178  limit_check profile_predicated predicated
    Loop: N180/N156  limit_check profile_predicated predicated counted [0,int),+1 (-1 iters)  has_sfpt strip_mined
Predicate RC     Loop: N180/N156  limit_check profile_predicated predicated counted [0,int),+1 (9998 iters)  has_sfpt rce strip_mined
Loop: N0/N0  has_sfpt
  Loop: N179/N178  limit_check profile_predicated predicated sfpts={ 181 }
    Loop: N180/N156  limit_check profile_predicated predicated counted [0,int),+1 (9998 iters)  has_sfpt strip_mined
PreMainPost      Loop: N180/N156  limit_check profile_predicated predicated counted [0,int),+1 (9998 iters)  has_sfpt strip_mined
Unroll 2         Loop: N180/N156  limit_check counted [int,int),+1 (9998 iters)  main has_sfpt strip_mined
Loop: N0/N0  has_sfpt
  Loop: N290/N295  limit_check profile_predicated predicated counted [0,int),+1 (4 iters)  pre
  Loop: N179/N178  limit_check sfpts={ 181 }
    Loop: N386/N156  limit_check counted [int,int),+2 (9998 iters)  main has_sfpt strip_mined
  Loop: N241/N246  limit_check counted [int,int),+1 (4 iters)  post
Unroll 4         Loop: N386/N156  limit_check counted [int,int),+2 (9998 iters)  main has_sfpt strip_mined
...
Unroll 8         Loop: N457/N156  limit_check counted [int,int),+4 (9998 iters)  main has_sfpt strip_mined
...
Unroll 16         Loop: N551/N156  limit_check counted [int,int),+8 (9998 iters)  main has_sfpt strip_mined
...
PredicatesOff
Loop: N0/N0  has_sfpt
  Loop: N290/N295  counted [0,int),+1 (4 iters)  pre
  Loop: N179/N178  limit_check sfpts={ 181 }
    Loop: N672/N156  limit_check counted [int,int),+16 (9998 iters)  main has_sfpt strip_mined
  Loop: N241/N246  limit_check counted [int,int),+1 (4 iters)  post

VTransform::apply:
    Loop: N672/N156  limit_check counted [int,int),+16 (9998 iters)  main has_sfpt strip_mined
 672  CountedLoop  === 672 179 156  [[ 623 626 629 632 635 638 642 645 672 521 675 684 524 528 531 447 444 143 380 175 ]] inner stride: 16 main of N672 strip mined !orig=[551],[457],[386],[180],[171],[91] !jvms: Test::test @ bci:8 (line 21)
Loop: N0/N0  has_sfpt
  Loop: N290/N295  predicated counted [0,int),+1 (4 iters)  pre
  Loop: N179/N178  limit_check sfpts={ 181 }
    Loop: N672/N156  limit_check counted [int,int),+16 (9998 iters)  main vector has_sfpt strip_mined
  Loop: N241/N246  limit_check counted [int,int),+1 (4 iters)  post
PostVector      Loop: N672/N156  limit_check counted [int,int),+16 (9998 iters)  main vector has_sfpt strip_mined
Unroll 32         Loop: N672/N156  limit_check counted [int,int),+16 (9998 iters)  main vector has_sfpt strip_mined
Loop: N0/N0  has_sfpt
  Loop: N290/N295  predicated counted [0,int),+1 (4 iters)  pre
  Loop: N179/N178  limit_check sfpts={ 181 }
    Loop: N832/N156  limit_check counted [int,int),+32 (9998 iters)  main vector has_sfpt strip_mined
  Loop: N786/N788  counted [int,int),+16 (16 iters)  post vector
  Loop: N241/N246  limit_check counted [int,int),+1 (4 iters)  post
Unroll 64         Loop: N832/N156  limit_check counted [int,int),+32 (9998 iters)  main vector has_sfpt strip_mined
...
Unroll 128         Loop: N898/N156  limit_check counted [int,int),+64 (9998 iters)  main vector has_sfpt strip_mined
Loop: N0/N0  has_sfpt
  Loop: N290/N295  predicated counted [0,int),+1 (4 iters)  pre
  Loop: N179/N178  limit_check sfpts={ 181 }
    Loop: N986/N156  limit_check counted [int,int),+128 (9998 iters)  main vector has_sfpt strip_mined
  Loop: N786/N788  counted [int,int),+16 (16 iters)  post vector
  Loop: N241/N246  limit_check counted [int,int),+1 (4 iters)  post
Done

Let us make a few observations about the logs above:

  • Counted Loop: N180/N156 counted [0,int),+1 (-1 iters)
    • We detect the loop as a counted loop. That is the prerequisite for many other loop optimizations. We see that it iterates over the range [0,int), i.e. it starts at zero and counts up to some limit that is not a compile-time constant.
  • Predicate RC Loop: N180/N156 limit_check profile_predicated predicated counted [0,int),+1 (9998 iters) has_sfpt rce strip_mined
    • RangeCheck elimination using predicates before the loop.
  • PreMainPost: generation of pre, main and post loops. The pre loop prepares for the main loop, the post loop cleans up any remaining iterations. This has at least two purposes: the pre loop can align vector memory accesses for the main loop by adjusting the number of iterations spent in the pre loop, and in some cases RangeCheck elimination requires us to spend all iterations where the RangeCheck cannot be eliminated in the pre and post loop, ensuring that we do not need to execute RangeChecks in the main loop. See the sketch after this list for the resulting overall loop structure.
  • Unroll 2 … Unroll 16: we repeatedly unroll the loop, doubling the stride (i.e. the work per iteration) each time. Here, we pick an unrolling factor of 16 because I have an AVX512 machine that can fit 16 ints into a 64 byte vector register. On your machine this may differ, try it out!
  • VTransform::apply: this is a note from the auto-vectorizer, showing that we are applying the VTransform, i.e. we are replacing scalar operations with vector operations.
  • Loop: N672/N156 limit_check counted [int,int),+16 (9998 iters) main vector has_sfpt strip_mined
    • We see that the main loop is now annotated with vector, since it has been vectorized.
  • PostVector: we introduce a vectorized copy of the main loop as a drain loop; it becomes useful once we later super-unroll the main loop, because left-over iterations can then still be processed with vector instructions.
  • Unroll 32 … Unroll 128: we further unroll the vectorized main loop, to saturate the CPU pipeline with more vector instructions.
  • Loop: N986/N156 limit_check counted [int,int),+128 (9998 iters) main vector has_sfpt strip_mined
    • The main loop now covers 16 * 8 = 128 original iterations per iteration (stride +128).
  • Loop: N786/N788 counted [int,int),+16 (16 iters) post vector
    • The vectorized drain loop covers 16 original iterations per iteration (stride +16).
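
Putting these observations together, the final loop structure corresponds roughly to the following source-level sketch. This is only an illustration: the limit parameters are made-up stand-ins for limits that C2 computes, the vector bodies are only indicated in comments, and details such as strip mining, predicates and zero-trip guards are omitted.

static void testAfterLoopOpts(int[] array, int preLimit, int mainLimit, int drainLimit) {
    int i = 0;
    // Pre loop: a few scalar iterations (4 in the trace above).
    for (; i < preLimit; i++) { array[i] += 42; }
    // Main loop: unrolled and vectorized, covers 128 original iterations per iteration.
    for (; i < mainLimit; i += 128) { /* 8 x (16-lane vector load, add 42, store) */ }
    // Vectorized post ("drain") loop: 16 original iterations per iteration.
    for (; i < drainLimit; i += 16) { /* 1 x (16-lane vector load, add 42, store) */ }
    // Scalar post loop: handles whatever iterations remain.
    for (; i < array.length; i++) { array[i] += 42; }
}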

More Details about Loop Optimizations

At first, we only execute three rounds of loop optimizations. Then we perform CCP + IGVN, because the first loop optimizations such as peeling and unrolling can often allow us to narrow down types or constant fold control flow inside the loop. After that, we perform many more rounds of loop optimizations, until no loop can be optimized further or we hit the limit on loop optimization rounds.

Here is an overview of PhaseIdealLoop::build_and_optimize:

  • build_loop_tree: analyze the loop structures: detect loops and compute dominator information.
  • beautify_loops: canonicalize the loop structures; build_loop_tree needs to be rerun afterwards.
  • build_loop_early: For a data node, we compute the earliest control node possible, i.e. the “highest” placement in the graph.
  • counted_loop: calls is_counted_loop which tries to convert LoopNode loops to CountedLoopNode loops, i.e. they are converted to a canonical counted loop shape. A critical guarantee for counted loops is that the iv phi will never overflow at runtime.
  • eliminate_useless_zero_trip_guard: If a post or main loop is removed due to an assert predicate, the opaque that guards the loop is not needed anymore.
  • process_expensive_nodes: “expensive” nodes have a control input to force them to be only on certain paths and not slow down others. We need loop information to common up “expensive” nodes while not slowing down any path.
  • eliminate_useless_predicates: Remove predicates that we know will never be used.
  • do_maximally_unroll: take steps towards full unrolling of the loop, if the exact trip-count is known and low enough.
  • bs->optimize_loops: GC barrier optimizations, skip.
  • reassociate_invariants: reassociate invariants (a canonicalization) in preparation for split_thru_phi (later optimization).
  • split_if_with_blocks: This is a large “grab-bag” set of split-through optimizations, including split_thru_phi (e.g. add(phi(x,y), z) -> phi(add(x, z), add(y, z))).
  • loop_predication: Insert hoisted check predicates for null checks and range checks etc.
  • do_intrinsify_fill: Detect loop shapes that fill arrays, and replace them with a call to an array fill intrinsic (an example loop shape is sketched after this list).
  • iteration_split: perform various iteration splitting transformations:
    • compute_trip_count: Compute the trip count if possible.
    • do_one_iteration_loop: Convert a one iteration loop into straight-line code, i.e. remove the loop structure.
    • do_remove_empty_loop: remove loops that have nothing in the loop body.
    • partial_peel: partially peel the top portion of a loop, copying it to the loop entry and the backedge. This “rotates” the loop. The goal is to bring the exit test to the end of the loop, and to create a canonical (possibly counted) loop form.
    • do_peeling: Peel the first iteration of a loop. Make a single-iteration copy of the loop, which is executed as straight-line code before the loop. That allows some invariant checks to be executed only in the peeled iteration and removed from the loop body (a source-level sketch follows after this list).
    • do_unswitching: move a loop-invariant check before the loop and copy the loop: the check inside the copies folds to true in the true-loop and to false in the false-loop. This effectively removes the check from the loop bodies, hoisting it before the loop.
    • duplicate_loop_backedge: Sometimes multiple paths inside the loop merge (region) just before the exit check and backedge. It can be beneficial to split this region and duplicate the exit check, so that an inner (hopefully counted) loop and an outer loop are created.
    • create_loop_nest: Convert long counted loops to int counted loops. If possible, convert the long RangeChecks to int RangeChecks, which makes RangeCheck Elimination easier later on.
    • compute_profile_trip_cnt: Compute the loop trip count from profile data.
    • insert_pre_post_loops: Convert a normal counted loop into a pre, main, and post loop structure.
      • The pre loop is used if strict alignment is required for vector memory accesses in the main loop: we keep iterating in the pre loop until we achieve alignment.
      • The main loop is unrolled and possibly vectorized.
      • The post loop handles any remaining iterations after the unrolled main loop.
      • The pre and post loop can also be used to do RangeCheck Elimination: we remove the RangeCheck for the main loop, and execute any iterations where the RangeCheck could be out of bounds in the pre and post loop.
    • do_range_check: Remove RangeChecks and other trip-counter vs loop-invariant tests. We do this using predicates before the loop, which deopt if they fail.
    • insert_vector_post_loop: insert the vectorized drain loop, so we can super-unroll the vectorized main loop, and the vectorized drain loop can handle remaining iterations faster than just the post loop.
  • auto_vectorize: Auto-vectorize loops (must already have been pre-main-post-ed, and unrolled).
  • mark_parse_predicate_nodes_useless: Up to now, all optimizations with predicates should have been performed. Remove the predicates, and maybe some loop optimizations that do not need predicates are now unlocked.
  • Note: major_progress gets set as soon as a loop optimization is performed that breaks the loop structure, and would require IGVN to be rerun to clean the graph, and build_loop_tree to recompute the loop structures.
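
As an illustration of do_peeling, here is a source-level sketch of peeling with a loop-invariant null check. Again this is illustrative Java, not the actual IR transformation; the class Box and its field are made up.

static class Box { int value; }

static void beforePeeling(int[] a, Box box) {
    for (int i = 0; i < a.length; i++) {
        // The implicit null check on "box" is loop-invariant, yet sits inside the loop body.
        a[i] += box.value;
    }
}

static void afterPeeling(int[] a, Box box) {
    if (0 < a.length) {
        // Peeled first iteration, executed as straight-line code before the loop.
        a[0] += box.value;  // the invariant null check is executed here, once
        for (int i = 1; i < a.length; i++) {
            // Inside the loop, the null check is dominated by the peeled copy
            // and can be removed from the loop body.
            a[i] += box.value;
        }
    }
}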
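
And as an illustration of do_intrinsify_fill, here is the kind of loop shape that can be recognized and replaced by an array fill intrinsic (whether this actually happens depends on flags and platform support; the method name is made up):

static void fillWith(int[] a, int v) {
    // Every iteration stores the same loop-invariant value into consecutive elements.
    // Such a loop can be replaced by a single call to an array-fill stub.
    for (int i = 0; i < a.length; i++) {
        a[i] = v;
    }
}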

TODO Talk about get / set ctrl “side-table”. Probably also idom etc. These are the basic loop structure data structures. And about major progress.

Continue with Part 5

Please leave a comment below

To edit/delete a comment: click on the date above your comment, e.g. just now or 5 minutes ago. This takes you to the GitHub issue page associated with this blog post. Find your comment, and edit/delete it by clicking the three dots ... on the top right.