On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <gal...@openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in >> order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of >> these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for >> MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can >> carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt >> strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] >> inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] >> !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in >> https://github.com/openjdk/jdk/pull/13056, we can compare the results before >> and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the >> MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into >> CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these >> results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarreño has updated the pull request with a new target base due to a > merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 44 additional > commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/abdd4f5e...a190ae68 > > Re: [#20098 > > (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) > > - I was trying to think what could be causing this. > > Maybe it is an issue with probabilities? Do you know at what point (if at > all) the `MinI` node appears/disappears in that example? The probabilities are fine. I think the issue with `Math.min(II)` seems to be specific to when its compilation happens, and the combined fact that the intrinsic has been disabled and vectorization does not kick in (explicitly disabled). Note that other parts of the JDK invoke `Math.min(II)`. In the slow cases it appears the compilation happens before the benchmark kicks in, and so it takes the profiling data before the benchmark to decide how to compile this in. In the slow versions you see this `PrintMethodData`: static java.lang.Math::min(II)I interpreter_invocation_count: 18171 invocation_counter: 18171 backedge_counter: 0 decompile_count: 0 mdo size: 328 bytes 0 iload_0 1 iload_1 2 if_icmpgt 9 0 bci: 2 BranchData taken(7732) displacement(56) not taken(10180) 5 iload_0 6 goto 10 32 bci: 6 JumpData taken(10180) displacement(24) 9 iload_1 10 ireturn org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I interpreter_invocation_count: 189 invocation_counter: 189 backedge_counter: 313344 decompile_count: 0 mdo size: 384 bytes 0 iconst_0 1 istore_2 2 iconst_0 3 istore_3 4 iload_3 5 aload_1 6 fast_igetfield 35 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.size:I> 9 if_icmpge 33 0 bci: 9 BranchData taken(58) displacement(72) not taken(192512) 12 aload_1 13 fast_agetfield 41 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.minIntA:[I> 16 iload_3 17 iaload 18 istore #4 20 iload_2 21 fast_iload #4 23 invokestatic 32 <java/lang/Math.min(II)I> 32 bci: 23 CounterData count(192512) 26 istore_2 27 iinc #3 1 30 goto 4 48 bci: 30 JumpData taken(192512) displacement(-48) 33 iload_2 34 ireturn The benchmark method calls Math.min `192_512` times, yet the method data shows only `18_171` invocations, of which `7_732` are taken which is 42%. So it gets compiled with a `cmov` and the benchmark will be slow because it will branch 100% one of the sides. In the fast version, `PrintMethodData` looks like this: static java.lang.Math::min(II)I interpreter_invocation_count: 1575322 invocation_counter: 1575322 backedge_counter: 0 decompile_count: 0 mdo size: 368 bytes 0 iload_0 1 iload_1 2 if_icmpgt 9 0 bci: 2 BranchData taken(1418001) displacement(56) not taken(157062) 5 iload_0 6 goto 10 32 bci: 6 JumpData taken(157062) displacement(24) 9 iload_1 10 ireturn org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I interpreter_invocation_count: 858 invocation_counter: 858 backedge_counter: 1756214 decompile_count: 0 mdo size: 424 bytes 0 iconst_0 1 istore_2 2 iconst_0 3 istore_3 4 iload_3 5 aload_1 6 fast_igetfield 35 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.size:I> 9 if_icmpge 33 0 bci: 9 BranchData taken(733) displacement(72) not taken(1637363) 12 aload_1 13 fast_agetfield 41 <org/openjdk/bench/java/lang/MinMaxVector$LoopState.minIntA:[I> 16 iload_3 17 iaload 18 istore #4 20 iload_2 21 fast_iload #4 23 invokestatic 32 <java/lang/Math.min(II)I> 32 bci: 23 CounterData count(1637363) 26 istore_2 27 iinc #3 1 30 goto 4 48 bci: 30 JumpData taken(1637363) displacement(-48) 33 iload_2 34 ireturn The benchmark method calls Math.min `1_637_363` times, and the method data shows `1_575_322` invocations, of which `1_418_001` are taken which is 90%. So no cmov is introduced and the benchmark will be fast because it will branch 100% one of the sides. A factor here might be my Xeon machine. I run the benchmark on a 4 core VM inside it, so given the limited resources compilation can take longer. I've noticed that it's easier to replicate this scenario there rather than my M1 laptop, which has 10 cores. >> So, if those int scalar regressions were not a problem when int min/max >> intrinsic was added, I would expect the same to apply to long. > > Do you know when they were added? If that was a long time ago, we might not > have noticed back then, but we might notice now. I don't know when they were added. > That said: if we know that it is only in the high-probability cases, then we > can address those separately. I would not consider it a blocking issue, as > long as we file the follow-up RFE for int/max scalar case with high branch > probability. > > What would be really helpful: a list of all regressions / issues, and how we > intend to deal with them. If we later find a regression that someone cares > about, then we can come back to that list, and justify the decision we made > here. I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2684701935