On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <gal...@openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in
>> order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of
>> these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the Java implementation of these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with
>> MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can
>> carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]]
>> inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193]
>> !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in
>> https://github.com/openjdk/jdk/pull/13056, we can compare the results before
>> and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>>    TEST                          TOTAL  PASS  FAIL  ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                      1     1     0      0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>>    TEST                          TOTAL  PASS  FAIL  ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                      1     1     0      0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add any platform-specific backend implementations for the
>> MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into
>> CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these
>> results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>>    TEST                          TOTAL  PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a
> merge or a rebase. The incremental webrev excludes the unrelated changes
> brought in by the merge/rebase. The pull request contains 44 additional
> commits since the last revision:
>
>  - Merge branch 'master' into topic.intrinsify-max-min-long
>  - Fix typo
>  - Renaming methods and variables and add docu on algorithms
>  - Fix copyright years
>  - Make sure it runs with cpus with either avx512 or asimd
>  - Test can only run with 256 bit registers or bigger
>
>    * Remove platform dependant check
>      and use platform independent configuration instead.
>  - Fix license header
>  - Tests should also run on aarch64 asimd=true envs
>  - Added comment around the assertions
>  - Adjust min/max identity IR test expectations after changes
>  - ... and 34 more: https://git.openjdk.org/jdk/compare/af7645e5...a190ae68

To follow up https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851, I've run the `MinMaxVector.int` benchmarks with **superword disabled** and with/without the `_max`/`_min` intrinsics, in both AVX-512 and AVX2 modes.
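For readers without the benchmark source at hand, the reduction kernels under test follow the usual JMH reduction pattern. This is a minimal sketch of an `intReductionSimpleMin`-style loop; the class and method shape here are my assumption, not the actual `MinMaxVector` source:

```java
// Hedged sketch of an intReductionSimpleMin-style kernel (assumed shape, not
// the actual org.openjdk.bench.java.lang.MinMaxVector source).
public class MinReductionSketch {

    // Reduction over Math.min: the (a <= b) ? a : b control flow inside
    // Math.min is exactly what the _min intrinsic replaces with a Min node,
    // removing the in-loop branch that blocks superword.
    static int intReductionSimpleMin(int[] a) {
        int result = Integer.MAX_VALUE;
        for (int v : a) {
            result = Math.min(result, v);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 9, -2, 7};
        System.out.println(intReductionSimpleMin(data)); // prints -2
    }
}
```

The `(probability)` parameter in the tables below controls how often the branch inside `Math.min`/`Math.max` takes the same direction, which is what makes the branchy and `cmov` flavours diverge.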
# AVX-512

Benchmark                              (probability)  (range)  (seed)  (size)  Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intClippingRange                    N/A       90       0    1000  thrpt   4   1067.050   1038.640  ops/ms
MinMaxVector.intClippingRange                    N/A      100       0    1000  thrpt   4   1041.922   1039.004  ops/ms
MinMaxVector.intLoopMax                           50      N/A     N/A    2048  thrpt   4    605.173    604.337  ops/ms
MinMaxVector.intLoopMax                           80      N/A     N/A    2048  thrpt   4    605.106    604.309  ops/ms
MinMaxVector.intLoopMax                          100      N/A     N/A    2048  thrpt   4    604.547    604.432  ops/ms
MinMaxVector.intLoopMin                           50      N/A     N/A    2048  thrpt   4    495.042    605.216  ops/ms  (+22%)
MinMaxVector.intLoopMin                           80      N/A     N/A    2048  thrpt   4    495.105    495.217  ops/ms
MinMaxVector.intLoopMin                          100      N/A     N/A    2048  thrpt   4    495.040    495.176  ops/ms
MinMaxVector.intReductionMultiplyMax              50      N/A     N/A    2048  thrpt   4    407.920    407.984  ops/ms
MinMaxVector.intReductionMultiplyMax              80      N/A     N/A    2048  thrpt   4    407.710    407.965  ops/ms
MinMaxVector.intReductionMultiplyMax             100      N/A     N/A    2048  thrpt   4    874.881    407.922  ops/ms  (-53%)
MinMaxVector.intReductionMultiplyMin              50      N/A     N/A    2048  thrpt   4    407.911    407.947  ops/ms
MinMaxVector.intReductionMultiplyMin              80      N/A     N/A    2048  thrpt   4    408.015    408.024  ops/ms
MinMaxVector.intReductionMultiplyMin             100      N/A     N/A    2048  thrpt   4    407.978    407.994  ops/ms
MinMaxVector.intReductionSimpleMax                50      N/A     N/A    2048  thrpt   4    460.538    460.439  ops/ms
MinMaxVector.intReductionSimpleMax                80      N/A     N/A    2048  thrpt   4    460.579    460.542  ops/ms
MinMaxVector.intReductionSimpleMax               100      N/A     N/A    2048  thrpt   4    998.211    460.404  ops/ms  (-53%)
MinMaxVector.intReductionSimpleMin                50      N/A     N/A    2048  thrpt   4    460.570    460.447  ops/ms
MinMaxVector.intReductionSimpleMin                80      N/A     N/A    2048  thrpt   4    460.552    460.493  ops/ms
MinMaxVector.intReductionSimpleMin               100      N/A     N/A    2048  thrpt   4    460.455    460.485  ops/ms

There is some improvement in `intLoopMin` @ 50%, but it didn't materialize in the `perfasm` run, so I don't think it can strictly be correlated with the use (or not) of the intrinsic.
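As an aside on the table above: `intClippingRange` presumably clamps each element with a min/max pair. A hypothetical sketch of such a kernel (the method shape and names are my assumption, not the actual benchmark source):

```java
// Hypothetical sketch of an intClippingRange-style kernel (assumed shape,
// not the actual MinMaxVector benchmark source): clamp every element of
// the array into [lo, hi] using a Math.max/Math.min pair.
public class ClipSketch {

    static void clip(int[] a, int lo, int hi) {
        for (int i = 0; i < a.length; i++) {
            // Two branchy calls per element in source form; with the
            // intrinsics these become Min/Max nodes, a superword candidate.
            a[i] = Math.min(Math.max(a[i], lo), hi);
        }
    }

    public static void main(String[] args) {
        int[] data = {-5, 42, 150};
        clip(data, 0, 100);
        System.out.println(java.util.Arrays.toString(data)); // prints [0, 42, 100]
    }
}
```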
The `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% regressions with the `max` intrinsic activated are consistent with what we saw with long.

### `intReductionMultiplyMin` and `intReductionSimpleMin` @ 100%: same performance

There is something very intriguing happening here, and I don't know whether it's due to min itself or to int vs long. Basically, with or without the `min` intrinsic, the performance of these 2 benchmarks is the same at 100% branch probability. What is going on? Let's look at one of them:

-min

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_max,_min -XX:-UseSuperWord
...
 3.04%  ││││ │  0x00007f49280f76e9: cmpl %edi, %r10d
 3.14%  ││││ │  0x00007f49280f76ec: cmovgl %edi, %r10d  ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
        ││││ │                                          ; - java.lang.Math::min@10 (line 2119)
        ││││ │                                          ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin@23 (line 212)
        ││││ │                                          ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub@19 (line 124)

+min

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
...
 3.10%  ││ │  0x00007fbf340f6b97: cmpl %edi, %r10d
 3.08%  ││ │  0x00007fbf340f6b9a: cmovgl %edi, %r10d  ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
        ││ │                                          ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin@23 (line 212)
        ││ │                                          ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub@19 (line 124)

Both are `cmov`. You can see how, without the intrinsic, the `Math::min` bytecode still gets compiled into a `cmov`, and the same happens with the intrinsic. I will verify this with long shortly to see whether this behaviour is specific to the `min` operation or something to do with int vs long.
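For readers less familiar with the distinction in the disassembly above, here is a source-level analogy (illustrative only; the JIT's choice of `cmov` is not visible in Java source) of a branchy min versus a branch-free min, which is roughly the trade the machine makes between a predicted branch and a `cmov`:

```java
// Branchy vs branch-free min in Java source, as an analogy for what a cmov
// encodes at the machine level. Illustrative only: the JIT decides between
// compare+branch and cmov on its own, regardless of how the source is written.
public class MinForms {

    // What Math.min compiles from: a compare plus a conditional branch in
    // bytecode. Cheap when the branch is predictable (e.g. 100% probability).
    static int branchyMin(int a, int b) {
        return (a <= b) ? a : b;
    }

    // Branch-free: (a - b) >> 31 is all-ones iff a < b (assuming a - b does
    // not overflow), so the mask selects either the difference or zero.
    // Data-dependent like a cmov: no branch to predict, but no prediction win.
    static int branchFreeMin(int a, int b) {
        int diff = a - b;
        return b + (diff & (diff >> 31));
    }

    public static void main(String[] args) {
        System.out.println(branchyMin(3, 7));     // prints 3
        System.out.println(branchFreeMin(3, 7));  // prints 3
    }
}
```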
# AVX2

Here are the AVX2 numbers:

Benchmark                              (probability)  (range)  (seed)  (size)  Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intClippingRange                    N/A       90       0    1000  thrpt   4   1068.265   1039.087  ops/ms
MinMaxVector.intClippingRange                    N/A      100       0    1000  thrpt   4   1067.705   1038.760  ops/ms
MinMaxVector.intLoopMax                           50      N/A     N/A    2048  thrpt   4    605.015    604.364  ops/ms
MinMaxVector.intLoopMax                           80      N/A     N/A    2048  thrpt   4    605.169    604.366  ops/ms
MinMaxVector.intLoopMax                          100      N/A     N/A    2048  thrpt   4    604.527    604.494  ops/ms
MinMaxVector.intLoopMin                           50      N/A     N/A    2048  thrpt   4    605.099    605.057  ops/ms
MinMaxVector.intLoopMin                           80      N/A     N/A    2048  thrpt   4    495.071    605.080  ops/ms  (+22%)
MinMaxVector.intLoopMin                          100      N/A     N/A    2048  thrpt   4    495.134    495.047  ops/ms
MinMaxVector.intReductionMultiplyMax              50      N/A     N/A    2048  thrpt   4    407.953    407.987  ops/ms
MinMaxVector.intReductionMultiplyMax              80      N/A     N/A    2048  thrpt   4    407.861    408.005  ops/ms
MinMaxVector.intReductionMultiplyMax             100      N/A     N/A    2048  thrpt   4    873.915    407.995  ops/ms  (-53%)
MinMaxVector.intReductionMultiplyMin              50      N/A     N/A    2048  thrpt   4    408.019    407.987  ops/ms
MinMaxVector.intReductionMultiplyMin              80      N/A     N/A    2048  thrpt   4    407.971    408.009  ops/ms
MinMaxVector.intReductionMultiplyMin             100      N/A     N/A    2048  thrpt   4    407.970    407.956  ops/ms
MinMaxVector.intReductionSimpleMax                50      N/A     N/A    2048  thrpt   4    460.443    460.514  ops/ms
MinMaxVector.intReductionSimpleMax                80      N/A     N/A    2048  thrpt   4    460.484    460.581  ops/ms
MinMaxVector.intReductionSimpleMax               100      N/A     N/A    2048  thrpt   4   1015.601    460.446  ops/ms  (-54%)
MinMaxVector.intReductionSimpleMin                50      N/A     N/A    2048  thrpt   4    460.494    460.532  ops/ms
MinMaxVector.intReductionSimpleMin                80      N/A     N/A    2048  thrpt   4    460.489    460.451  ops/ms
MinMaxVector.intReductionSimpleMin               100      N/A     N/A    2048  thrpt   4   1021.420    460.435  ops/ms  (-55%)

This time we see an improvement in `intLoopMin` @ 80%, but again it was not observable in the `perfasm` run. `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% have regressions, the familiar one of cmp+mov vs cmov.
`intReductionMultiplyMin` @ 100% does not show a regression, for the same reason as above: both variants use `cmov`. The interesting case is `intReductionSimpleMin` @ 100%. We see a regression there, but I didn't observe it in the `perfasm` run. So, could this be due to variance in whether `cmov` gets applied or not?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670609470