On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <gal...@openjdk.org> wrote:
>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in
>> order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of
>> these calls because of the following error:
>>
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>>
>> The control flow is due to the Java implementation of these methods, e.g.
>>
>>
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>>
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with
>> MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can
>> carry out the vectorization.
>> E.g.
>>
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]]
>> inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193]
>> !jvms: Test::test @ bci:14 (line 21)
>>
>>
>> Applying the same changes to `ReductionPerf` as in
>> https://github.com/openjdk/jdk/pull/13056, we can compare the results before
>> and after. Before the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>>    TEST                          TOTAL  PASS  FAIL  ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                      1     1     0      0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>>
>> ==============================
>> Test summary
>> ==============================
>>    TEST                          TOTAL  PASS  FAIL  ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                      1     1     0      0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>>
>> This patch does not add any platform-specific backend implementations for the
>> MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into
>> CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these
>> results:
>>
>>
>> ==============================
>> Test summary
>> ==============================
>>    TEST                          TOTAL  PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a
> merge or a rebase. The incremental webrev excludes the unrelated changes
> brought in by the merge/rebase. The pull request contains 44 additional
> commits since the last revision:
>
>  - Merge branch 'master' into topic.intrinsify-max-min-long
>  - Fix typo
>  - Renaming methods and variables and add docu on algorithms
>  - Fix copyright years
>  - Make sure it runs with cpus with either avx512 or asimd
>  - Test can only run with 256 bit registers or bigger
>
>    * Remove platform dependant check
>      and use platform independent configuration instead.
>  - Fix license header
>  - Tests should also run on aarch64 asimd=true envs
>  - Added comment around the assertions
>  - Adjust min/max identity IR test expectations after changes
>  - ... and 34 more: https://git.openjdk.org/jdk/compare/af7645e5...a190ae68

To follow up https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851, I've run the `MinMaxVector.int` benchmarks with **superword disabled** and with/without the `_max`/`_min` intrinsics, in both AVX-512 and AVX2 modes.
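For readers without the benchmark source at hand, the reduction kernels under test follow the usual JMH reduction pattern. This is a minimal sketch of an `intReductionSimpleMin`-style loop; the class and method shape here are my assumption, not the actual `MinMaxVector` source:

```java
// Hedged sketch of an intReductionSimpleMin-style kernel (assumed shape, not
// the actual org.openjdk.bench.java.lang.MinMaxVector source).
public class MinReductionSketch {

    // Reduction over Math.min: the (a <= b) ? a : b control flow inside
    // Math.min is exactly what the _min intrinsic replaces with a Min node,
    // removing the in-loop branch that blocks superword.
    static int intReductionSimpleMin(int[] a) {
        int result = Integer.MAX_VALUE;
        for (int v : a) {
            result = Math.min(result, v);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 9, -2, 7};
        System.out.println(intReductionSimpleMin(data)); // prints -2
    }
}
```

The `(probability)` parameter in the tables below controls how often the branch inside `Math.min`/`Math.max` takes the same direction, which is what makes the branchy and `cmov` flavours diverge.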
# AVX-512

Benchmark                              (probability)  (range)  (seed)  (size)  Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intClippingRange                    N/A       90       0    1000  thrpt   4   1067.050   1038.640  ops/ms
MinMaxVector.intClippingRange                    N/A      100       0    1000  thrpt   4   1041.922   1039.004  ops/ms
MinMaxVector.intLoopMax                           50      N/A     N/A    2048  thrpt   4    605.173    604.337  ops/ms
MinMaxVector.intLoopMax                           80      N/A     N/A    2048  thrpt   4    605.106    604.309  ops/ms
MinMaxVector.intLoopMax                          100      N/A     N/A    2048  thrpt   4    604.547    604.432  ops/ms
MinMaxVector.intLoopMin                           50      N/A     N/A    2048  thrpt   4    495.042    605.216  ops/ms  (+22%)
MinMaxVector.intLoopMin                           80      N/A     N/A    2048  thrpt   4    495.105    495.217  ops/ms
MinMaxVector.intLoopMin                          100      N/A     N/A    2048  thrpt   4    495.040    495.176  ops/ms
MinMaxVector.intReductionMultiplyMax              50      N/A     N/A    2048  thrpt   4    407.920    407.984  ops/ms
MinMaxVector.intReductionMultiplyMax              80      N/A     N/A    2048  thrpt   4    407.710    407.965  ops/ms
MinMaxVector.intReductionMultiplyMax             100      N/A     N/A    2048  thrpt   4    874.881    407.922  ops/ms  (-53%)
MinMaxVector.intReductionMultiplyMin              50      N/A     N/A    2048  thrpt   4    407.911    407.947  ops/ms
MinMaxVector.intReductionMultiplyMin              80      N/A     N/A    2048  thrpt   4    408.015    408.024  ops/ms
MinMaxVector.intReductionMultiplyMin             100      N/A     N/A    2048  thrpt   4    407.978    407.994  ops/ms
MinMaxVector.intReductionSimpleMax                50      N/A     N/A    2048  thrpt   4    460.538    460.439  ops/ms
MinMaxVector.intReductionSimpleMax                80      N/A     N/A    2048  thrpt   4    460.579    460.542  ops/ms
MinMaxVector.intReductionSimpleMax               100      N/A     N/A    2048  thrpt   4    998.211    460.404  ops/ms  (-53%)
MinMaxVector.intReductionSimpleMin                50      N/A     N/A    2048  thrpt   4    460.570    460.447  ops/ms
MinMaxVector.intReductionSimpleMin                80      N/A     N/A    2048  thrpt   4    460.552    460.493  ops/ms
MinMaxVector.intReductionSimpleMin               100      N/A     N/A    2048  thrpt   4    460.455    460.485  ops/ms

There is some improvement in `intLoopMin` @ 50%, but it didn't materialize in the `perfasm` run, so I don't think it can strictly be correlated with the use (or not) of the intrinsic.
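As an aside on the table above: `intClippingRange` presumably clamps each element with a min/max pair. A hypothetical sketch of such a kernel (the method shape and names are my assumption, not the actual benchmark source):

```java
// Hypothetical sketch of an intClippingRange-style kernel (assumed shape,
// not the actual MinMaxVector benchmark source): clamp every element of
// the array into [lo, hi] using a Math.max/Math.min pair.
public class ClipSketch {

    static void clip(int[] a, int lo, int hi) {
        for (int i = 0; i < a.length; i++) {
            // Two branchy calls per element in source form; with the
            // intrinsics these become Min/Max nodes, a superword candidate.
            a[i] = Math.min(Math.max(a[i], lo), hi);
        }
    }

    public static void main(String[] args) {
        int[] data = {-5, 42, 150};
        clip(data, 0, 100);
        System.out.println(java.util.Arrays.toString(data)); // prints [0, 42, 100]
    }
}
```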
The `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% regressions with the `max` intrinsic activated are consistent with what we saw with long.

### `intReductionMultiplyMin` and `intReductionSimpleMin` @ 100%: same performance

There is something very intriguing happening here, and I don't know whether it's due to min itself or to int vs long. Basically, with or without the `min` intrinsic, the performance of these 2 benchmarks is the same at 100% branch probability. What is going on? Let's look at one of them:

-min

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_max,_min -XX:-UseSuperWord
...
 3.04%  ││││ │  0x00007f49280f76e9: cmpl %edi, %r10d
 3.14%  ││││ │  0x00007f49280f76ec: cmovgl %edi, %r10d  ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
        ││││ │                                          ; - java.lang.Math::min@10 (line 2119)
        ││││ │                                          ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin@23 (line 212)
        ││││ │                                          ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub@19 (line 124)

+min

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
...
 3.10%  ││ │  0x00007fbf340f6b97: cmpl %edi, %r10d
 3.08%  ││ │  0x00007fbf340f6b9a: cmovgl %edi, %r10d  ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
        ││ │                                          ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin@23 (line 212)
        ││ │                                          ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub@19 (line 124)

Both are `cmov`. You can see how, without the intrinsic, the `Math::min` bytecode still gets compiled into a `cmov`, and the same happens with the intrinsic. I will verify this with long shortly to see whether this behaviour is specific to the `min` operation or something to do with int vs long.
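For readers less familiar with the distinction in the disassembly above, here is a source-level analogy (illustrative only; the JIT's choice of `cmov` is not visible in Java source) of a branchy min versus a branch-free min, which is roughly the trade the machine makes between a predicted branch and a `cmov`:

```java
// Branchy vs branch-free min in Java source, as an analogy for what a cmov
// encodes at the machine level. Illustrative only: the JIT decides between
// compare+branch and cmov on its own, regardless of how the source is written.
public class MinForms {

    // What Math.min compiles from: a compare plus a conditional branch in
    // bytecode. Cheap when the branch is predictable (e.g. 100% probability).
    static int branchyMin(int a, int b) {
        return (a <= b) ? a : b;
    }

    // Branch-free: (a - b) >> 31 is all-ones iff a < b (assuming a - b does
    // not overflow), so the mask selects either the difference or zero.
    // Data-dependent like a cmov: no branch to predict, but no prediction win.
    static int branchFreeMin(int a, int b) {
        int diff = a - b;
        return b + (diff & (diff >> 31));
    }

    public static void main(String[] args) {
        System.out.println(branchyMin(3, 7));     // prints 3
        System.out.println(branchFreeMin(3, 7));  // prints 3
    }
}
```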
# AVX2

Here are the AVX2 numbers:

Benchmark                              (probability)  (range)  (seed)  (size)  Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intClippingRange                    N/A       90       0    1000  thrpt   4   1068.265   1039.087  ops/ms
MinMaxVector.intClippingRange                    N/A      100       0    1000  thrpt   4   1067.705   1038.760  ops/ms
MinMaxVector.intLoopMax                           50      N/A     N/A    2048  thrpt   4    605.015    604.364  ops/ms
MinMaxVector.intLoopMax                           80      N/A     N/A    2048  thrpt   4    605.169    604.366  ops/ms
MinMaxVector.intLoopMax                          100      N/A     N/A    2048  thrpt   4    604.527    604.494  ops/ms
MinMaxVector.intLoopMin                           50      N/A     N/A    2048  thrpt   4    605.099    605.057  ops/ms
MinMaxVector.intLoopMin                           80      N/A     N/A    2048  thrpt   4    495.071    605.080  ops/ms  (+22%)
MinMaxVector.intLoopMin                          100      N/A     N/A    2048  thrpt   4    495.134    495.047  ops/ms
MinMaxVector.intReductionMultiplyMax              50      N/A     N/A    2048  thrpt   4    407.953    407.987  ops/ms
MinMaxVector.intReductionMultiplyMax              80      N/A     N/A    2048  thrpt   4    407.861    408.005  ops/ms
MinMaxVector.intReductionMultiplyMax             100      N/A     N/A    2048  thrpt   4    873.915    407.995  ops/ms  (-53%)
MinMaxVector.intReductionMultiplyMin              50      N/A     N/A    2048  thrpt   4    408.019    407.987  ops/ms
MinMaxVector.intReductionMultiplyMin              80      N/A     N/A    2048  thrpt   4    407.971    408.009  ops/ms
MinMaxVector.intReductionMultiplyMin             100      N/A     N/A    2048  thrpt   4    407.970    407.956  ops/ms
MinMaxVector.intReductionSimpleMax                50      N/A     N/A    2048  thrpt   4    460.443    460.514  ops/ms
MinMaxVector.intReductionSimpleMax                80      N/A     N/A    2048  thrpt   4    460.484    460.581  ops/ms
MinMaxVector.intReductionSimpleMax               100      N/A     N/A    2048  thrpt   4   1015.601    460.446  ops/ms  (-54%)
MinMaxVector.intReductionSimpleMin                50      N/A     N/A    2048  thrpt   4    460.494    460.532  ops/ms
MinMaxVector.intReductionSimpleMin                80      N/A     N/A    2048  thrpt   4    460.489    460.451  ops/ms
MinMaxVector.intReductionSimpleMin               100      N/A     N/A    2048  thrpt   4   1021.420    460.435  ops/ms  (-55%)

This time we see an improvement in `intLoopMin` @ 80%, but again it was not observable in the `perfasm` run. `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% have regressions, the familiar one of cmp+mov vs cmov.
`intReductionMultiplyMin` @ 100% does not show a regression, for the same reason as above: both variants use `cmov`. The interesting case is `intReductionSimpleMin` @ 100%. We see a regression there, but I didn't observe it in the `perfasm` run. So, could this be due to variance in whether `cmov` gets applied or not?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670609470