On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño <gal...@openjdk.org> wrote:

>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in 
>> order to help improve vectorization performance.
>> 
>> Currently vectorization does not kick in for loops containing either of 
>> these calls because of the following error:
>> 
>> 
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>> 
>> 
>> The control flow is due to the Java implementation of these methods, e.g.
>> 
>> 
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>> 
>> 
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes with 
>> MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can 
>> carry out the vectorization.
>> E.g.
>> 
>> 
>> SuperWord::transform_loop:
>>     Loop: N518/N126  counted [int,int),+4 (1025 iters)  main has_sfpt 
>> strip_mined
>>  518  CountedLoop  === 518 246 126  [[ 513 517 518 242 521 522 422 210 ]] 
>> inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] 
>> !jvms: Test::test @ bci:14 (line 21)
>> 
>> 
>> Applying the same changes to `ReductionPerf` as in 
>> https://github.com/openjdk/jdk/pull/13056, we can compare the results before 
>> and after. Before the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1155
>> long max   1173
>> 
>> 
>> After the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1042
>> long max   1042
>> 
>> 
>> This patch does not add any platform-specific backend implementations for 
>> the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into 
>> CMoveL.
>> 
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these 
>> results:
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 44 additional 
> commits since the last revision:
> 
>  - Merge branch 'master' into topic.intrinsify-max-min-long
>  - Fix typo
>  - Renaming methods and variables and add docu on algorithms
>  - Fix copyright years
>  - Make sure it runs with cpus with either avx512 or asimd
>  - Test can only run with 256 bit registers or bigger
>    
>    * Remove platform dependant check
>    and use platform independent configuration instead.
>  - Fix license header
>  - Tests should also run on aarch64 asimd=true envs
>  - Added comment around the assertions
>  - Adjust min/max identity IR test expectations after changes
>  - ... and 34 more: https://git.openjdk.org/jdk/compare/75abfbc2...a190ae68

Following our discussion, I've run the `MinMaxVector.long` benchmarks with 
superword disabled, with and without the `_maxL` intrinsic, in both AVX-512 
and AVX2 modes.

The first thing I've observed is that, with superword disabled, the AVX-512 
and AVX2 results are identical, so I will focus on the AVX-512 results below.


Benchmark                              (probability)  (range)  (seed)  (size)   Mode  Cnt     -maxL      +maxL   Units
MinMaxVector.longClippingRange                   N/A       90       0    1000  thrpt    4  1012.017  1011.8109  ops/ms
MinMaxVector.longClippingRange                   N/A      100       0    1000  thrpt    4  1012.113  1011.9530  ops/ms
MinMaxVector.longLoopMax                          50      N/A     N/A    2048  thrpt    4   463.946   473.9408  ops/ms
MinMaxVector.longLoopMax                          80      N/A     N/A    2048  thrpt    4   465.391   473.8063  ops/ms
MinMaxVector.longLoopMax                         100      N/A     N/A    2048  thrpt    4   510.992   471.6280  ops/ms (-8%)
MinMaxVector.longLoopMin                          50      N/A     N/A    2048  thrpt    4   496.036   495.3142  ops/ms
MinMaxVector.longLoopMin                          80      N/A     N/A    2048  thrpt    4   495.797   497.1214  ops/ms
MinMaxVector.longLoopMin                         100      N/A     N/A    2048  thrpt    4   495.302   495.1535  ops/ms
MinMaxVector.longReductionMultiplyMax             50      N/A     N/A    2048  thrpt    4   405.495   405.3936  ops/ms
MinMaxVector.longReductionMultiplyMax             80      N/A     N/A    2048  thrpt    4   405.342   405.4505  ops/ms
MinMaxVector.longReductionMultiplyMax            100      N/A     N/A    2048  thrpt    4   846.492   405.4779  ops/ms (-52%)
MinMaxVector.longReductionMultiplyMin             50      N/A     N/A    2048  thrpt    4   414.755   414.7036  ops/ms
MinMaxVector.longReductionMultiplyMin             80      N/A     N/A    2048  thrpt    4   414.705   414.7093  ops/ms
MinMaxVector.longReductionMultiplyMin            100      N/A     N/A    2048  thrpt    4   414.761   414.7150  ops/ms
MinMaxVector.longReductionSimpleMax               50      N/A     N/A    2048  thrpt    4   460.435   460.3764  ops/ms
MinMaxVector.longReductionSimpleMax               80      N/A     N/A    2048  thrpt    4   460.438   460.4718  ops/ms
MinMaxVector.longReductionSimpleMax              100      N/A     N/A    2048  thrpt    4  1023.005   460.5417  ops/ms (-55%)
MinMaxVector.longReductionSimpleMin               50      N/A     N/A    2048  thrpt    4   459.184   459.1662  ops/ms
MinMaxVector.longReductionSimpleMin               80      N/A     N/A    2048  thrpt    4   459.265   459.2588  ops/ms
MinMaxVector.longReductionSimpleMin              100      N/A     N/A    2048  thrpt    4   459.263   459.1304  ops/ms
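
As a sanity check on the three annotated rows, the relative change from the 
`-maxL` column to the `+maxL` column can be recomputed directly. This is only 
a small sketch: the class and method names (`RegressionPercent`, 
`relativeChangePercent`) are made up for illustration, and the throughput 
values are copied from the table above.

```java
// Hypothetical helper for illustration; not part of the JDK or the benchmark.
final class RegressionPercent {

    /**
     * Relative change of {@code after} versus {@code before}, in percent.
     * For a throughput metric (ops/ms), a negative value means a slowdown.
     */
    static double relativeChangePercent(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // ops/ms values copied from the probability=100 rows of the table.
        System.out.println("longLoopMax              "
                + relativeChangePercent(510.992, 471.6280));
        System.out.println("longReductionMultiplyMax "
                + relativeChangePercent(846.492, 405.4779));
        System.out.println("longReductionSimpleMax   "
                + relativeChangePercent(1023.005, 460.5417));
    }
}
```

Rounded to whole percentages, these come out at the -8%, -52% and -55% 
annotated in the table.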


`longLoopMax@100%`, `longReductionMultiplyMax@100%` and 
`longReductionSimpleMax@100%` regress with the `_maxL` intrinsic. The cause is 
familiar: without the intrinsic, a compare plus a conditional branch (`cmp` + 
`jl`/`jge`) is emitted, while with the intrinsic, under the conditions above, 
`cmp` + `cmov` is emitted:

# `longLoopMax` @ 100%

-maxL:

   4.18%  ││││  │││   │           0x00007fb7580f84b2:   cmpq            %r13, %r11
          ││││╭ │││   │           0x00007fb7580f84b5:   jl              0x7fb7580f84ec      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │││││ │││   │                                                                     ; - java.lang.Math::max@11 (line 2038)
          │││││ │││   │                                                                     ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@27 (line 256)
          │││││ │││   │                                                                     ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19 (line 124)
   4.23%  │││││ │││↗  │           0x00007fb7580f84bb:   movq            %r11, 0x10(%rbp, %rsi, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
          │││││ ││││  │                                                                     ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@30 (line 256)
          │││││ ││││  │                                                                     ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19 (line 124)


+maxL:

   1.06%  │││  0x00007fe1b40f5ed1:   movq               0x20(%rbx, %r10, 8), %r14;*laload {reexecute=0 rethrow=0 return_oop=0}
          │││                                                               ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@26 (line 256)
          │││                                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19 (line 124)
   1.34%  │││  0x00007fe1b40f5ed6:   cmpq               %r14, %r9
   2.78%  │││  0x00007fe1b40f5ed9:   cmovlq             %r14, %r9
   2.58%  │││  0x00007fe1b40f5edd:   movq               %r9, 0x20(%rax, %r10, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
          │││                                                               ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@30 (line 256)
          │││                                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19 (line 124)


# `longReductionMultiplyMax` @ 100%

-maxL:

   6.71%  ││  ││↗    0x00007f8af40f6278:   imulq                $0xb, 0x18(%r14, %r8, 8), %rdx
          ││  │││                                                           ;*lmul {reexecute=0 rethrow=0 return_oop=0}
          ││  │││                                                           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax@24 (line 285)
          ││  │││                                                           ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub@19 (line 124)
   5.28%  ││  │││    0x00007f8af40f627e:   nop
  10.23%  ││  │││    0x00007f8af40f6280:   cmpq         %rdx, %rdi
          ││╭ │││    0x00007f8af40f6283:   jge          0x7f8af40f62a7      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │││ │││                                                           ; - java.lang.Math::max@11 (line 2038)
          │││ │││                                                           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax@30 (line 286)
          │││ │││                                                           ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub@19 (line 124)


+maxL:

  11.07%  ││  0x00007f47000f5c4d:   imulq               $0xb, 0x18(%r14, %r11, 8), %rax
          ││                                                                ;*lmul {reexecute=0 rethrow=0 return_oop=0}
          ││                                                                ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax@24 (line 285)
          ││                                                                ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub@19 (line 124)
   0.07%  ││  0x00007f47000f5c53:   cmpq                %rdx, %rax
  11.87%  ││  0x00007f47000f5c56:   cmovlq              %rdx, %rax          ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
          ││                                                                ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax@30 (line 286)
          ││                                                                ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub@19 (line 124)


# `longReductionSimpleMax` @ 100%

-maxL:

   5.71%  │││││     │││↗      │             0x00007fc2380f75f9:   movq          0x20(%r14, %r8, 8), %rdi;*laload {reexecute=0 rethrow=0 return_oop=0}
          │││││     ││││      │                                                                     ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax@20 (line 295)
          │││││     ││││      │                                                                     ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub@19 (line 124)
   1.85%  │││││     ││││      │             0x00007fc2380f75fe:   nop
   4.52%  │││││     ││││      │             0x00007fc2380f7600:   cmpq          %rdi, %rdx
          │││││╭    ││││      │             0x00007fc2380f7603:   jge           0x7fc2380f7667      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          ││││││    ││││      │                                                                     ; - java.lang.Math::max@11 (line 2038)
          ││││││    ││││      │                                                                     ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax@26 (line 296)
          ││││││    ││││      │                                                                     ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub@19 (line 124)


+maxL:

   3.06%   ││││││  0x00007fa6d00f6020:   movq           0x70(%r14, %r11, 8), %r8;*laload {reexecute=0 rethrow=0 return_oop=0}
           ││││││                                                           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax@20 (line 295)
           ││││││                                                           ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub@19 (line 124)
           ││││││  0x00007fa6d00f6025:   cmpq           %r8, %r13
   2.88%   ││││││  0x00007fa6d00f6028:   cmovlq         %r8, %r13           ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
           ││││││                                                           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax@26 (line 296)
           ││││││                                                           ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub@19 (line 124)
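
For readers who want to see the source-level shape behind the listings above, 
here is a rough Java sketch of a max reduction written both ways. It is not 
the `MinMaxVector` benchmark source; the class and method names are invented 
for illustration, and the "100%" input is an assumption that the probability 
parameter means one side of the comparison always wins. Which machine code a 
given JVM emits for either method depends on the JIT, the collected profile 
and the flags in use, so the sketch only shows that the two forms are 
semantically equivalent.

```java
// Illustrative sketch only; names and input shape are assumptions.
final class MaxShapes {

    /** Branchy form, mirroring the Java implementation of Math.max(long, long). */
    static long maxBranch(long a, long b) {
        return (a >= b) ? a : b;
    }

    /** Max reduction written with the explicit compare-and-select form. */
    static long reduceBranch(long[] values) {
        long result = Long.MIN_VALUE;
        for (long v : values) {
            result = maxBranch(v, result);
        }
        return result;
    }

    /** Same reduction through Math.max, the call the `_maxL` intrinsic targets. */
    static long reduceMathMax(long[] values) {
        long result = Long.MIN_VALUE;
        for (long v : values) {
            result = Math.max(v, result);
        }
        return result;
    }

    /**
     * Assumed "100%" input: strictly increasing values, so each new element
     * is always the new maximum and the comparison always goes the same way.
     */
    static long[] increasing(int size) {
        long[] values = new long[size];
        for (int i = 0; i < size; i++) {
            values[i] = i;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] data = increasing(2048);
        System.out.println(reduceBranch(data) == reduceMathMax(data));
    }
}
```

With input like this the comparison outcome is fully predictable, which is 
the situation where a well-predicted branch can beat a `cmov` that sits on 
the reduction's data dependency chain.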

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669329851
