Re: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v4]

Galder Zamarreño Thu, 17 Oct 2024 03:15:39 -0700

On Thu, 17 Oct 2024 10:10:56 GMT, Galder Zamarreño <gal...@openjdk.org> wrote:


>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in 
>> order to help improve vectorization performance.
>> 
>> Currently vectorization does not kick in for loops containing either of 
>> these calls because of the following error:
>> 
>> 
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>> 
>> 
>> The control flow is due to the Java implementation of these methods, e.g.
>> 
>> 
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>> 
>> 
>> This patch intrinsifies the calls, replacing the CmpL + Bool nodes with 
>> MaxL/MinL nodes.
>> As a result, the vectorizer no longer finds control flow in the loop and 
>> can vectorize it.
>> E.g.
>> 
>> 
>> SuperWord::transform_loop:
>>     Loop: N518/N126  counted [int,int),+4 (1025 iters)  main has_sfpt strip_mined
>>  518  CountedLoop  === 518 246 126  [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>> 
>> 
>> Applying the same changes to `ReductionPerf` as in 
>> https://github.com/openjdk/jdk/pull/13056, we can compare the results before 
>> and after. Before the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1155
>> long max   1173
>> 
>> 
>> After the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1042
>> long max   1042
>> 
>> 
>> This patch does not add any platform-specific backend implementations for 
>> the MaxL/MinL nodes.
>> Therefore, it still relies on macro expansion to transform those into 
>> CMoveL.
>> 
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these 
>> results:
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 30 additional 
> commits since the last revision:
> 
>  - Use same default size as in other vector reduction benchmarks
>  - Renamed benchmark class
>  - Double/Float tests only when avx enabled
>  - Make state class non-final
>  - Restore previous benchmark iterations and default param size
>  - Add clipping range benchmark that uses min/max
>  - Encapsulate benchmark state within an inner class
>  - Avoid creating result array in benchmark method
>  - Merge branch 'master' into topic.intrinsify-max-min-long
>  - Revert "Implement cmovL as a jump+mov branch"
>    
>    This reverts commit 1522e26bf66c47b780ebd0d0d0c4f78a4c564e44.
>  - ... and 20 more: https://git.openjdk.org/jdk/compare/52005a12...0a8718e1

I've re-run the benchmarks in non-AVX512 and AVX512 environments, making sure 
no .ad changes were applied.
I've also added clipping range benchmarks suggested by @theRealAph.

Remember that the AVX512 and non-AVX512 results were obtained on different 
systems, so they cannot be compared with each other. AVX512 results can only 
be compared between base and patched versions, and the same holds for 
non-AVX512 results.

The results for loop* and reduction* match the behaviour explained in 
https://github.com/openjdk/jdk/pull/20098#issuecomment-2379386872. The 
explanation in that comment applies here as well:


Benchmark                          (probability)  (range)  (seed)  (size)   Mode  Cnt    Score    Error   Units
MinMaxLoopBench.longReductionMax              50      N/A     N/A   10000  thrpt    8  107.441 ±  0.092  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax              80      N/A     N/A   10000  thrpt    8  107.431 ±  0.057  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax             100      N/A     N/A   10000  thrpt    8  213.200 ±  5.070  ops/ms (non-AVX512, base)
MinMaxLoopBench.longReductionMax              50      N/A     N/A   10000  thrpt    8  107.411 ±  0.088  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longReductionMax              80      N/A     N/A   10000  thrpt    8  107.425 ±  0.097  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longReductionMax             100      N/A     N/A   10000  thrpt    8  107.377 ±  0.075  ops/ms (non-AVX512, patch)
MinMaxLoopBench.longReductionMax              50      N/A     N/A   10000  thrpt    8  414.214 ±  0.898  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax              80      N/A     N/A   10000  thrpt    8  414.637 ±  0.074  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax             100      N/A     N/A   10000  thrpt    8  239.570 ±  3.034  ops/ms (AVX512, base)
MinMaxLoopBench.longReductionMax              50      N/A     N/A   10000  thrpt    8  414.276 ±  0.399  ops/ms (AVX512, patch)
MinMaxLoopBench.longReductionMax              80      N/A     N/A   10000  thrpt    8  414.284 ±  0.342  ops/ms (AVX512, patch)
MinMaxLoopBench.longReductionMax             100      N/A     N/A   10000  thrpt    8  413.860 ±  1.831  ops/ms (AVX512, patch)
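
For readers who have not looked at the benchmark, a long max reduction loop has roughly the shape below. This is only a minimal sketch: the class and method names are hypothetical and it is not the actual `MinMaxLoopBench` code, which is not reproduced in this message.

```java
public class LongReductionSketch {
    // Hypothetical reduction kernel: folds the whole array into its maximum.
    public static long reductionMax(long[] values) {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < values.length; i++) {
            // Each iteration depends on the previous value of max
            // (a loop-carried dependency through Math.max).
            max = Math.max(max, values[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        long[] values = {3L, -7L, 42L, 0L};
        System.out.println(reductionMax(values)); // prints 42
    }
}
```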


The clipping range results show big improvements, roughly 5.6x to 5.9x higher 
throughput in both environments:

Benchmark                          (probability)  (range)  (seed)  (size)   Mode  Cnt    Score    Error    Units
MinMaxLoopBench.longClippingRange            N/A       90       0   10000  thrpt    8  108.503 ±  0.399   ops/ms (non-AVX512, base)
MinMaxLoopBench.longClippingRange            N/A      100       0   10000  thrpt    8  107.655 ±  1.759   ops/ms (non-AVX512, base)
MinMaxLoopBench.longClippingRange            N/A       90       0   10000  thrpt    8  613.310 ±  1.140   ops/ms (non-AVX512, patch)
MinMaxLoopBench.longClippingRange            N/A      100       0   10000  thrpt    8  613.282 ±  0.744   ops/ms (non-AVX512, patch)
MinMaxLoopBench.longClippingRange            N/A       90       0   10000  thrpt    8   64.343 ±  0.396   ops/ms (AVX512, base)
MinMaxLoopBench.longClippingRange            N/A      100       0   10000  thrpt    8   61.323 ±  6.059   ops/ms (AVX512, base)
MinMaxLoopBench.longClippingRange            N/A       90       0   10000  thrpt    8  359.525 ±  0.570   ops/ms (AVX512, patch)
MinMaxLoopBench.longClippingRange            N/A      100       0   10000  thrpt    8  360.284 ±  1.408   ops/ms (AVX512, patch)
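
As a rough illustration of the kind of loop measured here, a clipping range kernel can be sketched as below. This assumes each element is clamped into `[low, high]` with nested `Math.max`/`Math.min` calls and stored into an output array; the class name, method name and parameters are hypothetical, and this is not the actual `MinMaxLoopBench` code.

```java
import java.util.Arrays;

public class LongClippingSketch {
    // Hypothetical clipping kernel: clamps every element of src into [low, high].
    // There is no dependency between iterations, only one store per element.
    public static void clip(long[] src, long[] dst, long low, long high) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.min(Math.max(src[i], low), high);
        }
    }

    public static void main(String[] args) {
        long[] src = {-5L, 10L, 95L, 200L};
        long[] dst = new long[src.length];
        clip(src, dst, 0L, 100L);
        System.out.println(Arrays.toString(dst)); // prints [0, 10, 95, 100]
    }
}
```

Unlike a reduction, each iteration here is independent of the others.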


The improvements in clipping range are due to vector instructions being used:


   0.11%  ││  0x00007f5e000266c8:   vpcmpgtq            %ymm4, %ymm5, %ymm12
   0.56%  ││  0x00007f5e000266cd:   vblendvpd           %ymm12, %ymm5, %ymm4, %ymm12
   0.04%  ││  0x00007f5e000266d3:   vpcmpgtq            %ymm6, %ymm12, %ymm11
   1.10%  ││  0x00007f5e000266d8:   vblendvpd           %ymm11, %ymm6, %ymm12, %ymm11
   2.93%  ││  0x00007f5e000266de:   vmovdqu             %ymm11, 0xf0(%r9, %r10, 8)
          ││                                                            ;*lastore {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange@35 (line 211)
          ││                                                            ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19 (line 124)


Whereas without the changes it uses scalar instructions:


   0.56%          │↗    0x00007f9e98025e83:   cmpq              %r8, %rdx
   2.98%  ╭       ││    0x00007f9e98025e86:   jle               0x7f9e98025e8b      ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
          │       ││                                                              ; - java.lang.Math::min@3 (line 2132)
          │       ││                                                              ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange@32 (line 211)
          │       ││                                                              ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19 (line 124)
   0.03%  │       ││    0x00007f9e98025e88:   movq              %r8, %rdx           ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │       ││                                                              ; - java.lang.Math::min@11 (line 2132)
          │       ││                                                              ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange@32 (line 211)
          │       ││                                                              ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19 (line 124)
   0.04%  ↘       ││    0x00007f9e98025e8b:   movq              %rdx, 0x28(%r13, %rcx, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
                  ││                                                              ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange@35 (line 211)
                  ││                                                              ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19 (line 124)
  19.79%          ││    0x00007f9e98025e90:   addl              $4, %ecx            ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                  ││                                                              ; - org.openjdk.bench.java.lang.MinMaxLoopBench::longClippingRange@36 (line 210)
                  ││                                                              ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxLoopBench_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19 (line 124)


Finally, I've fixed the float/double IR tests by adding conditionals to make 
sure they only run when UseAVX > 0.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2419120069
