On Mon, 19 May 2025 03:10:46 GMT, Xiaohong Gong <[email protected]> wrote:
>> JDK-8318650 introduced HotSpot intrinsification of subword gather load APIs
>> for X86 platforms [1]. However, the current implementation is not optimal
>> for the AArch64 SVE platform, which natively supports vector instructions
>> for subword gather load operations using an int vector for indices (see [2][3]).
>>
>> Two key areas require improvement:
>> 1. At the Java level, vector indices generated for range validation could be
>> reused for the subsequent gather load operation on architectures with native
>> vector instructions like AArch64 SVE. However, the current implementation
>> prevents compiler reuse of these index vectors due to divergent control
>> flow, potentially impacting performance.
>> 2. At the compiler IR level, the additional `offset` input for
>> `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR
>> complexity and complicates backend implementation. Furthermore, generating
>> `add` instructions before each memory access negatively impacts performance.
>>
>> This patch refactors the implementation at both the Java level and compiler
>> mid-end to improve efficiency and maintainability across different
>> architectures.
>>
>> Main changes:
>> 1. Java-side API refactoring:
>> - Explicitly passes generated index vectors to hotspot, eliminating
>> duplicate index vectors for gather load instructions on
>> architectures like AArch64.
>> 2. C2 compiler IR refactoring:
>> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword
>> types by removing the memory offset input and incorporating it into the
>> memory base `addr` at the IR level. This simplifies backend implementation,
>> reduces add operations, and unifies the IR across all types.
>> 3. Backend changes:
>> - Streamlines X86 implementation of subword gather operations following
>> the removal of the offset input from the IR level.
>>
>> Performance:
>> The relevant JMH benchmarks improve by up to 27% on an X86 AVX-512 system.
>> Please see the data below:
>>
>> ```
>> Benchmark                                     Mode   Cnt  Unit    SIZE  Before     After      Gain
>> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  64    53682.012  52650.325  0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  256   14484.252  14255.156  0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  1024  3664.900   3595.615   0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  4096  908.31...
>> ```
>
> Ping again~ Could anyone please take a look at this PR? Thanks a lot!
> Hi @XiaohongGong , very nice work! Looks good to me; I will do some testing
> and get back.
>
> Do you have any idea about the following regression?
>
> ```
> GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms
> 64 55844.814 48311.847 0.86
> GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms
> 256 15139.459 13009.848 0.85
> GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms
> 1024 3861.834 3284.944 0.85
> GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms
> 4096 938.665 817.673 0.87
> ```
>
> Best Regards
Yes, I also observed this regression. After analyzing it, I found it was caused
by the Java-side changes, which influence the range check elimination inside
`IntVector.fromArray()` and `ByteVector.intoArray()` in the benchmark. The root
cause is that the counted loop in the following benchmark case is not
recognized by the compiler as expected:
```
public void microByteGather256() {
    for (int i = 0; i < SIZE; i += B256.length()) {
        ByteVector.fromArray(B256, barr, 0, index, i)
                  .intoArray(bres, i);
    }
}
```
The loop iv phi node is not recognized when C2 matches the counted loop
pattern, because in this case it has been cast twice with `CastII`. The ideal
graph looks like:
```
  Loop
    \
     \  /--------------------|
      \/                     |
     Phi                     |
      |                      |
    CastII                   |
      |                      |
    CastII                   |
      |                      |
      |   ConI               |
       \   /                 |
        \ /                  |
        AddI ----------------|
```
The relevant code is
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1667.
Before the change, the graph was:
```
  Loop
    \
     \  /--------------------|
      \/                     |
     Phi                     |
      |                      |
    CastII                   |
      |                      |
      |   ConI               |
       \   /                 |
        \ /                  |
        AddI ----------------|
```
The difference comes from the index generation in `ByteVector.fromArray()`
(that is, the call to `IntVector.fromArray()` on the Java side). Before, the
`i` in the above loop was not used directly by `IntVector.fromArray()`;
instead, it was used by another loop phi node, so there was no additional
`CastII`. After my change, the loop in the Java implementation of
`ByteVector.fromArray()` is removed and `i` is used directly by
`IntVector.fromArray()`, where the bounds check consumes it before the indexes
are loaded. Hence another `CastII` is generated.
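To make this concrete, here is a minimal sketch of the two Java-level shapes.
The class and method names are hypothetical, and the scalar loops merely stand
in for the vector gather; this is not the actual `jdk.incubator.vector` code:
```
import java.util.Objects;

class GatherShapes {
    // Old shape (sketch): `mapOffset` (the benchmark's `i`) only initializes
    // the inner loop's own iv phi, so the outer loop's iv is not range-checked
    // (and hence not cast again) before the indexes are read.
    static byte[] gatherOld(byte[] a, int[] indexMap, int mapOffset, int lanes) {
        byte[] res = new byte[lanes];
        for (int j = 0, k = mapOffset; j < lanes; j++, k++) {
            res[j] = a[indexMap[k]];
        }
        return res;
    }

    // New shape (sketch): the inner loop is gone and `mapOffset` is bounds
    // checked up front, before the index vector is loaded, which is what adds
    // the second `CastII` on the outer loop's iv.
    static byte[] gatherNew(byte[] a, int[] indexMap, int mapOffset, int lanes) {
        Objects.checkFromIndexSize(mapOffset, lanes, indexMap.length);
        byte[] res = new byte[lanes];
        for (int j = 0; j < lanes; j++) {   // stands in for the vector gather
            res[j] = a[indexMap[mapOffset + j]];
        }
        return res;
    }
}
```
The only point of the sketch is where `i` (here `mapOffset`) meets a range
check: behind an inner loop phi in the old shape, but directly in the new one.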
When recognizing the loop iv phi node, C2 first checks whether the `Phi` is
wrapped in a `CastII`, and if so takes the input of the `CastII`. But this
check happens only once. See
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1659:
```
if (xphi->Opcode() == Op_Cast(iv_bt)) {
  xphi = xphi->in(1);
}
```
Once the `PhiNode` has been cast more than once, the pattern is not recognized.
I think we should refine the counted loop recognition logic by changing the
above `if` to a `while`, so that the iv phi node is still found at the end of a
chain of casts. A potential change would be:
```
while (xphi->Opcode() == Op_Cast(iv_bt)) {
  xphi = xphi->in(1);
}
```
I've tested this change and found that the regressed benchmarks recover their
previous performance. Since I'm not familiar with C2's loop transform code, I'd
prefer to investigate this issue further and possibly fix it in a follow-up
patch. Any suggestions?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-2892720654