On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs 
> for X86 platforms [1]. However, the current implementation is not optimal for 
> AArch64 SVE platform, which natively supports vector instructions for subword 
> gather load operations using an int vector for indices (see [2][3]).
> 
> Two key areas require improvement:
> 1. At the Java level, vector indices generated for range validation could be 
> reused for the subsequent gather load operation on architectures with native 
> vector instructions like AArch64 SVE. However, the current implementation 
> prevents compiler reuse of these index vectors due to divergent control flow, 
> potentially impacting performance.
> 2. At the compiler IR level, the additional `offset` input for 
> `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR 
> complexity and complicates backend implementation. Furthermore, generating 
> `add` instructions before each memory access negatively impacts performance.
> 
> This patch refactors the implementation at both the Java level and compiler 
> mid-end to improve efficiency and maintainability across different 
> architectures.
> 
> Main changes:
> 1. Java-side API refactoring:
>    - Explicitly passes generated index vectors to hotspot, eliminating 
> duplicate index vectors for gather load instructions on
>      architectures like AArch64.
> 2. C2 compiler IR refactoring:
>    - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword 
> types by removing the memory offset input and incorporating it into the 
> memory base `addr` at the IR level. This simplifies backend implementation, 
> reduces add operations, and unifies the IR across all types.
> 3. Backend changes:
>    - Streamlines X86 implementation of subword gather operations following 
> the removal of the offset input from the IR level.
> 
> Performance:
> The relevant JMH benchmarks improve by up to 27% on an X86 AVX512 
> system. See the data below:
> 
> Benchmark                                     Mode   Cnt  Unit    SIZE   Before      After      Gain
> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  64     53682.012   52650.325  0.98
> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  256    14484.252   14255.156  0.98
> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  1024    3664.900    3595.615  0.98
> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  4096     908.312     935.269  1.02
> GatherOperationsBenchmark.micr...

It seems reasonable. @jatin-bhateja should definitely look this over, or 
someone else from Intel who knows this code well :)

I'll launch some testing now :)
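For anyone following along, here is a rough sketch of the user-facing gather operation this patch is optimizing, using the incubating Vector API. The species choice, array contents, and index map below are purely illustrative; internally, the `int[]` index map becomes the int index vector that (as I understand the description) the patch now passes explicitly to HotSpot so it can be reused for both range validation and the gather itself.

```java
import java.util.Arrays;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class SubwordGatherSketch {
    // 128-bit byte species: 16 lanes.
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    public static void main(String[] args) {
        byte[] src = new byte[64];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        // Illustrative index map: pick every other element, in reverse order.
        // Each lane i of the result loads src[indexMap[i]].
        int[] indexMap = new int[SPECIES.length()];
        for (int i = 0; i < indexMap.length; i++) {
            indexMap[i] = (indexMap.length - 1 - i) * 2;
        }
        // Gather: this is the call that bottoms out in the intrinsic
        // being refactored here (subword type + int index vector).
        ByteVector v = ByteVector.fromArray(SPECIES, src, 0, indexMap, 0);
        byte[] dst = new byte[SPECIES.length()];
        v.intoArray(dst, 0);
        System.out.println(Arrays.toString(dst));
    }
}
```

Run with `--add-modules jdk.incubator.vector`.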

src/hotspot/share/opto/vectorIntrinsics.cpp line 1203:

> 1201: //                 Class<? extends Vector<Integer>> vectorIndexClass, 
> int indexLength,
> 1202: //                 Object base, long offset,
> 1203: //                 W indexVector1, W index_vector2, W index_vector3, W 
> index_vector4,

Are we doing a mix of `CamelCase` and `with_underscore` on purpose?

src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java line 495:

> 493:                   Class<? extends Vector<Integer>> vectorIndexClass,
> 494:                   int indexLength, Object base, long offset,
> 495:                   W indexVector1, W indexVector2, W indexVector3, W 
> indexVector4,

Here you went for all `camelCase`, just an observation :)

-------------

PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2867447489
PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106674102
PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106673987
