On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs
> for X86 platforms [1]. However, the current implementation is not optimal for
> AArch64 SVE platform, which natively supports vector instructions for subword
> gather load operations using an int vector for indices (see [2][3]).
>
> Two key areas require improvement:
> 1. At the Java level, vector indices generated for range validation could be
>    reused for the subsequent gather load operation on architectures with native
>    vector instructions like AArch64 SVE. However, the current implementation
>    prevents compiler reuse of these index vectors due to divergent control flow,
>    potentially impacting performance.
> 2. At the compiler IR level, the additional `offset` input for
>    `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR
>    complexity and complicates backend implementation. Furthermore, generating
>    `add` instructions before each memory access negatively impacts performance.
>
> This patch refactors the implementation at both the Java level and compiler
> mid-end to improve efficiency and maintainability across different
> architectures.
>
> Main changes:
> 1. Java-side API refactoring:
>    - Explicitly passes generated index vectors to hotspot, eliminating
>      duplicate index vectors for gather load instructions on
>      architectures like AArch64.
> 2. C2 compiler IR refactoring:
>    - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword
>      types by removing the memory offset input and incorporating it into the
>      memory base `addr` at the IR level. This simplifies backend implementation,
>      reduces add operations, and unifies the IR across all types.
> 3. Backend changes:
>    - Streamlines X86 implementation of subword gather operations following
>      the removal of the offset input from the IR level.
>
> Performance:
> The performance of the relative JMH improves up to 27% on a X86 AVX512
> system.
> Please see the data below:
>
> Benchmark                                     Mode   Cnt  Unit    SIZE  Before     After      Gain
> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  64    53682.012  52650.325  0.98
> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  256   14484.252  14255.156  0.98
> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  1024  3664.900   3595.615   0.98
> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  4096  908.312    935.269    1.02
> GatherOperationsBenchmark.micr...

Hi @eme64, could you please help take a look at this PR, which is a part of
https://github.com/openjdk/jdk/pull/24679? Thanks a lot in advance!

Hi @jatin-bhateja, could you please kindly review this PR, especially the X86
codegen part? Thanks a lot in advance!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-2865493287
PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-2865495716
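
For readers unfamiliar with subword gather loads, the idea in the quoted description can be modelled in scalar Java. This is only an illustrative sketch, not the code from the PR: the class name `SubwordGatherSketch` and the method `gather` are hypothetical, and the lane semantics (lane `i` reads `a[offset + indexMap[mapOffset + i]]`) are assumed to mirror the public `ByteVector.fromArray` gather overload. It builds the per-lane indices once, uses them for range validation, and then reuses the same indices for the load.

```java
import java.util.Arrays;

public class SubwordGatherSketch {
    // Hypothetical scalar model of a byte gather load.
    // Lane i reads a[offset + indexMap[mapOffset + i]].
    static byte[] gather(byte[] a, int offset, int[] indexMap, int mapOffset, int lanes) {
        // Build the per-lane index "vector" once.
        int[] idx = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            idx[i] = offset + indexMap[mapOffset + i];
        }
        // Range validation works on the index vector...
        for (int i : idx) {
            if (i < 0 || i >= a.length) {
                throw new IndexOutOfBoundsException("lane index " + i);
            }
        }
        // ...and the load reuses the same indices instead of recomputing them.
        byte[] result = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            result[i] = a[idx[i]];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = {10, 11, 12, 13, 14, 15, 16, 17};
        int[] indexMap = {3, 0, 2, 1};
        System.out.println(Arrays.toString(gather(a, 4, indexMap, 0, 4)));
    }
}
```

In the real intrinsic the indices live in an int vector and the load is a single hardware gather where the platform supports it; the loops here only stand in for those lane-wise operations.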