On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs > for X86 platforms [1]. However, the current implementation is not optimal for > AArch64 SVE platform, which natively supports vector instructions for subword > gather load operations using an int vector for indices (see [2][3]). > > Two key areas require improvement: > 1. At the Java level, vector indices generated for range validation could be > reused for the subsequent gather load operation on architectures with native > vector instructions like AArch64 SVE. However, the current implementation > prevents compiler reuse of these index vectors due to divergent control flow, > potentially impacting performance. > 2. At the compiler IR level, the additional `offset` input for > `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR > complexity and complicates backend implementation. Furthermore, generating > `add` instructions before each memory access negatively impacts performance. > > This patch refactors the implementation at both the Java level and compiler > mid-end to improve efficiency and maintainability across different > architectures. > > Main changes: > 1. Java-side API refactoring: > - Explicitly passes generated index vectors to hotspot, eliminating > duplicate index vectors for gather load instructions on > architectures like AArch64. > 2. C2 compiler IR refactoring: > - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword > types by removing the memory offset input and incorporating it into the > memory base `addr` at the IR level. This simplifies backend implementation, > reduces add operations, and unifies the IR across all types. > 3. Backend changes: > - Streamlines X86 implementation of subword gather operations following > the removal of the offset input from the IR level. > > Performance: > The performance of the relative JMH improves up to 27% on a X86 AVX512 > system. Please see the data below: > > Benchmark Mode Cnt Unit > SIZE Before After Gain > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms > 64 53682.012 52650.325 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms > 256 14484.252 14255.156 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms > 1024 3664.900 3595.615 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms > 4096 908.312 935.269 1.02 > GatherOperationsBenchmark.micr... It seems reasonable. @jatin-bhateja should definitively look over this, or someone else from Intel who knows this code well :) I'll launch some testing now :) src/hotspot/share/opto/vectorIntrinsics.cpp line 1203: > 1201: // Class<? extends Vector<Integer>> vectorIndexClass, > int indexLength, > 1202: // Object base, long offset, > 1203: // W indexVector1, W index_vector2, W index_vector3, W > index_vector4, Are we doing a mix of `CamelCase` and `with_underscore` on purpose? src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java line 495: > 493: Class<? extends Vector<Integer>> vectorIndexClass, > 494: int indexLength, Object base, long offset, > 495: W indexVector1, W indexVector2, W indexVector3, W > indexVector4, Here you went for all `camelCase`, just an observation :) ------------- PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2867447489 PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106674102 PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106673987