On Mon, 19 May 2025 03:10:46 GMT, Xiaohong Gong <[email protected]> wrote:
>> JDK-8318650 introduced HotSpot intrinsification of subword gather load APIs
>> for X86 platforms [1]. However, the current implementation is not optimal
>> for the AArch64 SVE platform, which natively supports vector instructions
>> for subword gather load operations using an int vector for indices (see [2][3]).
>>
>> Two key areas require improvement:
>> 1. At the Java level, vector indices generated for range validation could be
>> reused for the subsequent gather load operation on architectures with native
>> vector instructions like AArch64 SVE. However, the current implementation
>> prevents compiler reuse of these index vectors due to divergent control
>> flow, potentially impacting performance.
>> 2. At the compiler IR level, the additional `offset` input for
>> `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR
>> complexity and complicates backend implementation. Furthermore, generating
>> `add` instructions before each memory access negatively impacts performance.
>>
>> This patch refactors the implementation at both the Java level and compiler
>> mid-end to improve efficiency and maintainability across different
>> architectures.
>>
>> Main changes:
>> 1. Java-side API refactoring:
>> - Explicitly passes generated index vectors to hotspot, eliminating
>> duplicate index vectors for gather load instructions on
>> architectures like AArch64.
>> 2. C2 compiler IR refactoring:
>> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword
>> types by removing the memory offset input and incorporating it into the
>> memory base `addr` at the IR level. This simplifies backend implementation,
>> reduces add operations, and unifies the IR across all types.
>> 3. Backend changes:
>> - Streamlines X86 implementation of subword gather operations following
>> the removal of the offset input from the IR level.
>>
>> Performance:
>> The relevant JMH benchmarks improve by up to 27% on an X86 AVX-512 system.
>> Please see the data below:
>>
>> ```
>> Benchmark                                     Mode   Cnt  Unit    SIZE  Before     After      Gain
>> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  64    53682.012  52650.325  0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  256   14484.252  14255.156  0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  1024  3664.900   3595.615   0.98
>> GatherOperationsBenchmark.microByteGather128  thrpt  30   ops/ms  4096  908.31...
>> ```
>
> Ping again~ Could anyone please take a look at this PR? Thanks a lot!
> Hi @XiaohongGong , very nice work! Looks good to me; I will do some testing
> and get back.
>
> Do you have any idea about the following regression?
>
> ```
> GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms
> 64 55844.814 48311.847 0.86
> GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms
> 256 15139.459 13009.848 0.85
> GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms
> 1024 3861.834 3284.944 0.85
> GatherOperationsBenchmark.microByteGather256 thrpt 30 ops/ms
> 4096 938.665 817.673 0.87
> ```
>
> Best Regards
Yes, I also observed this regression. After analyzing it, I found it was caused
by the Java-side changes, which influence the range check elimination inside
`IntVector.fromArray()` and `ByteVector.intoArray()` in the benchmark. The root
cause is that the counted loop in the following benchmark case is not
recognized by the compiler as expected:
```
public void microByteGather256() {
    for (int i = 0; i < SIZE; i += B256.length()) {
        ByteVector.fromArray(B256, barr, 0, index, i)
                  .intoArray(bres, i);
    }
}
```
The loop iv phi node is not recognized when C2 matches the counted loop
pattern, because in this case it has been cast twice with `CastII`. The ideal
graph looks like:
```
  Loop
    \
     \  /--------------------|
      \/                     |
     Phi                     |
      |                      |
    CastII                   |
      |                      |
    CastII                   |
      |                      |
      |   ConI               |
       \   /                 |
        \ /                  |
        AddI ----------------|
```
The relevant code is
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1667.
Before the change, the graph was:
```
  Loop
    \
     \  /--------------------|
      \/                     |
     Phi                     |
      |                      |
    CastII                   |
      |                      |
      |   ConI               |
       \   /                 |
        \ /                  |
        AddI ----------------|
```
The difference comes from the index generation in `ByteVector.fromArray()`
(that is, the call to `IntVector.fromArray()` on the Java side). Before, the
`i` in the above loop was not used directly by `IntVector.fromArray()`;
instead, it was used by another loop phi node, so there was no additional
`CastII`. After my change, the loop in the Java implementation of
`ByteVector.fromArray()` is removed and `i` is used directly by
`IntVector.fromArray()`, where the bounds check consumes it before the indexes
are loaded. Hence another `CastII` is generated.
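To make this concrete, here is a minimal sketch of the two Java-level shapes.
The class and method names are hypothetical, and the scalar loops merely stand
in for the vector gather; this is not the actual `jdk.incubator.vector` code:
```
import java.util.Objects;

class GatherShapes {
    // Old shape (sketch): `mapOffset` (the benchmark's `i`) only initializes
    // the inner loop's own iv phi, so the outer loop's iv is not range-checked
    // (and hence not cast again) before the indexes are read.
    static byte[] gatherOld(byte[] a, int[] indexMap, int mapOffset, int lanes) {
        byte[] res = new byte[lanes];
        for (int j = 0, k = mapOffset; j < lanes; j++, k++) {
            res[j] = a[indexMap[k]];
        }
        return res;
    }

    // New shape (sketch): the inner loop is gone and `mapOffset` is bounds
    // checked up front, before the index vector is loaded, which is what adds
    // the second `CastII` on the outer loop's iv.
    static byte[] gatherNew(byte[] a, int[] indexMap, int mapOffset, int lanes) {
        Objects.checkFromIndexSize(mapOffset, lanes, indexMap.length);
        byte[] res = new byte[lanes];
        for (int j = 0; j < lanes; j++) {   // stands in for the vector gather
            res[j] = a[indexMap[mapOffset + j]];
        }
        return res;
    }
}
```
The only point of the sketch is where `i` (here `mapOffset`) meets a range
check: behind an inner loop phi in the old shape, but directly in the new one.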
When recognizing the loop iv phi node, C2 first checks whether the `Phi` is
wrapped in a `CastII`, and if so takes the input of the `CastII`. But this
check happens only once. See
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1659:
```
if (xphi->Opcode() == Op_Cast(iv_bt)) {
  xphi = xphi->in(1);
}
```
Once the `PhiNode` has been cast more than once, the pattern is not recognized.
I think we should refine the counted loop recognition logic by changing the
above `if` to a `while`, so that the iv phi node is still found at the end of a
chain of casts. A potential change would be:
```
while (xphi->Opcode() == Op_Cast(iv_bt)) {
  xphi = xphi->in(1);
}
```
I've tested this change and found that the regressed benchmarks recover their
previous performance. Since I'm not familiar with C2's loop transform code, I'd
prefer to investigate this issue further and possibly fix it in a follow-up
patch. Any suggestions?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-2892720654