Re: RFR: 8351623: VectorAPI: Refactor subword gather load and add SVE implementation

Emanuel Peter Wed, 23 Apr 2025 06:06:13 -0700

On Thu, 17 Apr 2025 01:42:22 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:


>> ### Summary:
>> [JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added
>> HotSpot intrinsics for the subword gather load APIs on x86 platforms [1].
>> This patch implements the equivalent functionality for the AArch64 SVE
>> platform. In addition to the AArch64 backend support, it also refactors
>> the API implementation on the Java side and the compiler mid-end to make
>> the operations more efficient and maintainable across different
>> architectures.
>> 
>> ### Background:
>> Vector gather load APIs load values from memory addresses calculated by 
>> adding a base pointer to integer indices stored in an int array. SVE 
>> provides native vector gather load instructions for byte/short types using 
>> an int vector that holds the indices (see [2][3]).
>> 
>> The number of loaded elements must match the index vector's element count. 
>> Since int elements are 4/2 times larger than byte/short elements, and given 
>> `MaxVectorSize` constraints, the operation may need to be split into 
>> multiple parts.
>> 
>> Using a 128-bit byte vector gather load as an example, there are four 
>> scenarios with different `MaxVectorSize`:
>> 
>> 1. `MaxVectorSize = 16, byte_vector_size = 16`:
>>    - Can load 4 indices per vector register
>>    - So can finish 4 bytes per gather-load operation
>>    - Requires 4 gather-loads and a final merge
>>    Example:
>>    ```
>>    byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>>    int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>> 
>>    4 gather-load:
>>    idx_v1 = [1 4 2 3]    gather_v1 = [0000 0000 0000 becd]
>>    idx_v2 = [2 5 7 5]    gather_v2 = [0000 0000 0000 cfhf]
>>    idx_v3 = [1 7 6 0]    gather_v3 = [0000 0000 0000 bhga]
>>    idx_v4 = [9 11 10 15] gather_v4 = [0000 0000 0000 jlkp]
>>    merge: v = [jlkp bhga cfhf becd]
>>    ```
>> 
>> 2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
>>    - Can load 8 indices per vector register
>>    - So can finish 8 bytes per gather-load operation
>>    - Requires 2 gather-loads and a merge
>>    Example:
>>    ```
>>    byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>>    int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>> 
>>    2 gather-load:
>>    idx_v1 = [2 5 7 5 1 4 2 3]
>>    idx_v2 = [9 11 10 15 1 7 6 0]
>>    gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
>>    gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
>>    merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
>>    ```
>> 
>> 3. `MaxVectorSize = 64, byte_v...
>
> Hi @jatin-bhateja , could you please help take a look at this PR especially 
> the X86 part? Thanks a lot!
> Hi @RealFYang , could you please help review the RVV part? Thanks a lot!
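
For readers following the quoted description, here is a minimal scalar sketch of the split-and-merge scheme, using the `arr` and `idx` values from the examples. It is only a model of the idea, not the code in the PR: the class and method names (`SubwordGatherSketch`, `gatherPart`, `gatherBytes`, `gatherCount`) are made up for illustration, and each "gather" is simulated with a plain loop rather than an SVE instruction or the Vector API.

```java
import java.nio.charset.StandardCharsets;

public class SubwordGatherSketch {

    // Number of int indices that fit in one vector register of maxVectorSize bytes,
    // capped at the number of bytes requested.
    static int lanesPerGather(int byteVectorSize, int maxVectorSize) {
        return Math.min(byteVectorSize, maxVectorSize / Integer.BYTES);
    }

    // How many partial gathers are needed (assumes power-of-two sizes).
    static int gatherCount(int byteVectorSize, int maxVectorSize) {
        return byteVectorSize / lanesPerGather(byteVectorSize, maxVectorSize);
    }

    // Scalar stand-in for one hardware gather: loads `lanes` bytes via int indices.
    static byte[] gatherPart(byte[] arr, int[] idx, int from, int lanes) {
        byte[] part = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            part[i] = arr[idx[from + i]];
        }
        return part;
    }

    // Splits the gather into parts and merges them into one result in lane order.
    static byte[] gatherBytes(byte[] arr, int[] idx, int byteVectorSize, int maxVectorSize) {
        int lanes = lanesPerGather(byteVectorSize, maxVectorSize);
        byte[] result = new byte[byteVectorSize];
        for (int from = 0; from < byteVectorSize; from += lanes) {
            byte[] part = gatherPart(arr, idx, from, lanes);
            System.arraycopy(part, 0, result, from, lanes); // the "merge" step
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] idx = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};

        // Scenario 1: MaxVectorSize = 16 -> 4 gathers of 4 bytes each.
        System.out.println(gatherCount(16, 16));
        // Scenario 2: MaxVectorSize = 32 -> 2 gathers of 8 bytes each.
        System.out.println(gatherCount(16, 32));

        // Lane 0 is printed first here; the quoted examples print lane 0 last.
        byte[] v = gatherBytes(arr, idx, 16, 16);
        System.out.println(new String(v, StandardCharsets.US_ASCII)); // dcebfhfcaghbpklj
    }
}
```

Read right to left, the printed string matches the `merge: v = [jlkp bhga cfhf becd]` line in the quoted example.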

@XiaohongGong I had a quick look at your changes and PR description. Could 
you split some of the refactoring into a separate PR? That would make it 
easier to review. Currently, the PR combines x64 changes, aarch64 changes, 
Java library changes, and C2 changes. That's a lot at once, and it requires 
review from many different people at the same time.

Splitting would mean less work for each reviewer. Everybody could look at a 
smaller change set, which I think would also improve the quality of the code 
after review.

What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24679#issuecomment-2824229233
