On Thu, 17 Apr 2025 01:42:22 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

>> ### Summary:
>> [JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added the
>> HotSpot intrinsification of subword gather load APIs for X86 platforms [1].
>> This patch implements the equivalent functionality for the AArch64 SVE
>> platform. In addition to the AArch64 backend support, this patch also
>> refactors the API implementation on the Java side and the compiler mid-end
>> part to make the operations more efficient and maintainable across different
>> architectures.
>>
>> ### Background:
>> Vector gather load APIs load values from memory addresses calculated by
>> adding a base pointer to integer indices stored in an int array. SVE
>> provides native vector gather load instructions for byte/short types using
>> an int vector that holds the indices (see [2][3]).
>>
>> The number of loaded elements must match the index vector's element count.
>> Since int elements are 4/2 times larger than byte/short elements, and given
>> `MaxVectorSize` constraints, the operation may need to be split into
>> multiple parts.
>>
>> Using a 128-bit byte vector gather load as an example, there are four
>> scenarios with different `MaxVectorSize`:
>>
>> 1. `MaxVectorSize = 16, byte_vector_size = 16`:
>>    - Can load 4 indices per vector register
>>    - So can finish 4 bytes per gather-load operation
>>    - Requires 4 gather-loads and a final merge
>>    Example:
>>    ```
>>    byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>>    int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>>
>>    4 gather-load:
>>    idx_v1 = [1 4 2 3]      gather_v1 = [0000 0000 0000 becd]
>>    idx_v2 = [2 5 7 5]      gather_v2 = [0000 0000 0000 cfhf]
>>    idx_v3 = [1 7 6 0]      gather_v3 = [0000 0000 0000 bhga]
>>    idx_v4 = [9 11 10 15]   gather_v4 = [0000 0000 0000 jlkp]
>>    merge: v = [jlkp bhga cfhf becd]
>>    ```
>>
>> 2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
>>    - Can load 8 indices per vector register
>>    - So can finish 8 bytes per gather-load operation
>>    - Requires 2 gather-loads and a merge
>>    Example:
>>    ```
>>    byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>>    int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>>
>>    2 gather-load:
>>    idx_v1 = [2 5 7 5 1 4 2 3]
>>    idx_v2 = [9 11 10 15 1 7 6 0]
>>    gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
>>    gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
>>    merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
>>    ```
>>
>> 3. `MaxVectorSize = 64, byte_v...
>
> Hi @jatin-bhateja, could you please help take a look at this PR, especially
> the X86 part? Thanks a lot!
> Hi @RealFYang, could you please help review the RVV part? Thanks a lot!

@XiaohongGong I had a quick look at your changes and PR description. I wonder if you could split some of the refactoring into a separate PR? Currently, you have x64 changes, aarch64 changes, Java library changes, and C2 changes. That's a lot at once, and it requires review from many different people at the same time.

Splitting would mean less work for each reviewer: everybody could look at a smaller change set, and I think that would also increase the quality of the code after review.

What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24679#issuecomment-2824229233
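
For readers following the quoted description, here is a minimal scalar sketch in Java of the split it describes for the first two scenarios. It is only a model of the arithmetic, not the code in the PR: the class `SubwordGatherSketch` and its methods are hypothetical names, and it assumes the only constraint is that int indices (4 bytes each) must fit in one register of `MaxVectorSize` bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical scalar model of the split byte gather; not the PR's implementation.
final class SubwordGatherSketch {

    // Number of int indices that fit in one vector register of the given size.
    static int indexLanes(int maxVectorSizeBytes) {
        return maxVectorSizeBytes / Integer.BYTES;
    }

    // Number of gather-load operations needed to fill byteLanes byte lanes.
    static int gatherParts(int byteLanes, int maxVectorSizeBytes) {
        int lanes = indexLanes(maxVectorSizeBytes);
        return (byteLanes + lanes - 1) / lanes;
    }

    // Emulates the split gather: each outer iteration stands for one
    // gather-load of at most indexLanes(...) elements; writing straight
    // into the final lane position plays the role of the merge step.
    static byte[] gather(byte[] arr, int[] idx, int maxVectorSizeBytes) {
        int lanes = indexLanes(maxVectorSizeBytes);
        byte[] result = new byte[idx.length];
        for (int start = 0; start < idx.length; start += lanes) {
            int end = Math.min(start + lanes, idx.length);
            for (int i = start; i < end; i++) {
                result[i] = arr[idx[i]];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same data as the examples in the quoted description.
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] idx = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};

        System.out.println("parts for MaxVectorSize=16: " + gatherParts(idx.length, 16));
        System.out.println("parts for MaxVectorSize=32: " + gatherParts(idx.length, 32));
        System.out.println(new String(gather(arr, idx, 16), StandardCharsets.US_ASCII));
    }
}
```

The result is printed with lane 0 first, so the first four characters `dceb` correspond to `becd` in the description, which writes lane 0 on the right.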