On Wed, 16 Apr 2025 08:58:34 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

> ### Summary:
> [JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added the
> HotSpot intrinsification of the subword gather load APIs for x86 platforms [1].
> This patch implements the equivalent functionality for AArch64 SVE
> platforms. In addition to the AArch64 backend support, this patch also
> refactors the API implementation on the Java side and the compiler mid-end
> to make the operations more efficient and maintainable across different
> architectures.
>
> ### Background:
> Vector gather load APIs load values from memory addresses calculated by
> adding a base pointer to integer indices stored in an int array. SVE provides
> native vector gather load instructions for byte/short types using an int
> vector that holds the indices (see [2][3]).
>
> The number of loaded elements must match the index vector's element count.
> Since int elements are 4 times larger than byte elements and 2 times larger
> than short elements, and given the `MaxVectorSize` constraint, the operation
> may need to be split into multiple parts.
>
> Using a 128-bit byte vector gather load as an example, there are four
> scenarios with different `MaxVectorSize`:
>
> 1. `MaxVectorSize = 16, byte_vector_size = 16`:
>    - Can load 4 indices per vector register
>    - So each gather-load operation produces 4 bytes
>    - Requires 4 gather-loads and a final merge
>
>    Example:
>    ```
>    byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>    int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>
>    4 gather-load:
>    idx_v1 = [1 4 2 3]       gather_v1 = [0000 0000 0000 becd]
>    idx_v2 = [2 5 7 5]       gather_v2 = [0000 0000 0000 cfhf]
>    idx_v3 = [1 7 6 0]       gather_v3 = [0000 0000 0000 bhga]
>    idx_v4 = [9 11 10 15]    gather_v4 = [0000 0000 0000 jlkp]
>    merge: v = [jlkp bhga cfhf becd]
>    ```
>
> 2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
>    - Can load 8 indices per vector register
>    - So each gather-load operation produces 8 bytes
>    - Requires 2 gather-loads and a merge
>
>    Example:
>    ```
>    byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
>    int[] idx = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
>
>    2 gather-load:
>    idx_v1 = [2 5 7 5 1 4 2 3]
>    idx_v2 = [9 11 10 15 1 7 6 0]
>    gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
>    gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
>    merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
>    ```
>
> 3. `MaxVectorSize = 64, byte_vector_size = MaxVectorSize / 4`:
>    - Can load 16 indices per vector register
>    - So can ...

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/24679
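
For readers following the quoted background, the splitting scheme can be modeled in plain scalar Java. This is only a rough sketch under stated assumptions: the class `SubwordGatherSketch` and its methods `gather` and `gatherCount` are hypothetical names made up for illustration, the loops stand in for SVE gather-load instructions, and nothing here reflects the actual code in the pull request.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical scalar model of a split byte gather load; not code from the PR.
class SubwordGatherSketch {
    static final int INT_BYTES = 4; // each index occupies one 4-byte int lane

    // Number of partial gather-loads needed when one register of
    // maxVectorSize bytes can hold maxVectorSize / 4 int indices.
    static int gatherCount(int elementCount, int maxVectorSize) {
        int lanesPerGather = maxVectorSize / INT_BYTES;
        return (elementCount + lanesPerGather - 1) / lanesPerGather;
    }

    // Gathers arr[idx[i]] for every i, processing the indices in chunks of
    // at most maxVectorSize / 4 lanes, then "merging" each chunk into the
    // result at its own offset.
    static byte[] gather(byte[] arr, int[] idx, int maxVectorSize) {
        int lanesPerGather = maxVectorSize / INT_BYTES;
        byte[] result = new byte[idx.length];
        for (int base = 0; base < idx.length; base += lanesPerGather) {
            int end = Math.min(base + lanesPerGather, idx.length);
            // One partial gather-load: at most lanesPerGather elements.
            for (int lane = base; lane < end; lane++) {
                result[lane] = arr[idx[lane]];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] idx = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        for (int maxVectorSize : new int[] {16, 32}) {
            String v = new String(gather(arr, idx, maxVectorSize), StandardCharsets.US_ASCII);
            System.out.println("MaxVectorSize=" + maxVectorSize
                    + " gathers=" + gatherCount(idx.length, maxVectorSize)
                    + " result=" + v);
        }
    }
}
```

The result is listed from lane 0 upward, so it reads as the reverse of the register diagrams in the quoted examples, which show the highest lane on the left.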