### Summary:
[JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added 
HotSpot intrinsics for the subword gather load APIs on X86 platforms [1]. This 
patch implements the equivalent functionality for AArch64 SVE platforms. In 
addition to the AArch64 backend support, this patch refactors the API 
implementation on the Java side and the compiler mid-end so that the 
operations are more efficient and easier to maintain across architectures.

### Background:
Vector gather load APIs load values from memory addresses calculated by adding 
a base pointer to integer indices stored in an int array. SVE provides native 
vector gather load instructions for byte/short types that take the indices 
from an int vector (see [2][3]).
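
For reference, the semantics described above can be modeled in scalar code. This is a minimal sketch, not code from the patch: the class `ScalarGatherReference` and its `gather` method are hypothetical names, and the parameters follow the `offset + indexMap[mapOffset + i]` lane formula documented for the Vector API gather loads.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical scalar model of a subword (byte) gather load.
final class ScalarGatherReference {
    // Lane i receives src[offset + indexMap[mapOffset + i]].
    static byte[] gather(byte[] src, int offset, int[] indexMap, int mapOffset, int lanes) {
        byte[] result = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            int idx = offset + indexMap[mapOffset + i];
            // Every computed index must be checked before memory is touched.
            if (idx < 0 || idx >= src.length) {
                throw new IndexOutOfBoundsException("lane " + i + " maps to index " + idx);
            }
            result[i] = src[idx];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] index = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        byte[] v = gather(arr, 0, index, 0, 16);
        // Lane 0 is printed first: dcebfhfcaghbpklj
        System.out.println(new String(v, StandardCharsets.US_ASCII));
    }
}
```

The output lists lane 0 first, so it reads as the mirror image of the register diagrams in this description, which place lane 0 on the right.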

The number of loaded elements must match the index vector's element count. 
Since an int element is 4 times the size of a byte and 2 times the size of a 
short, and the index vector is limited by `MaxVectorSize`, the operation may 
need to be split into multiple parts.
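
The number of parts follows from simple arithmetic. The sketch below is illustrative only: `GatherSplitCalc` and its methods are hypothetical names, sizes are in bytes, and it assumes one int index per loaded element as described above.

```java
// Hypothetical helper that computes how many gather-loads a subword gather needs.
final class GatherSplitCalc {
    // Number of int indices that fit in one vector register.
    static int indicesPerRegister(int maxVectorSizeBytes) {
        return maxVectorSizeBytes / Integer.BYTES;
    }

    // Number of gather-load operations for a vector of vectorSizeBytes
    // holding elements of elemBytes (1 for byte, 2 for short).
    static int gatherParts(int maxVectorSizeBytes, int vectorSizeBytes, int elemBytes) {
        int lanes = vectorSizeBytes / elemBytes;
        // One operation can never produce more elements than the vector has lanes.
        int lanesPerPart = Math.min(lanes, indicesPerRegister(maxVectorSizeBytes));
        return (lanes + lanesPerPart - 1) / lanesPerPart; // ceiling division
    }

    public static void main(String[] args) {
        // 128-bit byte vector under MaxVectorSize = 16, 32, 64, 128: prints 4, 2, 1, 1
        for (int maxVectorSize : new int[] {16, 32, 64, 128}) {
            System.out.println(gatherParts(maxVectorSize, 16, 1));
        }
    }
}
```

For a 128-bit short vector with `MaxVectorSize = 16`, the same formula gives 2 parts, since 8 lanes are served by 4 indices per register.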

Using a 128-bit byte vector gather load as an example, there are four scenarios 
with different `MaxVectorSize`:

1. `MaxVectorSize = 16, byte_vector_size = 16`:
   - Can load 4 indices per vector register
   - So each gather-load operation produces 4 bytes
   - Requires 4 gather-loads and a final merge
   Example:
   ```
   byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
   int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]

   4 gather-loads:
   idx_v1 = [1 4 2 3]    gather_v1 = [0000 0000 0000 becd]
   idx_v2 = [2 5 7 5]    gather_v2 = [0000 0000 0000 cfhf]
   idx_v3 = [1 7 6 0]    gather_v3 = [0000 0000 0000 bhga]
   idx_v4 = [9 11 10 15] gather_v4 = [0000 0000 0000 jlkp]
   merge: v = [jlkp bhga cfhf becd]
   ```

2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
   - Can load 8 indices per vector register
   - So each gather-load operation produces 8 bytes
   - Requires 2 gather-loads and a merge
   Example:
   ```
   byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
   int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]

   2 gather-loads:
   idx_v1 = [2 5 7 5 1 4 2 3]
   idx_v2 = [9 11 10 15 1 7 6 0]
   gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
   gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
   merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
   ```

3. `MaxVectorSize = 64, byte_vector_size = MaxVectorSize / 4`:
   - Can load 16 indices per vector register
   - So a single gather-load operation produces all 16 bytes
   - No splitting required
   Example:
   ```
   byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
   int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]

   1 gather-load:
   idx_v = [9 11 10 15 1 7 6 0 2 5 7 5 1 4 2 3]
   v = [... 0000 0000 0000 0000 jlkp bhga cfhf becd]
   ```

4. `MaxVectorSize > 64, byte_vector_size < MaxVectorSize / 4`:
   - Can load 32+ indices per vector register
   - So a single gather-load operation produces all 16 bytes
   - Requires a mask that limits the load to 16 active elements, to keep 
     memory access safe.
   Example:
   ```
   byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
   int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]

   1 gather-load:
   idx_v = [... 0 0 0 0 0 0 0 0 9 11 10 15 1 7 6 0 2 5 7 5 1 4 2 3]
   v = [... 0000 0000 0000 0000 0000 jlkp bhga cfhf becd]
   ```
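
The four scenarios can be checked against each other with a small scalar simulation. This is only a sketch of the split-and-merge idea under stated assumptions, not the C2 or SVE implementation: `SplitGatherSimulation` and `splitGather` are hypothetical names, and each loop iteration stands in for one hardware gather-load.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of splitting a byte gather into parts and merging them.
final class SplitGatherSimulation {
    static byte[] splitGather(byte[] src, int[] index, int lanes, int maxVectorSizeBytes) {
        int indicesPerRegister = maxVectorSizeBytes / Integer.BYTES;
        // Scenario 4: never activate more lanes than the result vector has.
        int lanesPerPart = Math.min(lanes, indicesPerRegister);
        byte[] merged = new byte[lanes];
        for (int start = 0; start < lanes; start += lanesPerPart) {
            int active = Math.min(lanesPerPart, lanes - start);
            // One simulated gather-load over the next slice of the index array.
            byte[] part = new byte[active];
            for (int i = 0; i < active; i++) {
                part[i] = src[index[start + i]];
            }
            // Merge: place the partial result at its lane position.
            System.arraycopy(part, 0, merged, start, active);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] index = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        for (int maxVectorSize : new int[] {16, 32, 64, 128}) {
            byte[] v = splitGather(arr, index, 16, maxVectorSize);
            // Each line prints dcebfhfcaghbpklj (lane 0 first).
            System.out.println(new String(v, StandardCharsets.US_ASCII));
        }
    }
}
```

All four `MaxVectorSize` settings yield the same 16 bytes; only the number of simulated gather-loads (4, 2, 1, 1) differs.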

### Main changes:
1. Java-side API refactoring:
   - The Java side may already generate multiple index vectors for index 
     checking. This patch passes all of them to HotSpot, which eliminates the 
     duplicate index vectors otherwise generated for the vector gather load 
     operations on architectures like AArch64. The existing IGVN cannot remove 
     these duplicates because the index vectors generated on the Java side and 
     those generated during compiler intrinsification are under different 
     control flow.
2. C2 compiler IR refactoring:
   - Generate different IR patterns for different architectures such as 
     AArch64 and X86, based on their different index requirements.
   - Added two new IR nodes to the C2 compiler that help implement each part 
     of a vector gather operation and merge the partial results at the end.
   - Refactored the `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword 
     types. This patch removes the memory offset input and instead adds it to 
     the memory base `addr` at the IR level for architectures that need the 
     index array, such as X86. This not only simplifies the backend 
     implementation but also saves some add operations. Additionally, it 
     unifies the IR for all types.
3. Backend changes:
   - Added SVE match rules for subword gather load operations and the newly 
     added IR nodes.
   - Refined the X86 implementation of subword gather, since the offset input 
     has been removed at the IR level.
4. Test:
   - Added IR tests for verification.

### Testing:
- Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
- Passed Vector API tests with all `UseAVX` flags on X86 and `UseSVE` flags on AArch64
- No regressions found

### Performance:
The performance of the corresponding JMH benchmarks improves by 3-11x on an 
NVIDIA GRACE CPU, which is a 128-bit SVE2 architecture. The performance data 
follows:


Benchmark                                                (SIZE)   Mode Cnt  Units    Before     After    Gain
GatherOperationsBenchmark.microByteGather128                 64  thrpt  30  ops/ms  13447.414 43184.611  3.21
GatherOperationsBenchmark.microByteGather128                256  thrpt  30  ops/ms   3361.944 11165.006  3.32
GatherOperationsBenchmark.microByteGather128               1024  thrpt  30  ops/ms    843.501  2830.108  3.35
GatherOperationsBenchmark.microByteGather128               4096  thrpt  30  ops/ms    211.096   712.958  3.37
GatherOperationsBenchmark.microByteGather128_MASK            64  thrpt  30  ops/ms  10627.297 42818.402  4.02
GatherOperationsBenchmark.microByteGather128_MASK           256  thrpt  30  ops/ms   2675.144 11055.874  4.13
GatherOperationsBenchmark.microByteGather128_MASK          1024  thrpt  30  ops/ms    677.742  2783.920  4.10
GatherOperationsBenchmark.microByteGather128_MASK          4096  thrpt  30  ops/ms    169.416   686.783  4.05
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF     64  thrpt  30  ops/ms  10592.545 42282.802  3.99
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF    256  thrpt  30  ops/ms   2680.060 11039.563  4.11
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   1024  thrpt  30  ops/ms    678.941  2790.252  4.10
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   4096  thrpt  30  ops/ms    169.985   691.157  4.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF          64  thrpt  30  ops/ms  13538.308 42954.988  3.17
GatherOperationsBenchmark.microByteGather128_NZ_OFF         256  thrpt  30  ops/ms   3414.237 11227.333  3.28
GatherOperationsBenchmark.microByteGather128_NZ_OFF        1024  thrpt  30  ops/ms    850.098  2821.821  3.31
GatherOperationsBenchmark.microByteGather128_NZ_OFF        4096  thrpt  30  ops/ms    213.295   705.015  3.30
GatherOperationsBenchmark.microByteGather64                  64  thrpt  30  ops/ms   8705.935 44213.982  5.07
GatherOperationsBenchmark.microByteGather64                 256  thrpt  30  ops/ms   2186.620 11407.364  5.21
GatherOperationsBenchmark.microByteGather64                1024  thrpt  30  ops/ms    545.364  2845.370  5.21
GatherOperationsBenchmark.microByteGather64                4096  thrpt  30  ops/ms    136.376   718.532  5.26
GatherOperationsBenchmark.microByteGather64_MASK             64  thrpt  30  ops/ms   6530.636 42053.044  6.43
GatherOperationsBenchmark.microByteGather64_MASK            256  thrpt  30  ops/ms   1644.069 11323.223  6.88
GatherOperationsBenchmark.microByteGather64_MASK           1024  thrpt  30  ops/ms    416.093  2844.712  6.83
GatherOperationsBenchmark.microByteGather64_MASK           4096  thrpt  30  ops/ms    105.777   716.685  6.77
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF      64  thrpt  30  ops/ms   6619.260 42204.919  6.37
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF     256  thrpt  30  ops/ms   1668.304 11318.298  6.78
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    1024  thrpt  30  ops/ms    422.085  2844.398  6.73
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    4096  thrpt  30  ops/ms    105.722   716.543  6.77
GatherOperationsBenchmark.microByteGather64_NZ_OFF           64  thrpt  30  ops/ms   8754.073 44232.985  5.05
GatherOperationsBenchmark.microByteGather64_NZ_OFF          256  thrpt  30  ops/ms   2195.009 11408.702  5.19
GatherOperationsBenchmark.microByteGather64_NZ_OFF         1024  thrpt  30  ops/ms    546.530  2845.369  5.20
GatherOperationsBenchmark.microByteGather64_NZ_OFF         4096  thrpt  30  ops/ms    137.713   718.391  5.21
GatherOperationsBenchmark.microShortGather128                64  thrpt  30  ops/ms   8695.558 33438.398  3.84
GatherOperationsBenchmark.microShortGather128               256  thrpt  30  ops/ms   2189.766  8533.643  3.89
GatherOperationsBenchmark.microShortGather128              1024  thrpt  30  ops/ms    546.322  2145.239  3.92
GatherOperationsBenchmark.microShortGather128              4096  thrpt  30  ops/ms    136.503   537.493  3.93
GatherOperationsBenchmark.microShortGather128_MASK           64  thrpt  30  ops/ms   6656.883 33571.619  5.04
GatherOperationsBenchmark.microShortGather128_MASK          256  thrpt  30  ops/ms   1649.233  8533.728  5.17
GatherOperationsBenchmark.microShortGather128_MASK         1024  thrpt  30  ops/ms    421.687  2135.280  5.06
GatherOperationsBenchmark.microShortGather128_MASK         4096  thrpt  30  ops/ms    105.355   537.418  5.10
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF    64  thrpt  30  ops/ms   6675.782 33441.402  5.00
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF   256  thrpt  30  ops/ms   1681.000  8532.770  5.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  1024  thrpt  30  ops/ms    424.024  2135.485  5.03
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  4096  thrpt  30  ops/ms    106.507   537.674  5.04
GatherOperationsBenchmark.microShortGather128_NZ_OFF         64  thrpt  30  ops/ms   8796.279 33441.738  3.80
GatherOperationsBenchmark.microShortGather128_NZ_OFF        256  thrpt  30  ops/ms   2198.774  8562.333  3.89
GatherOperationsBenchmark.microShortGather128_NZ_OFF       1024  thrpt  30  ops/ms    546.991  2133.496  3.90
GatherOperationsBenchmark.microShortGather128_NZ_OFF       4096  thrpt  30  ops/ms    137.191   537.390  3.91
GatherOperationsBenchmark.microShortGather64                 64  thrpt  30  ops/ms   5286.569 38042.434  7.19
GatherOperationsBenchmark.microShortGather64                256  thrpt  30  ops/ms   1312.778  9755.474  7.43
GatherOperationsBenchmark.microShortGather64               1024  thrpt  30  ops/ms    327.475  2450.755  7.48
GatherOperationsBenchmark.microShortGather64               4096  thrpt  30  ops/ms     82.490   613.481  7.43
GatherOperationsBenchmark.microShortGather64_MASK            64  thrpt  30  ops/ms   3525.102 37622.086  10.67
GatherOperationsBenchmark.microShortGather64_MASK           256  thrpt  30  ops/ms    877.877  9740.673  11.09
GatherOperationsBenchmark.microShortGather64_MASK          1024  thrpt  30  ops/ms    219.688  2446.063  11.13
GatherOperationsBenchmark.microShortGather64_MASK          4096  thrpt  30  ops/ms     54.935   613.137  11.16
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF     64  thrpt  30  ops/ms   3509.264 35147.895  10.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF    256  thrpt  30  ops/ms    880.523  9733.536  11.05
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   1024  thrpt  30  ops/ms    220.578  2465.951  11.17
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   4096  thrpt  30  ops/ms     55.790   620.465  11.12
GatherOperationsBenchmark.microShortGather64_NZ_OFF          64  thrpt  30  ops/ms   5271.218 35543.510  6.74
GatherOperationsBenchmark.microShortGather64_NZ_OFF         256  thrpt  30  ops/ms   1318.470  9735.321  7.38
GatherOperationsBenchmark.microShortGather64_NZ_OFF        1024  thrpt  30  ops/ms    328.695  2466.311  7.50
GatherOperationsBenchmark.microShortGather64_NZ_OFF        4096  thrpt  30  ops/ms     81.959   621.065  7.57



Here is the performance data on an X86 AVX-512 system, where performance 
improves by up to 39%:


Benchmark                                                (SIZE)   Mode Cnt  Units    Before      After    Gain
GatherOperationsBenchmark.microByteGather128                 64  thrpt  30  ops/ms  44205.252  46829.437  1.05
GatherOperationsBenchmark.microByteGather128                256  thrpt  30  ops/ms  11243.202  12256.211  1.09
GatherOperationsBenchmark.microByteGather128               1024  thrpt  30  ops/ms   2824.094   3096.282  1.09
GatherOperationsBenchmark.microByteGather128               4096  thrpt  30  ops/ms    706.040    776.444  1.09
GatherOperationsBenchmark.microByteGather128_MASK            64  thrpt  30  ops/ms  46911.410  46321.310  0.98
GatherOperationsBenchmark.microByteGather128_MASK           256  thrpt  30  ops/ms  12850.712  12898.541  1.00
GatherOperationsBenchmark.microByteGather128_MASK          1024  thrpt  30  ops/ms   3099.038   3240.863  1.04
GatherOperationsBenchmark.microByteGather128_MASK          4096  thrpt  30  ops/ms    795.265    832.990  1.04
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF     64  thrpt  30  ops/ms  43065.930  47164.936  1.09
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF    256  thrpt  30  ops/ms  11537.805  13190.759  1.14
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   1024  thrpt  30  ops/ms   2763.036   3304.582  1.19
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   4096  thrpt  30  ops/ms    722.374    843.458  1.16
GatherOperationsBenchmark.microByteGather128_NZ_OFF          64  thrpt  30  ops/ms  44145.297  46845.845  1.06
GatherOperationsBenchmark.microByteGather128_NZ_OFF         256  thrpt  30  ops/ms  12172.421  12241.941  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF        1024  thrpt  30  ops/ms   3097.042   3100.228  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF        4096  thrpt  30  ops/ms    776.453    775.881  0.99
GatherOperationsBenchmark.microByteGather64                  64  thrpt  30  ops/ms  58541.178  59464.156  1.01
GatherOperationsBenchmark.microByteGather64                 256  thrpt  30  ops/ms  16063.284  17360.858  1.08
GatherOperationsBenchmark.microByteGather64                1024  thrpt  30  ops/ms   4126.798   4471.636  1.08
GatherOperationsBenchmark.microByteGather64                4096  thrpt  30  ops/ms   1045.116   1125.219  1.07
GatherOperationsBenchmark.microByteGather64_MASK             64  thrpt  30  ops/ms  35344.320  49062.831  1.38
GatherOperationsBenchmark.microByteGather64_MASK            256  thrpt  30  ops/ms  11946.622  13550.297  1.13
GatherOperationsBenchmark.microByteGather64_MASK           1024  thrpt  30  ops/ms   3275.053   3359.737  1.02
GatherOperationsBenchmark.microByteGather64_MASK           4096  thrpt  30  ops/ms    844.575    858.487  1.01
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF      64  thrpt  30  ops/ms  43550.522  48875.831  1.12
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF     256  thrpt  30  ops/ms  12216.995  13522.420  1.10
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    1024  thrpt  30  ops/ms   3053.068   3391.067  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    4096  thrpt  30  ops/ms    753.042    869.774  1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF           64  thrpt  30  ops/ms  52082.307  58847.230  1.12
GatherOperationsBenchmark.microByteGather64_NZ_OFF          256  thrpt  30  ops/ms  14210.930  17389.898  1.22
GatherOperationsBenchmark.microByteGather64_NZ_OFF         1024  thrpt  30  ops/ms   3697.996   4476.988  1.21
GatherOperationsBenchmark.microByteGather64_NZ_OFF         4096  thrpt  30  ops/ms    921.524   1125.308  1.22
GatherOperationsBenchmark.microShortGather128                64  thrpt  30  ops/ms  44325.212  44843.853  1.01
GatherOperationsBenchmark.microShortGather128               256  thrpt  30  ops/ms  11675.510  12630.103  1.08
GatherOperationsBenchmark.microShortGather128              1024  thrpt  30  ops/ms   1260.004   1373.395  1.09
GatherOperationsBenchmark.microShortGather128              4096  thrpt  30  ops/ms    761.857    814.790  1.06
GatherOperationsBenchmark.microShortGather128_MASK           64  thrpt  30  ops/ms  36339.450  36951.803  1.01
GatherOperationsBenchmark.microShortGather128_MASK          256  thrpt  30  ops/ms   9843.842  10018.754  1.01
GatherOperationsBenchmark.microShortGather128_MASK         1024  thrpt  30  ops/ms   2515.702   2595.312  1.03
GatherOperationsBenchmark.microShortGather128_MASK         4096  thrpt  30  ops/ms    616.450    661.402  1.07
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF    64  thrpt  30  ops/ms  34078.747  33712.577  0.98
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF   256  thrpt  30  ops/ms   9018.316   8515.947  0.94
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  1024  thrpt  30  ops/ms   2250.813   2595.847  1.15
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  4096  thrpt  30  ops/ms    563.182    659.087  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF         64  thrpt  30  ops/ms  39909.543  44063.331  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF        256  thrpt  30  ops/ms  10690.582  12437.166  1.16
GatherOperationsBenchmark.microShortGather128_NZ_OFF       1024  thrpt  30  ops/ms   2677.219   3151.078  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF       4096  thrpt  30  ops/ms    681.705    802.929  1.17
GatherOperationsBenchmark.microShortGather64                 64  thrpt  30  ops/ms  45836.789  50883.505  1.11
GatherOperationsBenchmark.microShortGather64                256  thrpt  30  ops/ms  12269.355  13614.567  1.10
GatherOperationsBenchmark.microShortGather64               1024  thrpt  30  ops/ms   3010.548   3437.973  1.14
GatherOperationsBenchmark.microShortGather64               4096  thrpt  30  ops/ms    734.634    899.070  1.22
GatherOperationsBenchmark.microShortGather64_MASK            64  thrpt  30  ops/ms  39753.487  39319.742  0.98
GatherOperationsBenchmark.microShortGather64_MASK           256  thrpt  30  ops/ms  10615.540  10648.996  1.00
GatherOperationsBenchmark.microShortGather64_MASK          1024  thrpt  30  ops/ms   2653.485   2782.477  1.04
GatherOperationsBenchmark.microShortGather64_MASK          4096  thrpt  30  ops/ms    678.165    686.024  1.01
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF     64  thrpt  30  ops/ms  37742.593  40491.965  1.07
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF    256  thrpt  30  ops/ms  10096.251  11036.785  1.09
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   1024  thrpt  30  ops/ms   2526.374   2812.550  1.11
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   4096  thrpt  30  ops/ms    642.484    656.152  1.02
GatherOperationsBenchmark.microShortGather64_NZ_OFF          64  thrpt  30  ops/ms  40602.930  50921.048  1.25
GatherOperationsBenchmark.microShortGather64_NZ_OFF         256  thrpt  30  ops/ms  10972.083  14151.666  1.28
GatherOperationsBenchmark.microShortGather64_NZ_OFF        1024  thrpt  30  ops/ms   2726.248   3662.293  1.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF        4096  thrpt  30  ops/ms    670.735    933.299  1.39


[1] https://bugs.openjdk.org/browse/JDK-8318650
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector---Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en

-------------

Commit messages:
 - 8351623: VectorAPI: Refactor subword gather load and add SVE implementation

Changes: https://git.openjdk.org/jdk/pull/24679/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24679&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8351623
  Stats: 1367 lines in 34 files changed: 915 ins; 180 del; 272 mod
  Patch: https://git.openjdk.org/jdk/pull/24679.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24679/head:pull/24679

PR: https://git.openjdk.org/jdk/pull/24679
