### Summary:
[JDK-8318650](http://java-service.client.nvidia.com/?q=8318650) added
HotSpot intrinsics for the subword gather load APIs on X86 platforms [1]. This
patch implements the equivalent functionality for AArch64 SVE platforms. In
addition to the AArch64 backend support, it refactors the Java-side API
implementation and the compiler mid-end so that the operations are more
efficient and easier to maintain across architectures.
### Background:
Vector gather load APIs load values from memory addresses calculated by adding
a base pointer to integer indices stored in an int array. SVE provides native
vector gather load instructions for byte/short types that take their indices
from an int vector (see [2][3]).
The number of loaded elements must match the index vector's element count.
Since int elements are 4x/2x larger than byte/short elements, and given
`MaxVectorSize` constraints, the operation may need to be split into
multiple parts.
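As a point of reference, the lane-wise semantics described above can be written as a scalar loop. This is only an illustrative sketch, not code from the patch: the class and method names are hypothetical, and the parameter names simply mirror the gather form of `ByteVector.fromArray(species, a, offset, indexMap, mapOffset)`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical scalar model of a byte gather load (illustration only).
final class ScalarGatherSketch {
    // Lane i of the result is arr[offset + indexMap[mapOffset + i]].
    static byte[] gather(byte[] arr, int offset, int[] indexMap, int mapOffset, int lanes) {
        byte[] result = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            result[i] = arr[offset + indexMap[mapOffset + i]];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] arr = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        int[] index = {3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9};
        byte[] v = gather(arr, 0, index, 0, 16);
        // Lane 0 is printed first, so this is the reverse of the
        // register-order pictures used in the examples of this description.
        System.out.println(new String(v, StandardCharsets.US_ASCII)); // dcebfhfcaghbpklj
    }
}
```

One index is consumed per loaded element, which is why the number of int indices that fit in a register bounds how many bytes or shorts a single gather-load can produce.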
Using a 128-bit byte vector gather load as an example, there are four scenarios
with different `MaxVectorSize`:
1. `MaxVectorSize = 16, byte_vector_size = 16`:
- Can load 4 indices per vector register
- So can finish 4 bytes per gather-load operation
- Requires 4 gather-loads and a final merge
Example:
```
byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
4 gather-load:
idx_v1 = [1 4 2 3] gather_v1 = [0000 0000 0000 becd]
idx_v2 = [2 5 7 5] gather_v2 = [0000 0000 0000 cfhf]
idx_v3 = [1 7 6 0] gather_v3 = [0000 0000 0000 bhga]
idx_v4 = [9 11 10 15] gather_v4 = [0000 0000 0000 jlkp]
merge: v = [jlkp bhga cfhf becd]
```
2. `MaxVectorSize = 32, byte_vector_size = MaxVectorSize / 2`:
- Can load 8 indices per vector register
- So can finish 8 bytes per gather-load operation
- Requires 2 gather-loads and a final merge
Example:
```
byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
2 gather-load:
idx_v1 = [2 5 7 5 1 4 2 3]
idx_v2 = [9 11 10 15 1 7 6 0]
gather_v1 = [0000 0000 0000 0000 0000 0000 cfhf becd]
gather_v2 = [0000 0000 0000 0000 0000 0000 jlkp bhga]
merge: v = [0000 0000 0000 0000 jlkp bhga cfhf becd]
```
3. `MaxVectorSize = 64, byte_vector_size = MaxVectorSize / 4`:
- Can load 16 indices per vector register
- So can finish 16 bytes per gather-load operation
- No splitting required
Example:
```
byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
1 gather-load:
idx_v = [9 11 10 15 1 7 6 0 2 5 7 5 1 4 2 3]
v = [... 0000 0000 0000 0000 jlkp bhga cfhf becd]
```
4. `MaxVectorSize > 64, byte_vector_size < MaxVectorSize / 4`:
- Can load 32+ indices per vector register
- So can finish 16 bytes per gather-load operation
- Requires masking so that only the 16 active elements are loaded, which
keeps memory accesses safe.
Example:
```
byte[] arr = [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, ...]
int[] index = [3, 2, 4, 1, 5, 7, 5, 2, 0, 6, 7, 1, 15, 10, 11, 9]
1 gather-load:
idx_v = [... 0 0 0 0 0 0 0 0 9 11 10 15 1 7 6 0 2 5 7 5 1 4 2 3]
v = [... 0000 0000 0000 0000 0000 jlkp bhga cfhf becd]
```
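The four scenarios above follow from simple arithmetic on the register size. The sketch below is one way to express that arithmetic and is not taken from the patch; the class, method, and parameter names are hypothetical, and all sizes are in bytes.

```java
// Hypothetical helper that reproduces the scenario arithmetic (illustration only).
final class GatherSplitSketch {
    // Number of int indices that fit in one vector register.
    static int indicesPerRegister(int maxVectorSize) {
        return maxVectorSize / Integer.BYTES;
    }

    // Number of gather-loads needed to fill a vector of vectorSize bytes
    // whose elements are elemSize bytes wide (1 for byte, 2 for short).
    static int gatherLoadCount(int maxVectorSize, int vectorSize, int elemSize) {
        int lanes = vectorSize / elemSize;
        int perLoad = Math.min(lanes, indicesPerRegister(maxVectorSize));
        return (lanes + perLoad - 1) / perLoad; // ceiling division
    }

    // True when a register holds more indices than the vector has lanes,
    // so the extra lanes must be masked off to keep memory accesses safe.
    static boolean needsMask(int maxVectorSize, int vectorSize, int elemSize) {
        return indicesPerRegister(maxVectorSize) > vectorSize / elemSize;
    }

    public static void main(String[] args) {
        int[] maxVectorSizes = {16, 32, 64, 128};
        for (int mvs : maxVectorSizes) {
            // 128-bit byte vector: vectorSize = 16, elemSize = 1
            System.out.printf("MaxVectorSize=%d: %d gather-load(s), mask=%b%n",
                    mvs, gatherLoadCount(mvs, 16, 1), needsMask(mvs, 16, 1));
        }
    }
}
```

For the 128-bit byte vector this yields 4, 2, 1, and 1 gather-loads for `MaxVectorSize` values of 16, 32, 64, and 128, with masking needed only in the last case. The same arithmetic gives 2 gather-loads for a 128-bit short vector when `MaxVectorSize = 16`.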
### Main changes:
1. Java-side API refactoring:
- The Java side may already generate multiple index vectors for index
checking. This patch passes all of those index vectors to HotSpot, which
eliminates the duplicate index vectors otherwise created for the vector gather
load operations on architectures like AArch64. Existing IGVN cannot remove the
duplicates because the index vectors generated on the Java side and those
generated during compiler intrinsification sit in different control flow.
2. C2 compiler IR refactoring:
- Generate different IR patterns for different architectures like AArch64
and X86, based on their different index requirements.
- Added two new IR nodes in the C2 compiler that implement each part of the
vector gather operation and merge the partial results at the end.
- Refactored the `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword
types. This patch removes the memory offset input and adds the offset to the
memory base `addr` at the IR level for architectures that need the index array,
like X86. This not only simplifies the backend implementation but also saves
some add operations. Additionally, it unifies the IR for all types.
3. Backend changes:
- Added SVE match rules for subword gather load operations and the newly
added IR nodes.
- Refined the X86 implementation of subword gather, since the offset input
has been removed at the IR level.
4. Test:
- Added IR tests for verification.
### Testing:
- Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
- Passed Vector API tests with all `UseAVX` flags on X86 and `UseSVE` flags
on AArch64
- No regressions found
### Performance:
The corresponding JMH benchmarks improve by 3-11x on an NVIDIA GRACE CPU,
which is a 128-bit SVE2 architecture. The performance data follows:

| Benchmark | (SIZE) | Mode | Cnt | Units | Before | After | Gain |
|---|---|---|---|---|---|---|---|
| GatherOperationsBenchmark.microByteGather128 | 64 | thrpt | 30 | ops/ms | 13447.414 | 43184.611 | 3.21 |
| GatherOperationsBenchmark.microByteGather128 | 256 | thrpt | 30 | ops/ms | 3361.944 | 11165.006 | 3.32 |
| GatherOperationsBenchmark.microByteGather128 | 1024 | thrpt | 30 | ops/ms | 843.501 | 2830.108 | 3.35 |
| GatherOperationsBenchmark.microByteGather128 | 4096 | thrpt | 30 | ops/ms | 211.096 | 712.958 | 3.37 |
| GatherOperationsBenchmark.microByteGather128_MASK | 64 | thrpt | 30 | ops/ms | 10627.297 | 42818.402 | 4.02 |
| GatherOperationsBenchmark.microByteGather128_MASK | 256 | thrpt | 30 | ops/ms | 2675.144 | 11055.874 | 4.13 |
| GatherOperationsBenchmark.microByteGather128_MASK | 1024 | thrpt | 30 | ops/ms | 677.742 | 2783.920 | 4.10 |
| GatherOperationsBenchmark.microByteGather128_MASK | 4096 | thrpt | 30 | ops/ms | 169.416 | 686.783 | 4.05 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 10592.545 | 42282.802 | 3.99 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 2680.060 | 11039.563 | 4.11 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 678.941 | 2790.252 | 4.10 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 169.985 | 691.157 | 4.06 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 64 | thrpt | 30 | ops/ms | 13538.308 | 42954.988 | 3.17 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 256 | thrpt | 30 | ops/ms | 3414.237 | 11227.333 | 3.28 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 850.098 | 2821.821 | 3.31 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 213.295 | 705.015 | 3.30 |
| GatherOperationsBenchmark.microByteGather64 | 64 | thrpt | 30 | ops/ms | 8705.935 | 44213.982 | 5.07 |
| GatherOperationsBenchmark.microByteGather64 | 256 | thrpt | 30 | ops/ms | 2186.620 | 11407.364 | 5.21 |
| GatherOperationsBenchmark.microByteGather64 | 1024 | thrpt | 30 | ops/ms | 545.364 | 2845.370 | 5.21 |
| GatherOperationsBenchmark.microByteGather64 | 4096 | thrpt | 30 | ops/ms | 136.376 | 718.532 | 5.26 |
| GatherOperationsBenchmark.microByteGather64_MASK | 64 | thrpt | 30 | ops/ms | 6530.636 | 42053.044 | 6.43 |
| GatherOperationsBenchmark.microByteGather64_MASK | 256 | thrpt | 30 | ops/ms | 1644.069 | 11323.223 | 6.88 |
| GatherOperationsBenchmark.microByteGather64_MASK | 1024 | thrpt | 30 | ops/ms | 416.093 | 2844.712 | 6.83 |
| GatherOperationsBenchmark.microByteGather64_MASK | 4096 | thrpt | 30 | ops/ms | 105.777 | 716.685 | 6.77 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 6619.260 | 42204.919 | 6.37 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 1668.304 | 11318.298 | 6.78 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 422.085 | 2844.398 | 6.73 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 105.722 | 716.543 | 6.77 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 64 | thrpt | 30 | ops/ms | 8754.073 | 44232.985 | 5.05 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 256 | thrpt | 30 | ops/ms | 2195.009 | 11408.702 | 5.19 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 546.530 | 2845.369 | 5.20 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 137.713 | 718.391 | 5.21 |
| GatherOperationsBenchmark.microShortGather128 | 64 | thrpt | 30 | ops/ms | 8695.558 | 33438.398 | 3.84 |
| GatherOperationsBenchmark.microShortGather128 | 256 | thrpt | 30 | ops/ms | 2189.766 | 8533.643 | 3.89 |
| GatherOperationsBenchmark.microShortGather128 | 1024 | thrpt | 30 | ops/ms | 546.322 | 2145.239 | 3.92 |
| GatherOperationsBenchmark.microShortGather128 | 4096 | thrpt | 30 | ops/ms | 136.503 | 537.493 | 3.93 |
| GatherOperationsBenchmark.microShortGather128_MASK | 64 | thrpt | 30 | ops/ms | 6656.883 | 33571.619 | 5.04 |
| GatherOperationsBenchmark.microShortGather128_MASK | 256 | thrpt | 30 | ops/ms | 1649.233 | 8533.728 | 5.17 |
| GatherOperationsBenchmark.microShortGather128_MASK | 1024 | thrpt | 30 | ops/ms | 421.687 | 2135.280 | 5.06 |
| GatherOperationsBenchmark.microShortGather128_MASK | 4096 | thrpt | 30 | ops/ms | 105.355 | 537.418 | 5.10 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 6675.782 | 33441.402 | 5.00 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 1681.000 | 8532.770 | 5.07 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 424.024 | 2135.485 | 5.03 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 106.507 | 537.674 | 5.04 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 64 | thrpt | 30 | ops/ms | 8796.279 | 33441.738 | 3.80 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 256 | thrpt | 30 | ops/ms | 2198.774 | 8562.333 | 3.89 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 546.991 | 2133.496 | 3.90 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 137.191 | 537.390 | 3.91 |
| GatherOperationsBenchmark.microShortGather64 | 64 | thrpt | 30 | ops/ms | 5286.569 | 38042.434 | 7.19 |
| GatherOperationsBenchmark.microShortGather64 | 256 | thrpt | 30 | ops/ms | 1312.778 | 9755.474 | 7.43 |
| GatherOperationsBenchmark.microShortGather64 | 1024 | thrpt | 30 | ops/ms | 327.475 | 2450.755 | 7.48 |
| GatherOperationsBenchmark.microShortGather64 | 4096 | thrpt | 30 | ops/ms | 82.490 | 613.481 | 7.43 |
| GatherOperationsBenchmark.microShortGather64_MASK | 64 | thrpt | 30 | ops/ms | 3525.102 | 37622.086 | 10.67 |
| GatherOperationsBenchmark.microShortGather64_MASK | 256 | thrpt | 30 | ops/ms | 877.877 | 9740.673 | 11.09 |
| GatherOperationsBenchmark.microShortGather64_MASK | 1024 | thrpt | 30 | ops/ms | 219.688 | 2446.063 | 11.13 |
| GatherOperationsBenchmark.microShortGather64_MASK | 4096 | thrpt | 30 | ops/ms | 54.935 | 613.137 | 11.16 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 3509.264 | 35147.895 | 10.01 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 880.523 | 9733.536 | 11.05 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 220.578 | 2465.951 | 11.17 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 55.790 | 620.465 | 11.12 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 64 | thrpt | 30 | ops/ms | 5271.218 | 35543.510 | 6.74 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 256 | thrpt | 30 | ops/ms | 1318.470 | 9735.321 | 7.38 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 328.695 | 2466.311 | 7.50 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 81.959 | 621.065 | 7.57 |

Here is the performance data on an X86 AVX-512 system, which shows an
improvement of up to 39%.

| Benchmark | (SIZE) | Mode | Cnt | Units | Before | After | Gain |
|---|---|---|---|---|---|---|---|
| GatherOperationsBenchmark.microByteGather128 | 64 | thrpt | 30 | ops/ms | 44205.252 | 46829.437 | 1.05 |
| GatherOperationsBenchmark.microByteGather128 | 256 | thrpt | 30 | ops/ms | 11243.202 | 12256.211 | 1.09 |
| GatherOperationsBenchmark.microByteGather128 | 1024 | thrpt | 30 | ops/ms | 2824.094 | 3096.282 | 1.09 |
| GatherOperationsBenchmark.microByteGather128 | 4096 | thrpt | 30 | ops/ms | 706.040 | 776.444 | 1.09 |
| GatherOperationsBenchmark.microByteGather128_MASK | 64 | thrpt | 30 | ops/ms | 46911.410 | 46321.310 | 0.98 |
| GatherOperationsBenchmark.microByteGather128_MASK | 256 | thrpt | 30 | ops/ms | 12850.712 | 12898.541 | 1.00 |
| GatherOperationsBenchmark.microByteGather128_MASK | 1024 | thrpt | 30 | ops/ms | 3099.038 | 3240.863 | 1.04 |
| GatherOperationsBenchmark.microByteGather128_MASK | 4096 | thrpt | 30 | ops/ms | 795.265 | 832.990 | 1.04 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 43065.930 | 47164.936 | 1.09 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 11537.805 | 13190.759 | 1.14 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 2763.036 | 3304.582 | 1.19 |
| GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 722.374 | 843.458 | 1.16 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 64 | thrpt | 30 | ops/ms | 44145.297 | 46845.845 | 1.06 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 256 | thrpt | 30 | ops/ms | 12172.421 | 12241.941 | 1.00 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 3097.042 | 3100.228 | 1.00 |
| GatherOperationsBenchmark.microByteGather128_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 776.453 | 775.881 | 0.99 |
| GatherOperationsBenchmark.microByteGather64 | 64 | thrpt | 30 | ops/ms | 58541.178 | 59464.156 | 1.01 |
| GatherOperationsBenchmark.microByteGather64 | 256 | thrpt | 30 | ops/ms | 16063.284 | 17360.858 | 1.08 |
| GatherOperationsBenchmark.microByteGather64 | 1024 | thrpt | 30 | ops/ms | 4126.798 | 4471.636 | 1.08 |
| GatherOperationsBenchmark.microByteGather64 | 4096 | thrpt | 30 | ops/ms | 1045.116 | 1125.219 | 1.07 |
| GatherOperationsBenchmark.microByteGather64_MASK | 64 | thrpt | 30 | ops/ms | 35344.320 | 49062.831 | 1.38 |
| GatherOperationsBenchmark.microByteGather64_MASK | 256 | thrpt | 30 | ops/ms | 11946.622 | 13550.297 | 1.13 |
| GatherOperationsBenchmark.microByteGather64_MASK | 1024 | thrpt | 30 | ops/ms | 3275.053 | 3359.737 | 1.02 |
| GatherOperationsBenchmark.microByteGather64_MASK | 4096 | thrpt | 30 | ops/ms | 844.575 | 858.487 | 1.01 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 43550.522 | 48875.831 | 1.12 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 12216.995 | 13522.420 | 1.10 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 3053.068 | 3391.067 | 1.11 |
| GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 753.042 | 869.774 | 1.15 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 64 | thrpt | 30 | ops/ms | 52082.307 | 58847.230 | 1.12 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 256 | thrpt | 30 | ops/ms | 14210.930 | 17389.898 | 1.22 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 3697.996 | 4476.988 | 1.21 |
| GatherOperationsBenchmark.microByteGather64_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 921.524 | 1125.308 | 1.22 |
| GatherOperationsBenchmark.microShortGather128 | 64 | thrpt | 30 | ops/ms | 44325.212 | 44843.853 | 1.01 |
| GatherOperationsBenchmark.microShortGather128 | 256 | thrpt | 30 | ops/ms | 11675.510 | 12630.103 | 1.08 |
| GatherOperationsBenchmark.microShortGather128 | 1024 | thrpt | 30 | ops/ms | 1260.004 | 1373.395 | 1.09 |
| GatherOperationsBenchmark.microShortGather128 | 4096 | thrpt | 30 | ops/ms | 761.857 | 814.790 | 1.06 |
| GatherOperationsBenchmark.microShortGather128_MASK | 64 | thrpt | 30 | ops/ms | 36339.450 | 36951.803 | 1.01 |
| GatherOperationsBenchmark.microShortGather128_MASK | 256 | thrpt | 30 | ops/ms | 9843.842 | 10018.754 | 1.01 |
| GatherOperationsBenchmark.microShortGather128_MASK | 1024 | thrpt | 30 | ops/ms | 2515.702 | 2595.312 | 1.03 |
| GatherOperationsBenchmark.microShortGather128_MASK | 4096 | thrpt | 30 | ops/ms | 616.450 | 661.402 | 1.07 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 34078.747 | 33712.577 | 0.98 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 9018.316 | 8515.947 | 0.94 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 2250.813 | 2595.847 | 1.15 |
| GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 563.182 | 659.087 | 1.17 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 64 | thrpt | 30 | ops/ms | 39909.543 | 44063.331 | 1.10 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 256 | thrpt | 30 | ops/ms | 10690.582 | 12437.166 | 1.16 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 2677.219 | 3151.078 | 1.17 |
| GatherOperationsBenchmark.microShortGather128_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 681.705 | 802.929 | 1.17 |
| GatherOperationsBenchmark.microShortGather64 | 64 | thrpt | 30 | ops/ms | 45836.789 | 50883.505 | 1.11 |
| GatherOperationsBenchmark.microShortGather64 | 256 | thrpt | 30 | ops/ms | 12269.355 | 13614.567 | 1.10 |
| GatherOperationsBenchmark.microShortGather64 | 1024 | thrpt | 30 | ops/ms | 3010.548 | 3437.973 | 1.14 |
| GatherOperationsBenchmark.microShortGather64 | 4096 | thrpt | 30 | ops/ms | 734.634 | 899.070 | 1.22 |
| GatherOperationsBenchmark.microShortGather64_MASK | 64 | thrpt | 30 | ops/ms | 39753.487 | 39319.742 | 0.98 |
| GatherOperationsBenchmark.microShortGather64_MASK | 256 | thrpt | 30 | ops/ms | 10615.540 | 10648.996 | 1.00 |
| GatherOperationsBenchmark.microShortGather64_MASK | 1024 | thrpt | 30 | ops/ms | 2653.485 | 2782.477 | 1.04 |
| GatherOperationsBenchmark.microShortGather64_MASK | 4096 | thrpt | 30 | ops/ms | 678.165 | 686.024 | 1.01 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 64 | thrpt | 30 | ops/ms | 37742.593 | 40491.965 | 1.07 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 256 | thrpt | 30 | ops/ms | 10096.251 | 11036.785 | 1.09 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 2526.374 | 2812.550 | 1.11 |
| GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 642.484 | 656.152 | 1.02 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 64 | thrpt | 30 | ops/ms | 40602.930 | 50921.048 | 1.25 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 256 | thrpt | 30 | ops/ms | 10972.083 | 14151.666 | 1.28 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 1024 | thrpt | 30 | ops/ms | 2726.248 | 3662.293 | 1.34 |
| GatherOperationsBenchmark.microShortGather64_NZ_OFF | 4096 | thrpt | 30 | ops/ms | 670.735 | 933.299 | 1.39 |

[1] https://bugs.openjdk.org/browse/JDK-8318650
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector---Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en
-------------
Commit messages:
- 8351623: VectorAPI: Refactor subword gather load and add SVE implementation
Changes: https://git.openjdk.org/jdk/pull/24679/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24679&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8351623
Stats: 1367 lines in 34 files changed: 915 ins; 180 del; 272 mod
Patch: https://git.openjdk.org/jdk/pull/24679.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/24679/head:pull/24679
PR: https://git.openjdk.org/jdk/pull/24679