JDK-8318650 introduced HotSpot intrinsification of the subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for the AArch64 SVE platform, which natively supports subword gather loads through vector instructions that take an int vector of indices (see [2][3]).
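
For readers less familiar with the API, the following is a minimal scalar sketch of what a subword (byte) gather load computes, modeled on the documented semantics of `ByteVector.fromArray(species, a, offset, indexMap, mapOffset)`, where lane `i` reads `a[offset + indexMap[mapOffset + i]]`. The class `GatherModel` and its methods are hypothetical and exist only for illustration; they are not part of the JDK or of this patch. The sketch computes the per-lane int indices once, range-checks them, and then uses the same indices for the loads.

```java
import java.util.Arrays;

// Hypothetical scalar model of a byte gather load; not JDK code.
public class GatherModel {

    // Computes the per-lane element indices (the "index vector") and
    // range-checks each one against the length of the source array.
    public static int[] laneIndices(int lanes, int offset, int[] indexMap,
                                    int mapOffset, int arrayLength) {
        int[] vix = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            vix[i] = offset + indexMap[mapOffset + i];
            if (vix[i] < 0 || vix[i] >= arrayLength) {
                throw new IndexOutOfBoundsException("lane " + i + ": index " + vix[i]);
            }
        }
        return vix;
    }

    // Performs the gather using already validated indices:
    // lane i of the result is a[vix[i]].
    public static byte[] gather(byte[] a, int[] vix) {
        byte[] result = new byte[vix.length];
        for (int i = 0; i < vix.length; i++) {
            result[i] = a[vix[i]];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = {10, 11, 12, 13, 14, 15, 16, 17};
        int[] indexMap = {3, 0, 2, 1};
        // Four lanes, array offset 2, index map read from position 0.
        int[] vix = laneIndices(4, 2, indexMap, 0, a.length);
        System.out.println(Arrays.toString(gather(a, vix)));
    }
}
```

On hardware with a native gather instruction that takes an int index vector, the values held in `vix` correspond to what that instruction would consume directly.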

Two key areas require improvement:

1. At the Java level, the index vectors generated for range validation could be reused by the subsequent gather load operation on architectures with native vector gather instructions, such as AArch64 SVE. However, the current implementation prevents the compiler from reusing these index vectors because of divergent control flow, which can hurt performance.
2. At the compiler IR level, the additional `offset` input of `LoadVectorGather`/`LoadVectorGatherMasked` for subword types increases IR complexity and complicates the backend implementation. Furthermore, generating `add` instructions before each memory access hurts performance.

This patch refactors the implementation at both the Java level and the compiler mid-end to improve efficiency and maintainability across architectures.

Main changes:

1. Java-side API refactoring:
   - Explicitly passes the generated index vectors to HotSpot, eliminating duplicate index vectors for gather load instructions on architectures like AArch64.
2. C2 compiler IR refactoring:
   - Refactors the `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and folding it into the memory base `addr` at the IR level. This simplifies the backend implementation, reduces the number of add operations, and unifies the IR across all types.
3. Backend changes:
   - Streamlines the X86 implementation of subword gather operations following the removal of the offset input at the IR level.

Performance:

The relevant JMH benchmarks improve by up to 27% on an X86 AVX512 system.
Please see the data below:

Benchmark                                                  Mode   Cnt  Unit    SIZE     Before      After  Gain
GatherOperationsBenchmark.microByteGather128               thrpt   30  ops/ms    64  53682.012  52650.325  0.98
GatherOperationsBenchmark.microByteGather128               thrpt   30  ops/ms   256  14484.252  14255.156  0.98
GatherOperationsBenchmark.microByteGather128               thrpt   30  ops/ms  1024   3664.900   3595.615  0.98
GatherOperationsBenchmark.microByteGather128               thrpt   30  ops/ms  4096    908.312    935.269  1.02
GatherOperationsBenchmark.microByteGather128_MASK          thrpt   30  ops/ms    64  43040.148  44605.580  1.03
GatherOperationsBenchmark.microByteGather128_MASK          thrpt   30  ops/ms   256  12445.650  12928.102  1.03
GatherOperationsBenchmark.microByteGather128_MASK          thrpt   30  ops/ms  1024   3143.728   3294.173  1.04
GatherOperationsBenchmark.microByteGather128_MASK          thrpt   30  ops/ms  4096    801.516    842.951  1.05
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   thrpt   30  ops/ms    64  40379.343  45255.490  1.12
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   thrpt   30  ops/ms   256  11103.537  12971.581  1.16
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   thrpt   30  ops/ms  1024   2767.870   3299.453  1.19
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF   thrpt   30  ops/ms  4096    704.610    840.908  1.19
GatherOperationsBenchmark.microByteGather128_NZ_OFF        thrpt   30  ops/ms    64  49066.340  53365.591  1.08
GatherOperationsBenchmark.microByteGather128_NZ_OFF        thrpt   30  ops/ms   256  14063.326  14286.067  1.01
GatherOperationsBenchmark.microByteGather128_NZ_OFF        thrpt   30  ops/ms  1024   3617.992   3621.272  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF        thrpt   30  ops/ms  4096    861.026    938.055  1.08
GatherOperationsBenchmark.microByteGather256               thrpt   30  ops/ms    64  55844.814  48311.847  0.86
GatherOperationsBenchmark.microByteGather256               thrpt   30  ops/ms   256  15139.459  13009.848  0.85
GatherOperationsBenchmark.microByteGather256               thrpt   30  ops/ms  1024   3861.834   3284.944  0.85
GatherOperationsBenchmark.microByteGather256               thrpt   30  ops/ms  4096    938.665    817.673  0.87
GatherOperationsBenchmark.microByteGather256_MASK          thrpt   30  ops/ms    64  43942.924  43144.065  0.98
GatherOperationsBenchmark.microByteGather256_MASK          thrpt   30  ops/ms   256  12461.170  11580.981  0.92
GatherOperationsBenchmark.microByteGather256_MASK          thrpt   30  ops/ms  1024   3168.598   2945.698  0.92
GatherOperationsBenchmark.microByteGather256_MASK          thrpt   30  ops/ms  4096    803.515    738.049  0.91
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF   thrpt   30  ops/ms    64  42197.440  43209.913  1.02
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF   thrpt   30  ops/ms   256  11456.761  11713.265  1.02
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF   thrpt   30  ops/ms  1024   2732.576   2949.724  1.07
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF   thrpt   30  ops/ms  4096    726.062    744.774  1.02
GatherOperationsBenchmark.microByteGather256_NZ_OFF        thrpt   30  ops/ms    64  52915.781  49520.027  0.93
GatherOperationsBenchmark.microByteGather256_NZ_OFF        thrpt   30  ops/ms   256  14481.921  13496.835  0.93
GatherOperationsBenchmark.microByteGather256_NZ_OFF        thrpt   30  ops/ms  1024   3632.065   3362.372  0.92
GatherOperationsBenchmark.microByteGather256_NZ_OFF        thrpt   30  ops/ms  4096    892.825    845.809  0.94
GatherOperationsBenchmark.microByteGather512               thrpt   30  ops/ms    64  54528.404  54478.751  0.99
GatherOperationsBenchmark.microByteGather512               thrpt   30  ops/ms   256  15018.181  14673.727  0.97
GatherOperationsBenchmark.microByteGather512               thrpt   30  ops/ms  1024   3824.690   3589.530  0.93
GatherOperationsBenchmark.microByteGather512               thrpt   30  ops/ms  4096    923.601    906.245  0.98
GatherOperationsBenchmark.microByteGather512_MASK          thrpt   30  ops/ms    64  41248.192  42201.455  1.02
GatherOperationsBenchmark.microByteGather512_MASK          thrpt   30  ops/ms   256  11481.408  11559.655  1.00
GatherOperationsBenchmark.microByteGather512_MASK          thrpt   30  ops/ms  1024   2901.592   2912.954  1.00
GatherOperationsBenchmark.microByteGather512_MASK          thrpt   30  ops/ms  4096    732.899    730.381  0.99
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF   thrpt   30  ops/ms    64  42287.123  43779.227  1.03
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF   thrpt   30  ops/ms   256  11486.167  11448.966  0.99
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF   thrpt   30  ops/ms  1024   2888.047   2928.612  1.01
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF   thrpt   30  ops/ms  4096    731.056    738.300  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF        thrpt   30  ops/ms    64  51777.670  54368.797  1.05
GatherOperationsBenchmark.microByteGather512_NZ_OFF        thrpt   30  ops/ms   256  14558.532  14662.164  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF        thrpt   30  ops/ms  1024   3726.910   3714.448  0.99
GatherOperationsBenchmark.microByteGather512_NZ_OFF        thrpt   30  ops/ms  4096    907.863    903.544  0.99
GatherOperationsBenchmark.microByteGather64                thrpt   30  ops/ms    64  52980.507  54970.689  1.03
GatherOperationsBenchmark.microByteGather64                thrpt   30  ops/ms   256  15044.443  15828.237  1.05
GatherOperationsBenchmark.microByteGather64                thrpt   30  ops/ms  1024   3869.028   4098.172  1.05
GatherOperationsBenchmark.microByteGather64                thrpt   30  ops/ms  4096    912.372   1002.065  1.09
GatherOperationsBenchmark.microByteGather64_MASK           thrpt   30  ops/ms    64  44267.641  45864.381  1.03
GatherOperationsBenchmark.microByteGather64_MASK           thrpt   30  ops/ms   256  12303.206  12920.113  1.05
GatherOperationsBenchmark.microByteGather64_MASK           thrpt   30  ops/ms  1024   3100.867   3115.636  1.00
GatherOperationsBenchmark.microByteGather64_MASK           thrpt   30  ops/ms  4096    792.004    832.623  1.05
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    thrpt   30  ops/ms    64  40417.638  45844.634  1.13
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    thrpt   30  ops/ms   256  11628.508  12913.170  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    thrpt   30  ops/ms  1024   2911.508   3260.388  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF    thrpt   30  ops/ms  4096    709.017    835.084  1.17
GatherOperationsBenchmark.microByteGather64_NZ_OFF         thrpt   30  ops/ms    64  48868.987  53585.210  1.09
GatherOperationsBenchmark.microByteGather64_NZ_OFF         thrpt   30  ops/ms   256  13617.963  15754.029  1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF         thrpt   30  ops/ms  1024   3504.745   3857.926  1.10
GatherOperationsBenchmark.microByteGather64_NZ_OFF         thrpt   30  ops/ms  4096    818.439    958.751  1.17
GatherOperationsBenchmark.microShortGather128              thrpt   30  ops/ms    64  41351.719  44337.947  1.07
GatherOperationsBenchmark.microShortGather128              thrpt   30  ops/ms   256  11175.501  12302.557  1.10
GatherOperationsBenchmark.microShortGather128              thrpt   30  ops/ms  1024   2854.546   3158.973  1.10
GatherOperationsBenchmark.microShortGather128              thrpt   30  ops/ms  4096    744.816    790.304  1.06
GatherOperationsBenchmark.microShortGather128_MASK         thrpt   30  ops/ms    64  35012.934  35728.068  1.02
GatherOperationsBenchmark.microShortGather128_MASK         thrpt   30  ops/ms   256   9408.162   9854.849  1.04
GatherOperationsBenchmark.microShortGather128_MASK         thrpt   30  ops/ms  1024   2352.723   2489.161  1.05
GatherOperationsBenchmark.microShortGather128_MASK         thrpt   30  ops/ms  4096    595.827    634.225  1.06
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  thrpt   30  ops/ms    64  31405.646  35728.077  1.13
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  thrpt   30  ops/ms   256   8459.702   9865.482  1.16
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  thrpt   30  ops/ms  1024   2095.461   2489.927  1.18
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF  thrpt   30  ops/ms  4096    535.715    631.614  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF       thrpt   30  ops/ms    64  39996.604  43811.259  1.09
GatherOperationsBenchmark.microShortGather128_NZ_OFF       thrpt   30  ops/ms   256  11058.636  12261.463  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF       thrpt   30  ops/ms  1024   2847.482   3157.450  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF       thrpt   30  ops/ms  4096    712.089    790.143  1.10
GatherOperationsBenchmark.microShortGather256              thrpt   30  ops/ms    64  51893.730  51975.295  1.00
GatherOperationsBenchmark.microShortGather256              thrpt   30  ops/ms   256  14226.104  14720.390  1.03
GatherOperationsBenchmark.microShortGather256              thrpt   30  ops/ms  1024   3491.958   3714.266  1.06
GatherOperationsBenchmark.microShortGather256              thrpt   30  ops/ms  4096    852.278    905.330  1.06
GatherOperationsBenchmark.microShortGather256_MASK         thrpt   30  ops/ms    64  38736.351  41797.516  1.07
GatherOperationsBenchmark.microShortGather256_MASK         thrpt   30  ops/ms   256  10250.508  11790.235  1.15
GatherOperationsBenchmark.microShortGather256_MASK         thrpt   30  ops/ms  1024   2558.449   2956.936  1.15
GatherOperationsBenchmark.microShortGather256_MASK         thrpt   30  ops/ms  4096    648.882    745.885  1.14
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF  thrpt   30  ops/ms    64  38315.594  39547.847  1.03
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF  thrpt   30  ops/ms   256  10471.955  11779.499  1.12
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF  thrpt   30  ops/ms  1024   2618.623   2679.970  1.02
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF  thrpt   30  ops/ms  4096    655.803    760.392  1.15
GatherOperationsBenchmark.microShortGather256_NZ_OFF       thrpt   30  ops/ms    64  47674.080  51325.185  1.07
GatherOperationsBenchmark.microShortGather256_NZ_OFF       thrpt   30  ops/ms   256  13446.700  14438.516  1.07
GatherOperationsBenchmark.microShortGather256_NZ_OFF       thrpt   30  ops/ms  1024   3371.433   3664.720  1.08
GatherOperationsBenchmark.microShortGather256_NZ_OFF       thrpt   30  ops/ms  4096    814.540    895.182  1.09
GatherOperationsBenchmark.microShortGather512              thrpt   30  ops/ms    64  48183.553  48374.790  1.01
GatherOperationsBenchmark.microShortGather512              thrpt   30  ops/ms   256  13669.806  12940.433  0.94
GatherOperationsBenchmark.microShortGather512              thrpt   30  ops/ms  1024   3371.708   3318.627  0.98
GatherOperationsBenchmark.microShortGather512              thrpt   30  ops/ms  4096    847.620    805.313  0.95
GatherOperationsBenchmark.microShortGather512_MASK         thrpt   30  ops/ms    64  39566.443  42845.296  1.08
GatherOperationsBenchmark.microShortGather512_MASK         thrpt   30  ops/ms   256  11926.440  10308.223  0.86
GatherOperationsBenchmark.microShortGather512_MASK         thrpt   30  ops/ms  1024   3008.542   2546.197  0.84
GatherOperationsBenchmark.microShortGather512_MASK         thrpt   30  ops/ms  4096    764.497    647.276  0.84
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF  thrpt   30  ops/ms    64  38106.800  42835.120  1.12
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF  thrpt   30  ops/ms   256  10405.171  11125.164  1.06
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF  thrpt   30  ops/ms  1024   2526.827   2799.209  1.10
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF  thrpt   30  ops/ms  4096    655.044    715.519  1.09
GatherOperationsBenchmark.microShortGather512_NZ_OFF       thrpt   30  ops/ms    64  48108.682  46654.427  0.96
GatherOperationsBenchmark.microShortGather512_NZ_OFF       thrpt   30  ops/ms   256  13197.197  12957.497  0.98
GatherOperationsBenchmark.microShortGather512_NZ_OFF       thrpt   30  ops/ms  1024   3397.959   3244.415  0.95
GatherOperationsBenchmark.microShortGather512_NZ_OFF       thrpt   30  ops/ms  4096    824.034    820.536  0.99
GatherOperationsBenchmark.microShortGather64               thrpt   30  ops/ms    64  44815.622  46913.289  1.04
GatherOperationsBenchmark.microShortGather64               thrpt   30  ops/ms   256  12317.166  13536.731  1.09
GatherOperationsBenchmark.microShortGather64               thrpt   30  ops/ms  1024   3157.683   3539.991  1.12
GatherOperationsBenchmark.microShortGather64               thrpt   30  ops/ms  4096    775.626    878.304  1.13
GatherOperationsBenchmark.microShortGather64_MASK          thrpt   30  ops/ms    64  37064.157  35649.776  0.96
GatherOperationsBenchmark.microShortGather64_MASK          thrpt   30  ops/ms   256  10120.291  9403.1319  0.92
GatherOperationsBenchmark.microShortGather64_MASK          thrpt   30  ops/ms  1024   2546.723   2642.781  1.03
GatherOperationsBenchmark.microShortGather64_MASK          thrpt   30  ops/ms  4096    644.270    648.432  1.00
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   thrpt   30  ops/ms    64  34386.819  37883.550  1.10
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   thrpt   30  ops/ms   256   9316.097  10500.473  1.12
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   thrpt   30  ops/ms  1024   2344.570   2643.114  1.12
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF   thrpt   30  ops/ms  4096    594.445    595.301  1.00
GatherOperationsBenchmark.microShortGather64_NZ_OFF        thrpt   30  ops/ms    64  40240.772  48435.477  1.20
GatherOperationsBenchmark.microShortGather64_NZ_OFF        thrpt   30  ops/ms   256  11082.392  13736.985  1.23
GatherOperationsBenchmark.microShortGather64_NZ_OFF        thrpt   30  ops/ms  1024   2777.065   3549.704  1.27
GatherOperationsBenchmark.microShortGather64_NZ_OFF        thrpt   30  ops/ms  4096    697.671    877.411  1.25

Note that this patch is split from https://github.com/openjdk/jdk/pull/24679. A follow-up PR will implement the SVE subword gather load operations after this PR is merged.

[1] https://bugs.openjdk.org/browse/JDK-8318650
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector---Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en

-------------

Commit messages:
 - 8355563: VectorAPI: Refactor current implementation of subword gather load API

Changes: https://git.openjdk.org/jdk/pull/25138/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25138&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8355563
  Stats: 441 lines in 15 files changed: 105 ins; 176 del; 160 mod
  Patch: https://git.openjdk.org/jdk/pull/25138.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/25138/head:pull/25138

PR: https://git.openjdk.org/jdk/pull/25138