JDK-8318650 introduced HotSpot intrinsification of the subword gather load APIs
for x86 platforms [1]. However, the current implementation is not optimal for
AArch64 SVE, which natively supports subword gather loads that take an int
vector of indices (see [2][3]).

Two key areas require improvement:
1. At the Java level, the index vectors generated for range validation could be
reused for the subsequent gather load on architectures with native gather
instructions, such as AArch64 SVE. However, the current implementation places
the two uses on divergent control-flow paths, so the compiler cannot reuse the
index vectors, which may hurt performance.
2. At the compiler IR level, the additional `offset` input of
`LoadVectorGather`/`LoadVectorGatherMasked` for subword types increases IR
complexity and complicates the backend implementation. Furthermore, generating
`add` instructions before each memory access hurts performance.
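
To illustrate the first point, here is a minimal scalar sketch of the
"compute the indices once, validate them, then reuse them for the load" pattern.
It is a simplified model only, not the Vector API or HotSpot code: the class
`GatherIndexReuseSketch` and its helpers are hypothetical names, and plain
arrays stand in for vectors.

```java
final class GatherIndexReuseSketch {
    // Hypothetical helper: materialize the per-lane indices exactly once.
    static int[] laneIndices(int offset, int[] indexMap, int mapOffset, int lanes) {
        int[] vix = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            vix[i] = offset + indexMap[mapOffset + i];
        }
        return vix;
    }

    // Range validation over the already computed indices.
    static void checkIndices(int[] vix, int length) {
        for (int ix : vix) {
            if (ix < 0 || ix >= length) {
                throw new IndexOutOfBoundsException("index " + ix + " out of bounds for length " + length);
            }
        }
    }

    // Scalar stand-in for a gather load: one element per lane index.
    static byte[] gather(byte[] a, int[] vix) {
        byte[] out = new byte[vix.length];
        for (int i = 0; i < vix.length; i++) {
            out[i] = a[vix[i]];
        }
        return out;
    }

    // The same index array feeds both the check and the load, on one
    // straight-line path, so nothing has to be recomputed.
    static byte[] fromArray(byte[] a, int offset, int[] indexMap, int mapOffset, int lanes) {
        int[] vix = laneIndices(offset, indexMap, mapOffset, lanes);
        checkIndices(vix, a.length);
        return gather(a, vix);
    }
}
```
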

This patch refactors the implementation at both the Java level and compiler 
mid-end to improve efficiency and maintainability across different 
architectures.

Main changes:
1. Java-side API refactoring:
   - Explicitly passes the generated index vectors to HotSpot, eliminating
     duplicate index vectors for gather load instructions on architectures
     like AArch64.
2. C2 compiler IR refactoring:
   - Refactors the `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword
     types by removing the memory offset input and incorporating it into the
     memory base `addr` at the IR level. This simplifies the backend
     implementation, reduces add operations, and unifies the IR across all
     types.
3. Backend changes:
   - Streamlines the x86 implementation of subword gather operations following
     the removal of the offset input from the IR level.
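
The effect of folding the offset into the base address can be sketched with a
scalar model. This is an illustration under simplifying assumptions, not the C2
IR or any backend code: memory is modeled as a flat `byte[]`, `base`, `offset`
and `addr` are positions in it, and the class and method names are
hypothetical.

```java
final class GatherOffsetFoldSketch {
    // Model of a gather that keeps a separate offset input: every access
    // performs an extra add to combine base, offset and lane index.
    static byte[] gatherWithOffsetInput(byte[] mem, int base, int offset, int[] idx) {
        byte[] out = new byte[idx.length];
        for (int i = 0; i < idx.length; i++) {
            out[i] = mem[base + offset + idx[i]];
        }
        return out;
    }

    // Model of a gather whose address already includes the offset: the add
    // happens once, outside the per-lane accesses.
    static byte[] gatherFoldedBase(byte[] mem, int addr, int[] idx) {
        byte[] out = new byte[idx.length];
        for (int i = 0; i < idx.length; i++) {
            out[i] = mem[addr + idx[i]];
        }
        return out;
    }
}
```

In this model, calling `gatherFoldedBase(mem, base + offset, idx)` returns the
same lanes as `gatherWithOffsetInput(mem, base, offset, idx)`.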

Performance:
The relevant JMH benchmarks improve by up to 27% on an x86 AVX-512 system,
while a few cases regress (Gain as low as 0.84). Please see the data below:

Benchmark                                                 Mode   Cnt Unit    SIZE     Before       After  Gain
GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  64    53682.012   52650.325  0.98
GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  256   14484.252   14255.156  0.98
GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  1024   3664.900    3595.615  0.98
GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  4096    908.312     935.269  1.02
GatherOperationsBenchmark.microByteGather128_MASK         thrpt  30  ops/ms  64    43040.148   44605.580  1.03
GatherOperationsBenchmark.microByteGather128_MASK         thrpt  30  ops/ms  256   12445.650   12928.102  1.03
GatherOperationsBenchmark.microByteGather128_MASK         thrpt  30  ops/ms  1024   3143.728    3294.173  1.04
GatherOperationsBenchmark.microByteGather128_MASK         thrpt  30  ops/ms  4096    801.516     842.951  1.05
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  thrpt  30  ops/ms  64    40379.343   45255.490  1.12
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  thrpt  30  ops/ms  256   11103.537   12971.581  1.16
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  thrpt  30  ops/ms  1024   2767.870    3299.453  1.19
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  thrpt  30  ops/ms  4096    704.610     840.908  1.19
GatherOperationsBenchmark.microByteGather128_NZ_OFF       thrpt  30  ops/ms  64    49066.340   53365.591  1.08
GatherOperationsBenchmark.microByteGather128_NZ_OFF       thrpt  30  ops/ms  256   14063.326   14286.067  1.01
GatherOperationsBenchmark.microByteGather128_NZ_OFF       thrpt  30  ops/ms  1024   3617.992    3621.272  1.00
GatherOperationsBenchmark.microByteGather128_NZ_OFF       thrpt  30  ops/ms  4096    861.026     938.055  1.08
GatherOperationsBenchmark.microByteGather256              thrpt  30  ops/ms  64    55844.814   48311.847  0.86
GatherOperationsBenchmark.microByteGather256              thrpt  30  ops/ms  256   15139.459   13009.848  0.85
GatherOperationsBenchmark.microByteGather256              thrpt  30  ops/ms  1024   3861.834    3284.944  0.85
GatherOperationsBenchmark.microByteGather256              thrpt  30  ops/ms  4096    938.665     817.673  0.87
GatherOperationsBenchmark.microByteGather256_MASK         thrpt  30  ops/ms  64    43942.924   43144.065  0.98
GatherOperationsBenchmark.microByteGather256_MASK         thrpt  30  ops/ms  256   12461.170   11580.981  0.92
GatherOperationsBenchmark.microByteGather256_MASK         thrpt  30  ops/ms  1024   3168.598    2945.698  0.92
GatherOperationsBenchmark.microByteGather256_MASK         thrpt  30  ops/ms  4096    803.515     738.049  0.91
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF  thrpt  30  ops/ms  64    42197.440   43209.913  1.02
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF  thrpt  30  ops/ms  256   11456.761   11713.265  1.02
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF  thrpt  30  ops/ms  1024   2732.576    2949.724  1.07
GatherOperationsBenchmark.microByteGather256_MASK_NZ_OFF  thrpt  30  ops/ms  4096    726.062     744.774  1.02
GatherOperationsBenchmark.microByteGather256_NZ_OFF       thrpt  30  ops/ms  64    52915.781   49520.027  0.93
GatherOperationsBenchmark.microByteGather256_NZ_OFF       thrpt  30  ops/ms  256   14481.921   13496.835  0.93
GatherOperationsBenchmark.microByteGather256_NZ_OFF       thrpt  30  ops/ms  1024   3632.065    3362.372  0.92
GatherOperationsBenchmark.microByteGather256_NZ_OFF       thrpt  30  ops/ms  4096    892.825     845.809  0.94
GatherOperationsBenchmark.microByteGather512              thrpt  30  ops/ms  64    54528.404   54478.751  0.99
GatherOperationsBenchmark.microByteGather512              thrpt  30  ops/ms  256   15018.181   14673.727  0.97
GatherOperationsBenchmark.microByteGather512              thrpt  30  ops/ms  1024   3824.690    3589.530  0.93
GatherOperationsBenchmark.microByteGather512              thrpt  30  ops/ms  4096    923.601     906.245  0.98
GatherOperationsBenchmark.microByteGather512_MASK         thrpt  30  ops/ms  64    41248.192   42201.455  1.02
GatherOperationsBenchmark.microByteGather512_MASK         thrpt  30  ops/ms  256   11481.408   11559.655  1.00
GatherOperationsBenchmark.microByteGather512_MASK         thrpt  30  ops/ms  1024   2901.592    2912.954  1.00
GatherOperationsBenchmark.microByteGather512_MASK         thrpt  30  ops/ms  4096    732.899     730.381  0.99
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF  thrpt  30  ops/ms  64    42287.123   43779.227  1.03
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF  thrpt  30  ops/ms  256   11486.167   11448.966  0.99
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF  thrpt  30  ops/ms  1024   2888.047    2928.612  1.01
GatherOperationsBenchmark.microByteGather512_MASK_NZ_OFF  thrpt  30  ops/ms  4096    731.056     738.300  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF       thrpt  30  ops/ms  64    51777.670   54368.797  1.05
GatherOperationsBenchmark.microByteGather512_NZ_OFF       thrpt  30  ops/ms  256   14558.532   14662.164  1.00
GatherOperationsBenchmark.microByteGather512_NZ_OFF       thrpt  30  ops/ms  1024   3726.910    3714.448  0.99
GatherOperationsBenchmark.microByteGather512_NZ_OFF       thrpt  30  ops/ms  4096    907.863     903.544  0.99
GatherOperationsBenchmark.microByteGather64               thrpt  30  ops/ms  64    52980.507   54970.689  1.03
GatherOperationsBenchmark.microByteGather64               thrpt  30  ops/ms  256   15044.443   15828.237  1.05
GatherOperationsBenchmark.microByteGather64               thrpt  30  ops/ms  1024   3869.028    4098.172  1.05
GatherOperationsBenchmark.microByteGather64               thrpt  30  ops/ms  4096    912.372    1002.065  1.09
GatherOperationsBenchmark.microByteGather64_MASK          thrpt  30  ops/ms  64    44267.641   45864.381  1.03
GatherOperationsBenchmark.microByteGather64_MASK          thrpt  30  ops/ms  256   12303.206   12920.113  1.05
GatherOperationsBenchmark.microByteGather64_MASK          thrpt  30  ops/ms  1024   3100.867    3115.636  1.00
GatherOperationsBenchmark.microByteGather64_MASK          thrpt  30  ops/ms  4096    792.004     832.623  1.05
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   thrpt  30  ops/ms  64    40417.638   45844.634  1.13
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   thrpt  30  ops/ms  256   11628.508   12913.170  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   thrpt  30  ops/ms  1024   2911.508    3260.388  1.11
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   thrpt  30  ops/ms  4096    709.017     835.084  1.17
GatherOperationsBenchmark.microByteGather64_NZ_OFF        thrpt  30  ops/ms  64    48868.987   53585.210  1.09
GatherOperationsBenchmark.microByteGather64_NZ_OFF        thrpt  30  ops/ms  256   13617.963   15754.029  1.15
GatherOperationsBenchmark.microByteGather64_NZ_OFF        thrpt  30  ops/ms  1024   3504.745    3857.926  1.10
GatherOperationsBenchmark.microByteGather64_NZ_OFF        thrpt  30  ops/ms  4096    818.439     958.751  1.17
GatherOperationsBenchmark.microShortGather128             thrpt  30  ops/ms  64    41351.719   44337.947  1.07
GatherOperationsBenchmark.microShortGather128             thrpt  30  ops/ms  256   11175.501   12302.557  1.10
GatherOperationsBenchmark.microShortGather128             thrpt  30  ops/ms  1024   2854.546    3158.973  1.10
GatherOperationsBenchmark.microShortGather128             thrpt  30  ops/ms  4096    744.816     790.304  1.06
GatherOperationsBenchmark.microShortGather128_MASK        thrpt  30  ops/ms  64    35012.934   35728.068  1.02
GatherOperationsBenchmark.microShortGather128_MASK        thrpt  30  ops/ms  256    9408.162    9854.849  1.04
GatherOperationsBenchmark.microShortGather128_MASK        thrpt  30  ops/ms  1024   2352.723    2489.161  1.05
GatherOperationsBenchmark.microShortGather128_MASK        thrpt  30  ops/ms  4096    595.827     634.225  1.06
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF thrpt  30  ops/ms  64    31405.646   35728.077  1.13
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF thrpt  30  ops/ms  256    8459.702    9865.482  1.16
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF thrpt  30  ops/ms  1024   2095.461    2489.927  1.18
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF thrpt  30  ops/ms  4096    535.715     631.614  1.17
GatherOperationsBenchmark.microShortGather128_NZ_OFF      thrpt  30  ops/ms  64    39996.604   43811.259  1.09
GatherOperationsBenchmark.microShortGather128_NZ_OFF      thrpt  30  ops/ms  256   11058.636   12261.463  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF      thrpt  30  ops/ms  1024   2847.482    3157.450  1.10
GatherOperationsBenchmark.microShortGather128_NZ_OFF      thrpt  30  ops/ms  4096    712.089     790.143  1.10
GatherOperationsBenchmark.microShortGather256             thrpt  30  ops/ms  64    51893.730   51975.295  1.00
GatherOperationsBenchmark.microShortGather256             thrpt  30  ops/ms  256   14226.104   14720.390  1.03
GatherOperationsBenchmark.microShortGather256             thrpt  30  ops/ms  1024   3491.958    3714.266  1.06
GatherOperationsBenchmark.microShortGather256             thrpt  30  ops/ms  4096    852.278     905.330  1.06
GatherOperationsBenchmark.microShortGather256_MASK        thrpt  30  ops/ms  64    38736.351   41797.516  1.07
GatherOperationsBenchmark.microShortGather256_MASK        thrpt  30  ops/ms  256   10250.508   11790.235  1.15
GatherOperationsBenchmark.microShortGather256_MASK        thrpt  30  ops/ms  1024   2558.449    2956.936  1.15
GatherOperationsBenchmark.microShortGather256_MASK        thrpt  30  ops/ms  4096    648.882     745.885  1.14
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF thrpt  30  ops/ms  64    38315.594   39547.847  1.03
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF thrpt  30  ops/ms  256   10471.955   11779.499  1.12
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF thrpt  30  ops/ms  1024   2618.623    2679.970  1.02
GatherOperationsBenchmark.microShortGather256_MASK_NZ_OFF thrpt  30  ops/ms  4096    655.803     760.392  1.15
GatherOperationsBenchmark.microShortGather256_NZ_OFF      thrpt  30  ops/ms  64    47674.080   51325.185  1.07
GatherOperationsBenchmark.microShortGather256_NZ_OFF      thrpt  30  ops/ms  256   13446.700   14438.516  1.07
GatherOperationsBenchmark.microShortGather256_NZ_OFF      thrpt  30  ops/ms  1024   3371.433    3664.720  1.08
GatherOperationsBenchmark.microShortGather256_NZ_OFF      thrpt  30  ops/ms  4096    814.540     895.182  1.09
GatherOperationsBenchmark.microShortGather512             thrpt  30  ops/ms  64    48183.553   48374.790  1.01
GatherOperationsBenchmark.microShortGather512             thrpt  30  ops/ms  256   13669.806   12940.433  0.94
GatherOperationsBenchmark.microShortGather512             thrpt  30  ops/ms  1024   3371.708    3318.627  0.98
GatherOperationsBenchmark.microShortGather512             thrpt  30  ops/ms  4096    847.620     805.313  0.95
GatherOperationsBenchmark.microShortGather512_MASK        thrpt  30  ops/ms  64    39566.443   42845.296  1.08
GatherOperationsBenchmark.microShortGather512_MASK        thrpt  30  ops/ms  256   11926.440   10308.223  0.86
GatherOperationsBenchmark.microShortGather512_MASK        thrpt  30  ops/ms  1024   3008.542    2546.197  0.84
GatherOperationsBenchmark.microShortGather512_MASK        thrpt  30  ops/ms  4096    764.497     647.276  0.84
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF thrpt  30  ops/ms  64    38106.800   42835.120  1.12
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF thrpt  30  ops/ms  256   10405.171   11125.164  1.06
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF thrpt  30  ops/ms  1024   2526.827    2799.209  1.10
GatherOperationsBenchmark.microShortGather512_MASK_NZ_OFF thrpt  30  ops/ms  4096    655.044     715.519  1.09
GatherOperationsBenchmark.microShortGather512_NZ_OFF      thrpt  30  ops/ms  64    48108.682   46654.427  0.96
GatherOperationsBenchmark.microShortGather512_NZ_OFF      thrpt  30  ops/ms  256   13197.197   12957.497  0.98
GatherOperationsBenchmark.microShortGather512_NZ_OFF      thrpt  30  ops/ms  1024   3397.959    3244.415  0.95
GatherOperationsBenchmark.microShortGather512_NZ_OFF      thrpt  30  ops/ms  4096    824.034     820.536  0.99
GatherOperationsBenchmark.microShortGather64              thrpt  30  ops/ms  64    44815.622   46913.289  1.04
GatherOperationsBenchmark.microShortGather64              thrpt  30  ops/ms  256   12317.166   13536.731  1.09
GatherOperationsBenchmark.microShortGather64              thrpt  30  ops/ms  1024   3157.683    3539.991  1.12
GatherOperationsBenchmark.microShortGather64              thrpt  30  ops/ms  4096    775.626     878.304  1.13
GatherOperationsBenchmark.microShortGather64_MASK         thrpt  30  ops/ms  64    37064.157   35649.776  0.96
GatherOperationsBenchmark.microShortGather64_MASK         thrpt  30  ops/ms  256   10120.291   9403.1319  0.92
GatherOperationsBenchmark.microShortGather64_MASK         thrpt  30  ops/ms  1024   2546.723    2642.781  1.03
GatherOperationsBenchmark.microShortGather64_MASK         thrpt  30  ops/ms  4096    644.270     648.432  1.00
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  thrpt  30  ops/ms  64    34386.819   37883.550  1.10
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  thrpt  30  ops/ms  256    9316.097   10500.473  1.12
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  thrpt  30  ops/ms  1024   2344.570    2643.114  1.12
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  thrpt  30  ops/ms  4096    594.445     595.301  1.00
GatherOperationsBenchmark.microShortGather64_NZ_OFF       thrpt  30  ops/ms  64    40240.772   48435.477  1.20
GatherOperationsBenchmark.microShortGather64_NZ_OFF       thrpt  30  ops/ms  256   11082.392   13736.985  1.23
GatherOperationsBenchmark.microShortGather64_NZ_OFF       thrpt  30  ops/ms  1024   2777.065    3549.704  1.27
GatherOperationsBenchmark.microShortGather64_NZ_OFF       thrpt  30  ops/ms  4096    697.671     877.411  1.25



Note that this patch is split off from
https://github.com/openjdk/jdk/pull/24679. A follow-up PR will implement the
SVE subword gather load operations after this PR is merged.

[1] https://bugs.openjdk.org/browse/JDK-8318650
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector---Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en

-------------

Commit messages:
 - 8355563: VectorAPI: Refactor current implementation of subword gather load API

Changes: https://git.openjdk.org/jdk/pull/25138/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25138&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8355563
  Stats: 441 lines in 15 files changed: 105 ins; 176 del; 160 mod
  Patch: https://git.openjdk.org/jdk/pull/25138.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/25138/head:pull/25138

PR: https://git.openjdk.org/jdk/pull/25138
