Re: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v8]

erifan Sat, 18 Oct 2025 03:46:21 -0700

On Wed, 20 Aug 2025 10:11:47 GMT, Jatin Bhateja wrote:


>> Patch optimizes the Vector.slice operation with a constant index using the
>> x86 ALIGNR instruction.
>> It also adds a new hybrid call generator to facilitate lazy intrinsification 
>> or else perform procedural inlining to prevent call overhead and boxing 
>> penalties in case the fallback implementation expects to operate over 
>> vectors. The existing vector API-based slice implementation is now the 
>> fallback code that gets inlined in case intrinsification fails.
>> 
>> The idea here is to add infrastructure support to enable intrinsification of
>> fast path for selected vector APIs, else enable inlining of fall-back 
>> implementation if it's based on vector APIs. Existing call generators like 
>> PredictedCallGenerator, used to handle bi-morphic inlining, already make use 
>> of multiple call generators to handle hit/miss scenarios for a particular 
>> receiver type. The newly added hybrid call generator is lazy and called 
>> during incremental inlining optimization. It also relieves the inline
>> expander of handling slow paths, which can easily be implemented on the
>> library side (Java).
>> 
>> Vector API jtreg tests pass at AVX level 2, remaining validation in progress.
>> 
>> Performance numbers:
>> 
>> 
>> System : 13th Gen Intel(R) Core(TM) i3-1315U
>> 
>> Baseline:
>> Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
>> VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
>> VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
>> VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
>> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
>> VectorSliceB...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update callGenerator.hpp copyright year
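
For reference while reading the comments below, the lane semantics being optimized can be sketched in scalar Java. This is only an illustrative model, not code from the patch: `SliceModel` and `sliceBytes` are hypothetical names, the arrays stand in for the two input vectors, and the model assumes 0 <= origin <= VLENGTH. Result lane `i` is lane `i + origin` of the concatenation of the two inputs, which is the concatenate-and-shift shape that a byte-alignment instruction computes.

```java
import java.util.Arrays;

final class SliceModel {
    // Hypothetical scalar model of a.slice(origin, b) for byte lanes.
    // Assumes both arrays have the same length (the vector length).
    static byte[] sliceBytes(byte[] a, byte[] b, int origin) {
        int vlen = a.length;
        if (b.length != vlen || origin < 0 || origin > vlen) {
            throw new IllegalArgumentException("bad operands for slice model");
        }
        byte[] r = new byte[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;                    // index into the concatenation a:b
            r[i] = (j < vlen) ? a[j] : b[j - vlen];
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] a = {0, 1, 2, 3, 4, 5, 6, 7};
        byte[] b = {10, 11, 12, 13, 14, 15, 16, 17};
        // prints [3, 4, 5, 6, 7, 10, 11, 12]
        System.out.println(Arrays.toString(sliceBytes(a, b, 3)));
    }
}
```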

src/hotspot/share/classfile/vmIntrinsics.hpp line 1178:

> 1176:                                    
> "Ljdk/internal/vm/vector/VectorSupport$Vector;"                               
>                               \
> 1177:                                    
> "Ljdk/internal/vm/vector/VectorSupport$VectorSliceOp;)"                       
>                               \
> 1178:                                    
> "Ljdk/internal/vm/vector/VectorSupport$Vector;")                              
>                        \

Seems this `\` is not aligned?

src/hotspot/share/classfile/vmIntrinsics.hpp line 1179:

> 1177:                                    
> "Ljdk/internal/vm/vector/VectorSupport$VectorSliceOp;)"                       
>                               \
> 1178:                                    
> "Ljdk/internal/vm/vector/VectorSupport$Vector;")                              
>                        \
> 1179:    do_name(vector_slice_name, "sliceOp")                                
>                                                                          \

ditto

test/hotspot/jtreg/compiler/vectorapi/TestSliceOptValueTransforms.java line 45:

> 43:     public static final VectorSpecies<Short> SSP = ShortVector.SPECIES_PREFERRED;
> 44:     public static final VectorSpecies<Integer> ISP = IntVector.SPECIES_PREFERRED;
> 45:     public static final VectorSpecies<Long> LSP = LongVector.SPECIES_PREFERRED;

The implementation supports floating-point types, so why doesn't the test
include them?
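
One way a floating-point test could check its results, sketched under assumptions: since slice only moves lanes, expected and actual values can be compared by raw bit pattern rather than with `==`, so that NaN and -0.0f lanes are compared exactly. `FloatSliceCheck`, `sliceFloats` and `sameBits` are hypothetical names for illustration, not part of the patch or of the existing test.

```java
final class FloatSliceCheck {
    // Hypothetical scalar reference for a.slice(origin, b) over float lanes.
    static float[] sliceFloats(float[] a, float[] b, int origin) {
        int vlen = a.length;
        float[] r = new float[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;
            r[i] = (j < vlen) ? a[j] : b[j - vlen];
        }
        return r;
    }

    // Bitwise comparison: NaN matches NaN with the same bits, and 0.0f differs from -0.0f.
    static boolean sameBits(float[] x, float[] y) {
        if (x.length != y.length) {
            return false;
        }
        for (int i = 0; i < x.length; i++) {
            if (Float.floatToRawIntBits(x[i]) != Float.floatToRawIntBits(y[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[] a = {1.0f, Float.NaN, -0.0f, 4.0f};
        float[] b = {5.0f, 6.0f, 7.0f, 8.0f};
        float[] expected = {Float.NaN, -0.0f, 4.0f, 5.0f};
        System.out.println(sameBits(sliceFloats(a, b, 1), expected)); // prints true
    }
}
```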

test/hotspot/jtreg/compiler/vectorapi/TestSliceOptValueTransforms.java line 122:

> 120:                       .intoArray(bdst, i);
> 121:         }
> 122:     }

Since this optimization also benefits the slice variant with mask, could you 
add some tests for it as well?
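
A masked-slice test needs an expected value to compare against. Below is a minimal scalar reference, assuming the masked variant zeroes every lane whose mask bit is unset and otherwise yields the unmasked slice result. `MaskedSliceModel` and `maskedSlice` are hypothetical names, not part of the patch.

```java
import java.util.Arrays;

final class MaskedSliceModel {
    // Hypothetical scalar reference for a.slice(origin, b, mask) over int lanes.
    // Assumption: lanes with an unset mask bit are zero in the result.
    static int[] maskedSlice(int[] a, int[] b, int origin, boolean[] mask) {
        int vlen = a.length;
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;
            int sliced = (j < vlen) ? a[j] : b[j - vlen];
            r[i] = mask[i] ? sliced : 0;
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {5, 6, 7, 8};
        boolean[] mask = {true, false, true, false};
        // The unmasked slice at origin 2 is [3, 4, 5, 6]; prints [3, 0, 5, 0]
        System.out.println(Arrays.toString(maskedSlice(a, b, 2, mask)));
    }
}
```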

test/micro/org/openjdk/bench/jdk/incubator/vector/VectorSliceBenchmark.java line 59:

> 57:     static final VectorSpecies<Short> sspecies   = ShortVector.SPECIES_PREFERRED;
> 58:     static final VectorSpecies<Integer> ispecies = IntVector.SPECIES_PREFERRED;
> 59:     static final VectorSpecies<Long> lspecies    = LongVector.SPECIES_PREFERRED;

Ditto: no floating-point types?

test/micro/org/openjdk/bench/jdk/incubator/vector/VectorSliceBenchmark.java line 133:

> 131:                       .intoArray(bdst, i);
> 132:         }
> 133:     }

Ditto: could you add a benchmark for the slice variant with mask?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2378092410
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2378093047
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2378310217
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2378337340
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2378312763
PR Review Comment: https://git.openjdk.org/jdk/pull/24104#discussion_r2378342519
