On Thu, 8 Aug 2024 06:57:28 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

> Hi All,
> 
> As per the discussion on the panama-dev mailing list[1], this patch adds
> support for the following new two-vector permutation API.
> 
> 
> Declaration:-
>     Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)
> 
> 
> Semantics:-
>     Using index values stored in the lanes of "this" vector, assemble the 
> values stored in the first (v1) and second (v2) vector arguments. Thus, the
> first and second vectors serve as a table whose elements are selected based
> on the index vector. The API is applicable to all integral and floating-point
> types. The result of this operation is semantically equivalent to the
> expression v1.rearrange(this.toShuffle(), v2). Values held in index vector
> lanes must lie within the valid two-vector index range [0, 2*VLEN), else an
> IndexOutOfBoundsException is thrown.
> 
> Summary of changes:
> -  Java-side implementation of the new selectFrom API.
> -  C2 compiler IR and inline expander changes.
> -  In the absence of a direct two-vector permutation instruction in the target
> ISA, a lowering transformation dismantles the new IR into constituent IR
> supported by the target platform.
> -  Optimized x86 backend implementation for AVX512 and legacy targets.
> -  Functional tests covering the new API.
> 
> The JMH micro included with this patch shows around a 10-15x gain over the
> existing rearrange API:
> Test System: Intel(R) Xeon(R) Platinum 8480+ [Sapphire Rapids Server]
> 
> 
> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762          ops/ms
> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550          ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605          ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004          ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758          ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192          ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459          ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556          ops/ms
> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830          ops/ms
> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174          ops/ms
> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804          ops/ms
> SelectFromBenchmark.selectFromIntVector         2048  thrpt    2   5398.2...

The results look promising. I can provide guidance on the specification; e.g.,
we can specify the behavior in terms of rearrange, with the addition of
throwing on out-of-bounds indexes.
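A scalar sketch of those semantics may help pin the wording down. It assumes each vector is modeled as an int[] of length VLEN (the API itself covers all integral and floating-point lane types), and the class and method names are hypothetical, not part of the Vector API:

```java
import java.util.Arrays;

// Hypothetical scalar model of the proposed two-vector selectFrom:
// result[i] = (v1 followed by v2)[index[i]], throwing on indexes outside [0, 2*VLEN).
final class SelectFromModel {

    static int[] selectFrom(int[] index, int[] v1, int[] v2) {
        int vlen = v1.length;
        if (index.length != vlen || v2.length != vlen) {
            throw new IllegalArgumentException("all vectors must have the same length");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = index[i];
            // Bounds check: the table formed by v1 and v2 has 2*VLEN entries.
            if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException("lane " + i + ": index " + idx);
            }
            // Indexes below VLEN select from v1; the rest select from v2.
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        System.out.println(Arrays.toString(selectFrom(new int[] {0, 5, 2, 7}, v1, v2)));
        // prints [10, 21, 12, 23]
        try {
            selectFrom(new int[] {0, 8, 2, 7}, v1, v2);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
            // prints rejected: lane 1: index 8
        }
    }
}
```

The model is deliberately element-by-element; it describes what a lane ends up holding, not how C2 computes it.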

Regarding the throwing of exceptions, some wider context would help clarify
where we are heading before we finalize the specification. I believe we are
considering changing the default behavior for out-of-bounds indexes from
throwing to wrapping, which would let us avoid bounds checks. If that is the
case, should we wait until that is done and then update the specification,
rather than submitting a CSR just yet?
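To make the trade-off concrete, here is a scalar sketch of what wrapping could mean for the two-vector case. It assumes indexes are reduced with Math.floorMod(index, 2 * VLEN); that modulus is an illustrative assumption, not a settled specification, and the names are hypothetical:

```java
import java.util.Arrays;

// Hypothetical scalar model of a *wrapping* two-vector selectFrom: indexes are
// reduced modulo 2*VLEN instead of being bounds-checked, so no exception is thrown.
// Assumes index, v1 and v2 all have the same length VLEN.
final class WrappingSelectFromModel {

    static int[] selectFromWrapped(int[] index, int[] v1, int[] v2) {
        int vlen = v1.length;
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            // floorMod also maps negative indexes into [0, 2*VLEN), e.g. -1 -> 2*VLEN - 1.
            int idx = Math.floorMod(index[i], 2 * vlen);
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        // -1 wraps to 7 (v2[3]) and 9 wraps to 1 (v1[1]).
        System.out.println(Arrays.toString(selectFromWrapped(new int[] {-1, 9, 3, 4}, v1, v2)));
        // prints [23, 11, 13, 20]
    }
}
```

When VLEN is a power of two, the floorMod reduces to a mask, idx & (2*VLEN - 1), which is the kind of cheap operation that could replace a compare-and-throw bounds check.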

I see you created a specific intrinsic, which avoids the cost of shuffle
creation. Should we apply the same approach (in a subsequent PR) to the
single-argument shuffle? Or perhaps, if we manage to optimize shuffles and
change the default to wrapping, we won't require a specific intrinsic and can
just defer to rearrange?
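For comparison, the data movement of the two-vector select can be expressed with single-vector operations alone: rearrange each table vector with the index reduced modulo VLEN, then blend on whether the original index pointed into the second vector. This is a scalar sketch assuming a power-of-two VLEN and indexes already validated to lie in [0, 2*VLEN); it is one plausible decomposition, not necessarily the lowering the patch performs, and all names are hypothetical:

```java
import java.util.Arrays;

// Hypothetical scalar model: a two-vector select built from two single-vector
// rearranges and a blend. Assumes VLEN is a power of two and every index is
// already known to lie in [0, 2*VLEN).
final class SelectFromDecomposition {

    // Single-vector rearrange: out[i] = src[idx[i]], with idx[i] in [0, VLEN).
    static int[] rearrange(int[] src, int[] idx) {
        int[] out = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            out[i] = src[idx[i]];
        }
        return out;
    }

    // Blend: take b[i] where mask[i] is set, otherwise a[i].
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = mask[i] ? b[i] : a[i];
        }
        return out;
    }

    static int[] selectFrom(int[] index, int[] v1, int[] v2) {
        int vlen = v1.length;
        int[] low = new int[vlen];            // position within a single vector
        boolean[] fromV2 = new boolean[vlen]; // which table vector the lane reads
        for (int i = 0; i < vlen; i++) {
            low[i] = index[i] & (vlen - 1);   // index mod VLEN for power-of-two VLEN
            fromV2[i] = index[i] >= vlen;
        }
        return blend(rearrange(v1, low), rearrange(v2, low), fromV2);
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        System.out.println(Arrays.toString(selectFrom(new int[] {0, 5, 2, 7}, v1, v2)));
        // prints [10, 21, 12, 23]
    }
}
```

This form costs two permutes plus a compare and a blend, whereas targets with a native two-table permute (for example the AVX-512 VPERMI2/VPERMT2 family) can do the selection in one instruction.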

-------------

PR Review: https://git.openjdk.org/jdk/pull/20508#pullrequestreview-2234095541
