On Mon, 16 Sep 2024 02:58:41 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> Hi All,
>> 
>> As per the discussion on panama-dev mailing list[1], patch adds the support 
>> for following new two vector permutation APIs.
>> 
>> 
>> Declaration:-
>>     Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)
>> 
>> 
>> Semantics:-
>>     Using index values stored in the lanes of "this" vector, assemble the 
>> values stored in first (v1) and second (v2) vector arguments. Thus, first 
>> and second vector serves as a table, whose elements are selected based on 
>> index value vector. API is applicable to all integral and floating-point 
>> types.  The result of this operation is semantically equivalent to 
>> expression v1.rearrange(this.toShuffle(), v2). Values held in index vector 
>> lanes must lie within valid two vector index range [0, 2*VLEN) else an 
>> IndexOutOfBoundException is thrown.  
>> 
>> Summary of changes:
>> -  Java side implementation of new selectFrom API.
>> -  C2 compiler IR and inline expander changes.
>> -  In absence of direct two vector permutation instruction in target ISA, a 
>> lowering transformation dismantles new IR into constituent IR supported by 
>> target platforms. 
>> -  Optimized x86 backend implementation for AVX512 and legacy target.
>> -  Function tests covering new API.
>> 
>> JMH micro included with this patch shows around 10-15x gain over existing 
>> rearrange API :-
>> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server]
>> 
>> 
>>   Benchmark                                     (size)   Mode  Cnt      
>> Score   Error   Units
>> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762  
>>         ops/ms
>> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550  
>>         ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605  
>>         ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004  
>>         ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758  
>>         ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192  
>>         ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459  
>>         ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556  
>>         ops/ms
>> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830  
>>         ops/ms
>> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174  
>>         ops/ms
>> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804  
>>         ops/ms
>> S...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Disabling VectorLoadShuffle bypassing optimization to comply with rearrange 
> semantics at IR level.

Looks better, I still have a few comments.

src/hotspot/share/opto/vectorIntrinsics.cpp line 2739:

> 2737:   return true;
> 2738: }
> 2739: 

@jatin-bhateja You still have 3x `unbox failed v1` here. I already commented 
this earlier, and you resolved it and gave it a thumbs up 🙈 
Can you please fix it now?

src/hotspot/share/opto/vectornode.cpp line 2116:

> 2114:   const TypeVect* index_vect_type = index_vec->bottom_type()->is_vect();
> 2115:   BasicType index_elem_bt = index_vect_type->element_basic_type();
> 2116:   assert(!is_floating_point_type(index_elem_bt), "");

Why not verify this also in the constructor of `SelectFromTwoVectorNode`? Can 
you maybe explicitly verify what it must be rather than **not** be?

src/hotspot/share/opto/vectornode.cpp line 2122:

> 2120:   // index format by subsequent VectorLoadShuffle.
> 2121:   int cast_vopc = VectorCastNode::opcode(0, index_elem_bt, true);
> 2122:   Node* index_byte_vec = 
> phase->transform(VectorCastNode::make(cast_vopc, index_vec, T_BYTE, 
> num_elem));

This cast assumes that the indices cannot have more than 8 bits. This would 
allow vector lengths of up to 256. This is fine for intel. But as far as I know 
ARM has in principle longer vectors - up to 2048 bytes. Should we maybe add 
some assert here to make sure we never badly truncate the index?

src/hotspot/share/opto/vectornode.cpp line 2138:

> 2136: 
> 2137:   // Load indexes from byte vector and appropriatly massage them to 
> target specific
> 2138:   // permutation index format.

I would replace `massage` -> `transform` everywhere.

src/hotspot/share/opto/vectornode.hpp line 1625:

> 1623:   Node* Ideal(PhaseGVN* phase, bool can_reshape);
> 1624:   virtual int Opcode() const;
> 1625: };

`index` -> `indexes` because this is a vector, right? Otherwise I'll assume it 
is a scalar.
Can you do some pseudo-code, that says how exactly the indices are interpreted? 
What if they are out of bounds? Does it wrap? Or assume they are in bounds? 
Undefined behaviour?

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/20508#pullrequestreview-2305905569
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760651336
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760674461
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760678107
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760680772
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760665944

Reply via email to