On Mon, 16 Sep 2024 02:58:41 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi All, >> >> As per the discussion on panama-dev mailing list[1], patch adds the support >> for following new two vector permutation APIs. >> >> >> Declaration:- >> Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2) >> >> >> Semantics:- >> Using index values stored in the lanes of "this" vector, assemble the >> values stored in first (v1) and second (v2) vector arguments. Thus, first >> and second vector serves as a table, whose elements are selected based on >> index value vector. API is applicable to all integral and floating-point >> types. The result of this operation is semantically equivalent to >> expression v1.rearrange(this.toShuffle(), v2). Values held in index vector >> lanes must lie within valid two vector index range [0, 2*VLEN) else an >> IndexOutOfBoundException is thrown. >> >> Summary of changes: >> - Java side implementation of new selectFrom API. >> - C2 compiler IR and inline expander changes. >> - In absence of direct two vector permutation instruction in target ISA, a >> lowering transformation dismantles new IR into constituent IR supported by >> target platforms. >> - Optimized x86 backend implementation for AVX512 and legacy target. >> - Function tests covering new API. >> >> JMH micro included with this patch shows around 10-15x gain over existing >> rearrange API :- >> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] >> >> >> Benchmark (size) Mode Cnt >> Score Error Units >> SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 >> ops/ms >> SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 >> ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 >> ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 >> ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 >> ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 >> ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 >> ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 >> ops/ms >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 >> ops/ms >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 >> ops/ms >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 >> ops/ms >> S... > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > Disabling VectorLoadShuffle bypassing optimization to comply with rearrange > semantics at IR level. Looks better, I still have a few comments. src/hotspot/share/opto/vectorIntrinsics.cpp line 2739: > 2737: return true; > 2738: } > 2739: @jatin-bhateja You still have 3x `unbox failed v1` here. I already commented this earlier, and you resolved it and gave it a thumbs up 🙈 Can you please fix it now? src/hotspot/share/opto/vectornode.cpp line 2116: > 2114: const TypeVect* index_vect_type = index_vec->bottom_type()->is_vect(); > 2115: BasicType index_elem_bt = index_vect_type->element_basic_type(); > 2116: assert(!is_floating_point_type(index_elem_bt), ""); Why not verify this also in the constructor of `SelectFromTwoVectorNode`? Can you maybe explicitly verify what it must be rather than **not** be? src/hotspot/share/opto/vectornode.cpp line 2122: > 2120: // index format by subsequent VectorLoadShuffle. > 2121: int cast_vopc = VectorCastNode::opcode(0, index_elem_bt, true); > 2122: Node* index_byte_vec = > phase->transform(VectorCastNode::make(cast_vopc, index_vec, T_BYTE, > num_elem)); This cast assumes that the indices cannot have more than 8 bits. This would allow vector lengths of up to 256. This is fine for intel. But as far as I know ARM has in principle longer vectors - up to 2048 bytes. Should we maybe add some assert here to make sure we never badly truncate the index? src/hotspot/share/opto/vectornode.cpp line 2138: > 2136: > 2137: // Load indexes from byte vector and appropriatly massage them to > target specific > 2138: // permutation index format. I would replace `massage` -> `transform` everywhere. src/hotspot/share/opto/vectornode.hpp line 1625: > 1623: Node* Ideal(PhaseGVN* phase, bool can_reshape); > 1624: virtual int Opcode() const; > 1625: }; `index` -> `indexes` because this is a vector, right? Otherwise I'll assume it is a scalar. Can you do some pseudo-code, that says how exactly the indices are interpreted? What if they are out of bounds? Does it wrap? Or assume they are in bounds? Undefined behaviour? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20508#pullrequestreview-2305905569 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760651336 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760674461 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760678107 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760680772 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760665944