On Tue, 17 Sep 2024 16:13:55 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:
> Hi, > > This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly just > the revert of the backout. > > Regarding the related issues: > > - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and > [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed > before the backout. > - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to > missing `ForceInline` on `AbstractVector::toBitsVectorTemplate` > - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not been > able to find the root causes. I'm not sure if this is a blocker, now I cannot > even build x86-32 tests. > > Finally, I moved some implementation of public methods and methods that call > into intrinsics to the concrete class as that may help the compiler know the > correct types of the variables. > > Please take a look and leave reviews. Thanks a lot. > > The description of the original PR: > > This patch reimplements `VectorShuffle` implementations to be a vector of the > bit type. Currently, `VectorShuffle` is stored as a byte array, and would be > expanded upon usage. This poses several drawbacks: > > Inefficient conversions between a shuffle and its corresponding vector. This > hinders the performance when the shuffle indices are not constant and are > loaded or computed dynamically. > Redundant expansions in `rearrange` operations. On all platforms, it seems > that a shuffle index vector is always expanded to the correct type before > executing the `rearrange` operations. > Some redundant intrinsics are needed to support this handling as well as > special considerations in the C2 compiler. > Range checks are performed using `VectorShuffle::toVector`, which is > inefficient for FP types since both FP conversions and FP comparisons are > more expensive than the integral ones. > Upon these changes, a `rearrange` can emit more efficient code: > > var species = IntVector.SPECIES_128; > var v1 = IntVector.fromArray(species, SRC1, 0); > var v2 = IntVector.fromArray(species, SRC2, 0); > v1.rearrange(v2.toShuffle()).intoArray(DST, 0); > > Before: > movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} > vmovdqu 0x10(%r10),%xmm2 > movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} > vmovdqu 0x10(%r10),%xmm0 > vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask > ; {ex... > Got it, I think #20508 and this PR are unrelated implementation-wise, though. It would be nice if we can move independently of #20508 as that may take longer to integrate because of API/CSR review. > > @jatin-bhateja What do you think of using this patch and intrinsifing > `Vector<E>::rearrange(VectorShuffle<E>, Vector<E>)` instead of introducing > the 2 vector `selectFrom` API? IMO the two-vector `selectFrom` API is complementary to the existing single-vector `selectFrom`, and both have equivalent `rearrange` expressions. For either use we should ideally get to the point that a similar/identical optimal instruction sequence is generated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21042#issuecomment-2362388082