On Fri, 6 Dec 2024 17:00:15 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:
>> Hi, >> >> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly >> just the revert of the backout. >> >> Regarding the related issues: >> >> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and >> [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed >> before the backout. >> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to >> missing `ForceInline` on `AbstractVector::toBitsVectorTemplate` >> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not >> been able to find the root causes. I'm not sure if this is a blocker, now I >> cannot even build x86-32 tests. >> >> Finally, I moved some implementation of public methods and methods that call >> into intrinsics to the concrete class as that may help the compiler know the >> correct types of the variables. >> >> Please take a look and leave reviews. Thanks a lot. >> >> The description of the original PR: >> >> This patch reimplements `VectorShuffle` implementations to be a vector of >> the bit type. Currently, `VectorShuffle` is stored as a byte array, and >> would be expanded upon usage. This poses several drawbacks: >> >> Inefficient conversions between a shuffle and its corresponding vector. This >> hinders the performance when the shuffle indices are not constant and are >> loaded or computed dynamically. >> Redundant expansions in `rearrange` operations. On all platforms, it seems >> that a shuffle index vector is always expanded to the correct type before >> executing the `rearrange` operations. >> Some redundant intrinsics are needed to support this handling as well as >> special considerations in the C2 compiler. >> Range checks are performed using `VectorShuffle::toVector`, which is >> inefficient for FP types since both FP conversions and FP comparisons are >> more expensive than the integral ones. >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byt... > > Quan Anh Mai has updated the pull request incrementally with one additional > commit since the last revision: > > add comment, extract cast into local variable @merykitty looks quite reasonable, though I only looked at the VM changes, only scanned the Java library code. src/hotspot/cpu/x86/x86.ad line 2215: > 2213: > 2214: // Return true if Vector::rearrange needs preparation of the shuffle > argument > 2215: bool Matcher::vector_needs_load_shuffle(BasicType elem_bt, int vlen) { I commented on this before. This needs to have a more expressive name. Is it just about rearrange? Because now it sounds like maybe all vectors may need a shuffle. Or just all loads? Confusing. src/hotspot/share/opto/vectorIntrinsics.cpp line 1757: > 1755: > 1756: int num_elem = vlen->get_con(); > 1757: bool need_load_shuffle = > Matcher::vector_needs_load_shuffle(shuffle_bt, num_elem); Maybe it could be renamed to `vector_rearrange_requires_load_shuffle`? src/hotspot/share/opto/vectorIntrinsics.cpp line 1800: > 1798: Node* v1 = unbox_vector(argument(5), vbox_type, elem_bt, num_elem); > 1799: Node* shuffle = unbox_vector(argument(6), shbox_type, shuffle_bt, > num_elem); > 1800: const TypeVect* vt = TypeVect::make(elem_bt, num_elem); Is this one used? src/hotspot/share/opto/vectornode.hpp line 1694: > 1692: // we can transform the rearrange into a different element type. For > example, on x86 before AVX512, > 1693: // there is no rearrange instruction for short elements, what we will > then do is to transform the > 1694: // shuffle vector into 1 that we can do byte rearrange such that it > would provide the same result. Suggestion: // shuffle vector into one that we can do byte rearrange such that it would provide the same result. src/hotspot/share/opto/vectornode.hpp line 1696: > 1694: // shuffle vector into 1 that we can do byte rearrange such that it > would provide the same result. > 1695: // This can be done in VectorRearrangeNode during code emission but we > eagerly expand out this > 1696: // because it is often the case that an index vector is reused in many > rearrange operations. Thanks for this explanation! `This can be done in VectorRearrangeNode` -> **could have** be done, because we now don't do it, right? What do you mean by `expand out this`? Do you mean we have a separate dedicated node, so that it could possibly fold away with other nodes, such as index vector? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21042#pullrequestreview-2487987908 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875503753 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875517727 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875519134 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875537761 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875540023