On Sun, 6 Oct 2024 08:32:20 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:
>> Hi, >> >> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly >> just the revert of the backout. >> >> Regarding the related issues: >> >> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and >> [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed >> before the backout. >> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to >> missing `ForceInline` on `AbstractVector::toBitsVectorTemplate` >> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not >> been able to find the root causes. I'm not sure if this is a blocker, now I >> cannot even build x86-32 tests. >> >> Finally, I moved some implementation of public methods and methods that call >> into intrinsics to the concrete class as that may help the compiler know the >> correct types of the variables. >> >> Please take a look and leave reviews. Thanks a lot. >> >> The description of the original PR: >> >> This patch reimplements `VectorShuffle` implementations to be a vector of >> the bit type. Currently, `VectorShuffle` is stored as a byte array, and >> would be expanded upon usage. This poses several drawbacks: >> >> Inefficient conversions between a shuffle and its corresponding vector. This >> hinders the performance when the shuffle indices are not constant and are >> loaded or computed dynamically. >> Redundant expansions in `rearrange` operations. On all platforms, it seems >> that a shuffle index vector is always expanded to the correct type before >> executing the `rearrange` operations. >> Some redundant intrinsics are needed to support this handling as well as >> special considerations in the C2 compiler. >> Range checks are performed using `VectorShuffle::toVector`, which is >> inefficient for FP types since both FP conversions and FP comparisons are >> more expensive than the integral ones. >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byt... > > Quan Anh Mai has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains one commit: > > [vectorapi] Refactor VectorShuffle implementation This is a nice change. It's as if `VectorShuffle<E>` has a payload of `Vector<F>` of the same shape as the shuffle and where` F` is the bit size equivalent integral type of `E`, and where the lane elements of the vector are constrained to be within `[-VLENGTH, VLENGTH-1]` (I do wonder if we might be able to refactor towards that more explicit representation later on with Valhalla.) That simplifies things and opens up more optimizations and complements the modifications we recently did to `rearrange`/`selectFrom`. (Recommend you do a merge with master to get latest Vector API changes just in case there is some impact.) src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java line 228: > 226: } > 227: > 228: AbstractVector<?> iota = vspecies().asIntegral().iota(); I suspect the non-power of two code is more efficient. (Even better if the MUL could be transformed to a shift for power of two values.) Separately, it makes me wonder if we should revisit the shuffle factories if it is now much more efficient to construct a shuffle from a vector. src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int256Vector.java line 870: > 868: @Override > 869: public final Int256Shuffle rearrange(VectorShuffle<Integer> > shuffle) { > 870: return (Int256Shuffle) > toBitsVector().rearrange(((Int256Shuffle) shuffle) I think the cast is redundant for all vector kinds. Similarly the explicit cast is redundant for the integral vectors, perhaps in the template separate out the expressions to avoid it where not needed? We could also refer to `VSPECIES` directly rather than calling `vspecies()`, same applies in other methods in the concrete vector classes. src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int256Vector.java line 908: > 906: } > 907: > 908: private static boolean indicesInRange(int[] indices) { Since this method is only called from an assert statement in the constructor we could avoid the clever checking that assertions are enabled and the explicit throwing on an AssertionError by using a second expression that produces an error message when the assertion fails : e.g., assert indicesInRange(indices) : outOfBoundsAssertMessage(indices); src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java line 2392: > 2390: this, shuffle, null, > 2391: (v1, s_, m_) -> v1.uOp((i, a) -> { > 2392: int ei = Integer.remainderUnsigned(s_.laneSource(i), > v1.length()); Note to self - the intrinsic performs the wrapping of shuffle values using bitwise AND. Nice use of method (equiv to `Math.floorMod` for the range on input arguments). src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java line 2473: > 2471: final <F> > 2472: VectorShuffle<F> toShuffle(AbstractSpecies<F> dsp, boolean wrap) { > 2473: assert(dsp.elementSize() == vspecies().elementSize()); Even though we force inline I cannot quite decide if it is better to forego the assert since it unduly increases method size. Regardless it may be useful to place the partial wrapping logic in a separate method, given it is less likely to be used. src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template line 1150: > 1148: @Override > 1149: @ForceInline > 1150: public void intoArray(int[] a, int offset) { Separately, we might consider optimizing `shuffleFromArray`. ------------- PR Review: https://git.openjdk.org/jdk/pull/21042#pullrequestreview-2402948659 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821489034 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821471669 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821478372 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821450485 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821456333 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821501842