On Sun, 6 Oct 2024 08:32:20 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:

>> Hi,
>> 
>> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly 
>> just the revert of the backout.
>> 
>> Regarding the related issues:
>> 
>> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and 
>> [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed 
>> before the backout.
>> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to 
>> missing `ForceInline` on `AbstractVector::toBitsVectorTemplate`
>> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not 
>> been able to find the root causes. I'm not sure if this is a blocker, now I 
>> cannot even build x86-32 tests.
>> 
>> Finally, I moved some implementation of public methods and methods that call 
>> into intrinsics to the concrete class as that may help the compiler know the 
>> correct types of the variables.
>> 
>> Please take a look and leave reviews. Thanks a lot.
>> 
>> The description of the original PR:
>> 
>> This patch reimplements `VectorShuffle` implementations to be a vector of 
>> the bit type. Currently, `VectorShuffle` is stored as a byte array, and 
>> would be expanded upon usage. This poses several drawbacks:
>> 
>> Inefficient conversions between a shuffle and its corresponding vector. This 
>> hinders the performance when the shuffle indices are not constant and are 
>> loaded or computed dynamically.
>> Redundant expansions in `rearrange` operations. On all platforms, it seems 
>> that a shuffle index vector is always expanded to the correct type before 
>> executing the `rearrange` operations.
>> Some redundant intrinsics are needed to support this handling as well as 
>> special considerations in the C2 compiler.
>> Range checks are performed using `VectorShuffle::toVector`, which is 
>> inefficient for FP types since both FP conversions and FP comparisons are 
>> more expensive than the integral ones.
>> Upon these changes, a `rearrange` can emit more efficient code:
>> 
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>> 
>>     Before:
>>     movabs $0x751589fa8,%r10            ;   {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10            ;   {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10            ;   {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byt...
>
> Quan Anh Mai has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains one commit:
> 
>   [vectorapi] Refactor VectorShuffle implementation

This is a nice change. It's as if `VectorShuffle<E>` has a payload of 
`Vector<F>` of the same shape as the shuffle and where` F` is the bit size 
equivalent integral type of `E`, and where the lane elements of the vector are 
constrained to be within `[-VLENGTH, VLENGTH-1]` (I do wonder if we might be 
able to refactor towards that more explicit representation later on with 
Valhalla.)

That simplifies things and opens up more optimizations and complements the 
modifications we recently did to `rearrange`/`selectFrom`.

(Recommend you do a merge with master to get latest Vector API changes just in 
case there is some impact.)

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java 
line 228:

> 226:         }
> 227: 
> 228:         AbstractVector<?> iota = vspecies().asIntegral().iota();

I suspect the non-power of two code is more efficient. (Even better if the MUL 
could be transformed to a shift for power of two values.)

Separately, it makes me wonder if we should revisit the shuffle factories if it 
is now much more efficient to construct a shuffle from a vector.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int256Vector.java 
line 870:

> 868:         @Override
> 869:         public final Int256Shuffle rearrange(VectorShuffle<Integer> 
> shuffle) {
> 870:             return (Int256Shuffle) 
> toBitsVector().rearrange(((Int256Shuffle) shuffle)

I think the cast is redundant for all vector kinds. Similarly the explicit cast 
is redundant for the integral vectors, perhaps in the template separate out the 
expressions to avoid it where not needed?

We could also refer to `VSPECIES` directly rather than calling `vspecies()`, 
same applies in other methods in the concrete vector classes.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int256Vector.java 
line 908:

> 906:         }
> 907: 
> 908:         private static boolean indicesInRange(int[] indices) {

Since this method is only called from an assert statement in the constructor we 
could avoid the clever checking that assertions are enabled and the explicit 
throwing on an AssertionError by using a second expression that produces an 
error message when the assertion fails : e.g.,

            assert indicesInRange(indices) : outOfBoundsAssertMessage(indices);

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java line 
2392:

> 2390:             this, shuffle, null,
> 2391:             (v1, s_, m_) -> v1.uOp((i, a) -> {
> 2392:                 int ei = Integer.remainderUnsigned(s_.laneSource(i), 
> v1.length());

Note to self - the intrinsic performs the wrapping of shuffle values using 
bitwise AND.
Nice use of method (equiv to `Math.floorMod` for the range on input arguments).

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java line 
2473:

> 2471:     final <F>
> 2472:     VectorShuffle<F> toShuffle(AbstractSpecies<F> dsp, boolean wrap) {
> 2473:         assert(dsp.elementSize() == vspecies().elementSize());

Even though we force inline I cannot quite decide if it is better to forego the 
assert since it unduly increases method size. Regardless it may be useful to 
place the partial wrapping logic in a separate method, given it is less likely 
to be used.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template
 line 1150:

> 1148:         @Override
> 1149:         @ForceInline
> 1150:         public void intoArray(int[] a, int offset) {

Separately, we might consider optimizing `shuffleFromArray`.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21042#pullrequestreview-2402948659
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821489034
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821471669
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821478372
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821450485
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821456333
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821501842

Reply via email to