Re: RFR: 8310691: [REDO] [vectorapi] Refactor VectorShuffle implementation [v9]

Jatin Bhateja Tue, 10 Dec 2024 00:49:20 -0800

On Tue, 10 Dec 2024 08:16:19 GMT, Quan Anh Mai <[email protected]> wrote:


>> Hi,
>> 
>> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly 
>> just the revert of the backout.
>> 
>> Regarding the related issues:
>> 
>> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and 
>> [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed 
>> before the backout.
>> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to 
>> missing `ForceInline` on `AbstractVector::toBitsVectorTemplate`
>> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not 
>> been able to find the root causes. I'm not sure if this is a blocker, now I 
>> cannot even build x86-32 tests.
>> 
>> Finally, I moved some implementation of public methods and methods that call 
>> into intrinsics to the concrete class as that may help the compiler know the 
>> correct types of the variables.
>> 
>> Please take a look and leave reviews. Thanks a lot.
>> 
>> The description of the original PR:
>> 
>> This patch reimplements `VectorShuffle` implementations to be a vector of 
>> the bit type. Currently, `VectorShuffle` is stored as a byte array, and 
>> would be expanded upon usage. This poses several drawbacks:
>> 
>> Inefficient conversions between a shuffle and its corresponding vector. This 
>> hinders the performance when the shuffle indices are not constant and are 
>> loaded or computed dynamically.
>> Redundant expansions in `rearrange` operations. On all platforms, it seems 
>> that a shuffle index vector is always expanded to the correct type before 
>> executing the `rearrange` operations.
>> Some redundant intrinsics are needed to support this handling as well as 
>> special considerations in the C2 compiler.
>> Range checks are performed using `VectorShuffle::toVector`, which is 
>> inefficient for FP types since both FP conversions and FP comparisons are 
>> more expensive than the integral ones.
>> Upon these changes, a `rearrange` can emit more efficient code:
>> 
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>> 
>>     Before:
>>     movabs $0x751589fa8,%r10            ;   {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10            ;   {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10            ;   {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byt...
>
> Quan Anh Mai has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Change wording on VectorLoadShuffleNode
>   
>   Co-authored-by: Jatin Bhateja <[email protected]>

I am observing some performance drops in slice / unslice benchmarks,  I have 
just completed this run and not got a chance to root cause.
![image](https://github.com/user-attachments/assets/ecb686b6-c3b0-47e9-9325-40341c19ff9d)

Can you kindly verify once at your end.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21042#issuecomment-2530801116

Re: RFR: 8310691: [REDO] [vectorapi] Refactor VectorShuffle implementation [v9]

Reply via email to