On Sun, 21 Jan 2024 06:55:43 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and 
>> AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially 
>> partially unrolls scalar loop to accumulates values from gather indices into 
>> a quadword(64bit) slice followed by vector permutation to place the slice 
>> into appropriate vector lanes, it prevents code bloating and generates 
>> compact JIT sequence. This coupled with savings from expansive array 
>> allocation in existing java implementation translates into significant 
>> performance of 1.5-10x gains with included micro on Intel Atom family CPUs 
>> and with JVM option UseAVX=2.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) For AVX512 targets algorithm uses integral gather instructions to load 
>> values from normalized indices which are multiple of integer size, followed 
>> by shuffling and packing exact sub-word values from integral lanes. 
>> 
>> 3) Patch was also compared against modified java fallback implementation by 
>> replacing temporary array allocation with zero initialized vector and a 
>> scalar loops which inserts gathered values into vector. But, vector insert 
>> operation in higher vector lanes is a three step process which first 
>> extracts the upper vector 128 bit lane, updates it with gather subword value 
>> and then inserts the lane back to its original position. This makes inserts 
>> into higher order lanes costly w.r.t to proposed solution. In addition 
>> generated JIT code for modified fallback implementation was very bulky. This 
>> may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Review comments resolutions.

src/hotspot/cpu/x86/assembler_x86.cpp line 3297:

> 3295: }
> 3296: 
> 3297: // Move Unaligned EVEX enabled Vector (programmable : 8,16,32,64)

The  (programmable : 8,16,32,64) part of the comment could be removed. This is 
not something special here for this instruction. Or at the minimum we should 
say "programmable vector length".

src/hotspot/cpu/x86/assembler_x86.cpp line 13587:

> 13585: }
> 13586: 
> 13587: void Assembler::bt(Register dst, Register src) {

We could name it btq() as it is a 64 bit instruction.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1574:

> 1572: void C2_MacroAssembler::vpackI2X(BasicType elem_bt, XMMRegister dst,
> 1573:                                  XMMRegister ones, XMMRegister xtmp,
> 1574:                                  int vlen_enc) {

The ones and xtmp argument is not used in vpackI2X?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1603:

> 1601:     bitlevel_offset_shift = 3;
> 1602:     nomlarized_index_shift = 2;
> 1603:   }

It takes a lot of effort to understand the logic here. Good to have a comment 
here that we are gathering 32 bit aligned elements first and then extracting 
the required subword from that.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613:

> 1611:   vpand(xtmp, idx_vec, xtmp, vlen_enc);
> 1612:   // Load double words from normalized indices.
> 1613:   evpgatherdd(dst, gmask, Address(base, xtmp, scale), vlen_enc);

Could we not do here directly:
evpgatherdd(dst, gmask, Address(base, idx_vec, scale), vlen_enc);
Then we dont need lines 1609-1611 and also 1616-1621 as well.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613:

> 1611:   vpand(xtmp, idx_vec, xtmp, vlen_enc);
> 1612:   // Load double words from normalized indices.
> 1613:   evpgatherdd(dst, gmask, Address(base, xtmp, scale), vlen_enc);

Another question, looks to me that we could read beyond the allocated memory 
for the array here. e.g. consider the following case:
* It is a byte gather
* The byte source array is of size 41, i.e. only indices 0-40 are valid
* The gather index is 40

Then as part of evpgatherdd we would be reading bytes at 40-43 offset from 
source array.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1622:

> 1620:   // 16 bits(for short)/8 bits(for byte) of each double word lane.
> 1621:   vpsrlvd(dst, dst, xtmp, vlen_enc);
> 1622:   // Pack double word vector into short vector.

Pack double word vector into short/byte vector.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473298053
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473305137
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473311064
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473484078
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473486672
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473488519
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473464504

Reply via email to