On Sun, 21 Jan 2024 06:55:43 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >> partially unrolls scalar loop to accumulates values from gather indices into >> a quadword(64bit) slice followed by vector permutation to place the slice >> into appropriate vector lanes, it prevents code bloating and generates >> compact JIT sequence. This coupled with savings from expansive array >> allocation in existing java implementation translates into significant >> performance of 1.5-10x gains with included micro on Intel Atom family CPUs >> and with JVM option UseAVX=2. >> >>  >> >> >> 2) For AVX512 targets algorithm uses integral gather instructions to load >> values from normalized indices which are multiple of integer size, followed >> by shuffling and packing exact sub-word values from integral lanes. >> >> 3) Patch was also compared against modified java fallback implementation by >> replacing temporary array allocation with zero initialized vector and a >> scalar loops which inserts gathered values into vector. But, vector insert >> operation in higher vector lanes is a three step process which first >> extracts the upper vector 128 bit lane, updates it with gather subword value >> and then inserts the lane back to its original position. This makes inserts >> into higher order lanes costly w.r.t to proposed solution. In addition >> generated JIT code for modified fallback implementation was very bulky. This >> may impact in-lining decisions into caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > Review comments resolutions. src/hotspot/cpu/x86/assembler_x86.cpp line 3297: > 3295: } > 3296: > 3297: // Move Unaligned EVEX enabled Vector (programmable : 8,16,32,64) The (programmable : 8,16,32,64) part of the comment could be removed. This is not something special here for this instruction. Or at the minimum we should say "programmable vector length". src/hotspot/cpu/x86/assembler_x86.cpp line 13587: > 13585: } > 13586: > 13587: void Assembler::bt(Register dst, Register src) { We could name it btq() as it is a 64 bit instruction. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1574: > 1572: void C2_MacroAssembler::vpackI2X(BasicType elem_bt, XMMRegister dst, > 1573: XMMRegister ones, XMMRegister xtmp, > 1574: int vlen_enc) { The ones and xtmp argument is not used in vpackI2X? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1603: > 1601: bitlevel_offset_shift = 3; > 1602: nomlarized_index_shift = 2; > 1603: } It takes a lot of effort to understand the logic here. Good to have a comment here that we are gathering 32 bit aligned elements first and then extracting the required subword from that. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613: > 1611: vpand(xtmp, idx_vec, xtmp, vlen_enc); > 1612: // Load double words from normalized indices. > 1613: evpgatherdd(dst, gmask, Address(base, xtmp, scale), vlen_enc); Could we not do here directly: evpgatherdd(dst, gmask, Address(base, idx_vec, scale), vlen_enc); Then we dont need lines 1609-1611 and also 1616-1621 as well. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613: > 1611: vpand(xtmp, idx_vec, xtmp, vlen_enc); > 1612: // Load double words from normalized indices. > 1613: evpgatherdd(dst, gmask, Address(base, xtmp, scale), vlen_enc); Another question, looks to me that we could read beyond the allocated memory for the array here. e.g. consider the following case: * It is a byte gather * The byte source array is of size 41, i.e. only indices 0-40 are valid * The gather index is 40 Then as part of evpgatherdd we would be reading bytes at 40-43 offset from source array. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1622: > 1620: // 16 bits(for short)/8 bits(for byte) of each double word lane. > 1621: vpsrlvd(dst, dst, xtmp, vlen_enc); > 1622: // Pack double word vector into short vector. Pack double word vector into short/byte vector. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473298053 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473305137 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473311064 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473484078 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473486672 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473488519 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473464504