On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> Hi All,
>>
>> This patch optimizes the sub-word gather operation for x86 targets with
>> AVX2 and AVX512 features.
>>
>> Following is the summary of changes:
>>
>> 1) Intrinsify sub-word gather with a high-performance backend
>>    implementation based on a hybrid algorithm. It first partially unrolls
>>    a scalar loop to accumulate values from the gather indices into a
>>    quadword (64-bit) slice, then uses a vector permutation to place the
>>    slice into the appropriate vector lanes. This prevents code bloat and
>>    generates a compact JIT sequence. Coupled with the savings from the
>>    expensive array allocation in the existing Java implementation, this
>>    translates into significant performance gains of 1.3-5x with the
>>    included micro.
>>
>> 2) The patch was also compared against a modified Java fallback
>>    implementation that replaces the temporary array allocation with a
>>    zero-initialized vector and a scalar loop which inserts the gathered
>>    values into the vector. However, a vector insert into the higher
>>    vector lanes is a three-step process: it first extracts the upper
>>    128-bit lane, updates it with the gathered sub-word value, and then
>>    inserts the lane back into its original position. This makes inserts
>>    into higher-order lanes costly compared to the proposed solution. In
>>    addition, the JIT code generated for the modified fallback
>>    implementation was very bulky, which may impact inlining decisions in
>>    caller contexts.
>>
>> 3) Some minor adjustments to the existing gather instruction patterns for
>>    double/quad words.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Fix incorrect comment

> BTW, I have two questions:
>
> 1. An intrinsic that accepts a vector as the index, like non-subword
>    gather, would be more beneficial in real applications. See:
>    [8287289: Gather/Scatter with Index Vector panama-vector#201](https://github.com/openjdk/panama-vector/pull/201)
>    please.
> 2. Do you have a plan to add such an optimization for sub-word scatter in
>    the future?
>
> Thanks, Xiaohong

I agree, the proposal looks reasonable to me. However, since the x86 ISA has
no direct sub-word gather instruction, we will always need to pass the index
array to the inline expander. The existing interface already allows passing
both the index array and the index vector.

For scatter we may not benefit much, as the intent of this patch was to
intrinsify sub-word gather, align it with the non-subword gather
implementation, and save the extra allocations in the existing Java
implementation.

Best Regards.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16354#issuecomment-1815796362
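
For readers following the thread, the hybrid scheme from item 1 of the quoted summary can be illustrated with a scalar Java model. This is only a sketch under stated assumptions, not the HotSpot intrinsic, which emits x86 machine code directly. The class and method names (`SubwordGatherSketch`, `gatherSlice`, `gatherBytes`, `lane`) are hypothetical, a `long[]` stands in for the vector register, and storing a slice into `vec[s]` stands in for the vector permutation step.

```java
// Scalar model of a byte gather: pack 8 gathered bytes into a 64-bit slice,
// then place each slice into its position in the emulated vector.
final class SubwordGatherSketch {

    // Accumulates up to 8 gathered bytes into one quadword slice.
    // Byte i of the result holds src[base + idx[idxOff + i]].
    static long gatherSlice(byte[] src, int base, int[] idx, int idxOff, int count) {
        long slice = 0L;
        for (int i = 0; i < count; i++) {
            long b = src[base + idx[idxOff + i]] & 0xFFL; // mask to avoid sign extension
            slice |= b << (8 * i);
        }
        return slice;
    }

    // Emulates a vector of `lanes` bytes as an array of 64-bit slices.
    static long[] gatherBytes(byte[] src, int base, int[] idx, int lanes) {
        long[] vec = new long[(lanes + 7) / 8];
        for (int s = 0; s < vec.length; s++) {
            int count = Math.min(8, lanes - 8 * s);
            // Assumption: this store models the permute that moves the slice
            // into the appropriate vector lanes.
            vec[s] = gatherSlice(src, base, idx, 8 * s, count);
        }
        return vec;
    }

    // Reads byte lane i back out of the emulated vector.
    static byte lane(long[] vec, int i) {
        return (byte) (vec[i / 8] >>> (8 * (i % 8)));
    }
}
```

Each byte lane is written once into a slice and each slice is placed once, which is the contrast with the per-element extract, update, and insert sequence described for the fallback in item 2.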