On Thu, 1 Feb 2024 16:24:16 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >> partially unrolls scalar loop to accumulates values from gather indices into >> a quadword(64bit) slice followed by vector permutation to place the slice >> into appropriate vector lanes, it prevents code bloating and generates >> compact JIT sequence. This coupled with savings from expansive array >> allocation in existing java implementation translates into significant >> performance of 1.5-10x gains with included micro. >> >>  >> >> >> 2) Patch was also compared against modified java fallback implementation by >> replacing temporary array allocation with zero initialized vector and a >> scalar loops which inserts gathered values into vector. But, vector insert >> operation in higher vector lanes is a three step process which first >> extracts the upper vector 128 bit lane, updates it with gather subword value >> and then inserts the lane back to its original position. This makes inserts >> into higher order lanes costly w.r.t to proposed solution. In addition >> generated JIT code for modified fallback implementation was very bulky. This >> may impact in-lining decisions into caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 10 commits: > > - Generalizing masked sub-gather support. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650 > - Fix incorrect comment > - Review comments resolutions. > - Review comments resolutions. > - Review comments resolutions. > - Restricting masked sub-word gather to AVX512 target to align with integral > gather support. > - Review comments resolution. > - 8318650: Optimized subword gather for x86 targets. I only have two comments, rest of the PR looks good to me. Thanks for taking a conservative stance and reverting to a simple solution for now. src/hotspot/cpu/x86/x86.ad line 4255: > 4253: %} > 4254: > 4255: instruct vgather_masked_subwordLE8B_avx2(vec dst, memory mem, rRegP > idx, immI_0 offset, vec mask, rRegL midx, rRegP tmp, rRegI rtmp, rRegL rtmp2) > %{ We need to kill rFlagsReg here and other avx2 instructs added since my last review. src/hotspot/cpu/x86/x86.ad line 4268: > 4266: __ mov64($midx$$Register, 0x5555555555555555ULL); > 4267: __ pextq($rtmp2$$Register, $rtmp2$$Register, $midx$$Register); > 4268: } This could be movl and pextd in newly added avx2 instructs as we are limited to 32 byte vectors. ------------- PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1860528833 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1476816723 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1476828631