On Thu, 1 Feb 2024 16:24:16 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and 
>> AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially 
>> partially unrolls scalar loop to accumulates values from gather indices into 
>> a quadword(64bit) slice followed by vector permutation to place the slice 
>> into appropriate vector lanes, it prevents code bloating and generates 
>> compact JIT sequence. This coupled with savings from expansive array 
>> allocation in existing java implementation translates into significant 
>> performance of 1.5-10x gains with included micro.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by 
>> replacing temporary array allocation with zero initialized vector and a 
>> scalar loops which inserts gathered values into vector. But, vector insert 
>> operation in higher vector lanes is a three step process which first 
>> extracts the upper vector 128 bit lane, updates it with gather subword value 
>> and then inserts the lane back to its original position. This makes inserts 
>> into higher order lanes costly w.r.t to proposed solution. In addition 
>> generated JIT code for modified fallback implementation was very bulky. This 
>> may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 10 commits:
> 
>  - Generalizing masked sub-gather support.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
>  - Fix incorrect comment
>  - Review comments resolutions.
>  - Review comments resolutions.
>  - Review comments resolutions.
>  - Restricting masked sub-word gather to AVX512 target to align with integral 
> gather support.
>  - Review comments resolution.
>  - 8318650: Optimized subword gather for x86 targets.

I only have two comments, rest of the PR looks good to me. Thanks for taking a 
conservative stance and reverting to a simple solution for now.

src/hotspot/cpu/x86/x86.ad line 4255:

> 4253: %}
> 4254: 
> 4255: instruct vgather_masked_subwordLE8B_avx2(vec dst, memory mem, rRegP 
> idx, immI_0 offset, vec mask, rRegL midx, rRegP tmp, rRegI rtmp, rRegL rtmp2) 
> %{

We need to kill rFlagsReg  here and other avx2 instructs added since my last 
review.

src/hotspot/cpu/x86/x86.ad line 4268:

> 4266:       __ mov64($midx$$Register, 0x5555555555555555ULL);
> 4267:       __ pextq($rtmp2$$Register, $rtmp2$$Register, $midx$$Register);
> 4268:     }

This could be movl and pextd in newly added avx2 instructs as we are limited to 
32 byte vectors.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1860528833
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1476816723
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1476828631

Reply via email to