Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

Emanuel Peter Tue, 20 Feb 2024 00:54:08 -0800

On Wed, 7 Feb 2024 18:38:29 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and 
>> AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially 
>> partially unrolls scalar loop to accumulates values from gather indices into 
>> a quadword(64bit) slice followed by vector permutation to place the slice 
>> into appropriate vector lanes, it prevents code bloating and generates 
>> compact JIT sequence. This coupled with savings from expansive array 
>> allocation in existing java implementation translates into significant 
>> performance of 1.5-10x gains with included micro.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by 
>> replacing temporary array allocation with zero initialized vector and a 
>> scalar loops which inserts gathered values into vector. But, vector insert 
>> operation in higher vector lanes is a three step process which first 
>> extracts the upper vector 128 bit lane, updates it with gather subword value 
>> and then inserts the lane back to its original position. This makes inserts 
>> into higher order lanes costly w.r.t to proposed solution. In addition 
>> generated JIT code for modified fallback implementation was very bulky. This 
>> may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Review comments resolutions.

Thanks for the email-ping!
A really nice feature :)
I left some more requests, comments and questions.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1582:

> 1580:   if (elem_bt == T_SHORT) {
> 1581:     Label case0, case1, case2, case3;
> 1582:     Label *larr[] = {&case0, &case1, &case2, &case3};

Suggestion:

    Label* larr[] = {&case0, &case1, &case2, &case3};

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1584:

> 1582:     Label *larr[] = {&case0, &case1, &case2, &case3};
> 1583:     for (int i = 0; i < 4; i++) {
> 1584:       // dst[i] = mask ? src[index[i]] : 0

I like these comments a lot!
They would be even better if they had a more clear relation to the register 
names.

`dst[i] = mask[i+midx] ? src[idx_base[i]] : 0`
It would then be helpful to know the types of these arrays.
It seems that `idx_base` is an array with type int, right?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1585:

> 1583:     for (int i = 0; i < 4; i++) {
> 1584:       // dst[i] = mask ? src[index[i]] : 0
> 1585:       btq(mask, midx);

Do I see it right, that `midx` selects what bit is selected? Why this name? Can 
we have something more descriptive?
I know it seems to be generally the style in the AD files to not have comments 
and use non-descriptive register names. But that really makes the code hard to 
read, I basically need to reverse-engineer everything, and figure out all 
invariants, preconditions, postconditions etc myself. That makes reviewing hard.
With so many register arguments: how am I supposed to know which registers have 
what meaning, which ones are modified, etc?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1595:

> 1593:     assert(elem_bt == T_BYTE, "");
> 1594:     Label case0, case1, case2, case3, case4, case5, case6, case7;
> 1595:     Label *larr[] = {&case0, &case1, &case2, &case3,

Suggestion:

    Label* larr[] = {&case0, &case1, &case2, &case3,

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1604:

> 1602:       pinsrb(dst, Address(base, rtmp), i);
> 1603:       bind(*larr[i]);
> 1604:       incq(midx);

I don't know much about labels. But why can they not be local to the scope of 
the loop?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1624:

> 1622:       jccb(Assembler::carryClear, *larr[i]);
> 1623:       movl(rtmp, Address(idx_base, i * 4));
> 1624:       addl(rtmp, offset);

This is really the only line that is additional to the code in 
`vgather8b_masked`. Why not just put it under a `bool has_offset` flag, and 
then you ran just reuse this code here from `vgather8b_masked`, and reduce code 
duplication?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1694:

> 1692:  * Gather loop first packs 4 short / 8 byte values from gather indices
> 1693:  * into quadword lane and then permutes quadword lane into appropriate
> 1694:  * location in destination vector. Following pseudo code describes the

I found this a bit confusing. Would this be another way of saying this:
`We divide the gather operations into quadwords of 8 bytes (or 4 shorts). Each 
quadword is gathered using the vgather8b instruction. We combine the quadwords 
into the destination register via OR and shifting (i.e. rotating / permuting) 
the destination register, until every quadword is in the desired location.`

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1695:

> 1693:  * into quadword lane and then permutes quadword lane into appropriate
> 1694:  * location in destination vector. Following pseudo code describes the
> 1695:  * algorithm in detail:-

Suggestion:

 * algorithm in detail:

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1707:

> 1705:  *
> 1706:  * With each iteration, doubleword permute indices (0,1) corresponding
> 1707:  * to assembled quadword gets right shifted by two lane position.

Suggestion:

 * With each iteration, doubleword permute indices (0,1) corresponding to 
gathered
 * quadword gets right shifted by two lane position (i.e. 8 bytes).

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1716:

> 1714:                                         XMMRegister xtmp3, Register 
> rtmp,
> 1715:                                         Register midx, Register length,
> 1716:                                         int vector_len, int vlen_enc) {

I would like to see more descriptive names, where I don't have to 
reverse-engineer their meaning.
What are the pre/post-conditions on `midx`?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1723:

> 1721:   vpxor(dst, dst, dst, vlen_enc);
> 1722:   vallones(xtmp2, vlen_enc);
> 1723:   vpsubd(xtmp2, xtmp1, xtmp2 ,vlen_enc);

Suggestion:

  vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc);

src/hotspot/cpu/x86/x86.ad line 4111:

> 4109: %}
> 4110: 
> 4111: instruct vgather_subwordLE8B(vec dst, memory mem, rRegP idx, immI_0 
> offset, rRegP tmp, rRegI rtmp) %{

Suggestion:

instruct vgather_subwordLE8B(vec dst, memory mem, rRegP base_index, immI_0 
offset, rRegP tmp, rRegI rtmp) %{

For consistency

src/hotspot/cpu/x86/x86.ad line 4120:

> 4118:     BasicType elem_bt = Matcher::vector_element_basic_type(this);
> 4119:     __ lea($tmp$$Register, $mem$$Address);
> 4120:     __ vgather8b(elem_bt, $dst$$XMMRegister, $tmp$$Register, 
> $idx$$Register, $rtmp$$Register, vlen_enc);

The `LE8B` and `Matcher::vector_length_in_bytes(n) <= 8` suggest we can perform 
this with 4 bytes as well.
Is that correct?
Would that not lead to issues, when we are then reading `base_index` at bytes 
4...7, which possibly have garbage, and then use that to gather?
Do we have tests for that?

src/hotspot/cpu/x86/x86.ad line 4143:

> 4141: %}
> 4142: 
> 4143: instruct vgather_subwordLE8B_off(vec dst, memory mem, rRegP idx, rRegI 
> offset, rRegP tmp, rRegI rtmp, rFlagsReg cr) %{

Suggestion:

instruct vgather_subwordLE8B_offset(vec dst, memory mem, rRegP idx, rRegI 
offset, rRegP tmp, rRegI rtmp, rFlagsReg cr) %{

src/hotspot/cpu/x86/x86.ad line 4147:

> 4145:   match(Set dst (LoadVectorGather mem (Binary idx offset)));
> 4146:   effect(TEMP tmp, TEMP rtmp, KILL cr);
> 4147:   format %{ "vector_gatherLE8 $dst, $mem, $idx, $offset\t! using $tmp 
> and $rtmp as TEMP" %}

Suggestion:

  format %{ "vector_gatherLE8_offset $dst, $mem, $idx, $offset\t! using $tmp 
and $rtmp as TEMP" %}

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
line 3071:

> 3069:                 .fromArray(lsp, indexMap, mapOffset + i)
> 3070:                 .add(offset);
> 3071:             vix = VectorIntrinsics.checkIndex(vix, a.length);

are you using the `vix` after this assignment?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ShortVector.java 
line 3809:

> 3807: 
> 3808:         // Check indices are within array bounds.
> 3809:         // FIXME: Check index under mask controlling.

Did you mean to leave a FIXME? If so, please reference a JIRA bug number where 
you intend to fix it.

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1889647285
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495342140
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495355989
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495351192
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495342349
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495341931
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495361896
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495400579
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495366111
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495405212
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495414923
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495415770
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495418305
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495424410
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495426048
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495426552
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495434432
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495440107

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

Reply via email to