On Wed, 7 Feb 2024 18:38:29 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >> partially unrolls scalar loop to accumulates values from gather indices into >> a quadword(64bit) slice followed by vector permutation to place the slice >> into appropriate vector lanes, it prevents code bloating and generates >> compact JIT sequence. This coupled with savings from expansive array >> allocation in existing java implementation translates into significant >> performance of 1.5-10x gains with included micro. >> >>  >> >> >> 2) Patch was also compared against modified java fallback implementation by >> replacing temporary array allocation with zero initialized vector and a >> scalar loops which inserts gathered values into vector. But, vector insert >> operation in higher vector lanes is a three step process which first >> extracts the upper vector 128 bit lane, updates it with gather subword value >> and then inserts the lane back to its original position. This makes inserts >> into higher order lanes costly w.r.t to proposed solution. In addition >> generated JIT code for modified fallback implementation was very bulky. This >> may impact in-lining decisions into caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > Review comments resolutions. Thanks for the email-ping! A really nice feature :) I left some more requests, comments and questions. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1582: > 1580: if (elem_bt == T_SHORT) { > 1581: Label case0, case1, case2, case3; > 1582: Label *larr[] = {&case0, &case1, &case2, &case3}; Suggestion: Label* larr[] = {&case0, &case1, &case2, &case3}; src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1584: > 1582: Label *larr[] = {&case0, &case1, &case2, &case3}; > 1583: for (int i = 0; i < 4; i++) { > 1584: // dst[i] = mask ? src[index[i]] : 0 I like these comments a lot! They would be even better if they had a more clear relation to the register names. `dst[i] = mask[i+midx] ? src[idx_base[i]] : 0` It would then be helpful to know the types of these arrays. It seems that `idx_base` is an array with type int, right? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1585: > 1583: for (int i = 0; i < 4; i++) { > 1584: // dst[i] = mask ? src[index[i]] : 0 > 1585: btq(mask, midx); Do I see it right, that `midx` selects what bit is selected? Why this name? Can we have something more descriptive? I know it seems to be generally the style in the AD files to not have comments and use non-descriptive register names. But that really makes the code hard to read, I basically need to reverse-engineer everything, and figure out all invariants, preconditions, postconditions etc myself. That makes reviewing hard. With so many register arguments: how am I supposed to know which registers have what meaning, which ones are modified, etc? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1595: > 1593: assert(elem_bt == T_BYTE, ""); > 1594: Label case0, case1, case2, case3, case4, case5, case6, case7; > 1595: Label *larr[] = {&case0, &case1, &case2, &case3, Suggestion: Label* larr[] = {&case0, &case1, &case2, &case3, src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1604: > 1602: pinsrb(dst, Address(base, rtmp), i); > 1603: bind(*larr[i]); > 1604: incq(midx); I don't know much about labels. But why can they not be local to the scope of the loop? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1624: > 1622: jccb(Assembler::carryClear, *larr[i]); > 1623: movl(rtmp, Address(idx_base, i * 4)); > 1624: addl(rtmp, offset); This is really the only line that is additional to the code in `vgather8b_masked`. Why not just put it under a `bool has_offset` flag, and then you ran just reuse this code here from `vgather8b_masked`, and reduce code duplication? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1694: > 1692: * Gather loop first packs 4 short / 8 byte values from gather indices > 1693: * into quadword lane and then permutes quadword lane into appropriate > 1694: * location in destination vector. Following pseudo code describes the I found this a bit confusing. Would this be another way of saying this: `We divide the gather operations into quadwords of 8 bytes (or 4 shorts). Each quadword is gathered using the vgather8b instruction. We combine the quadwords into the destination register via OR and shifting (i.e. rotating / permuting) the destination register, until every quadword is in the desired location.` src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1695: > 1693: * into quadword lane and then permutes quadword lane into appropriate > 1694: * location in destination vector. Following pseudo code describes the > 1695: * algorithm in detail:- Suggestion: * algorithm in detail: src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1707: > 1705: * > 1706: * With each iteration, doubleword permute indices (0,1) corresponding > 1707: * to assembled quadword gets right shifted by two lane position. Suggestion: * With each iteration, doubleword permute indices (0,1) corresponding to gathered * quadword gets right shifted by two lane position (i.e. 8 bytes). src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1716: > 1714: XMMRegister xtmp3, Register > rtmp, > 1715: Register midx, Register length, > 1716: int vector_len, int vlen_enc) { I would like to see more descriptive names, where I don't have to reverse-engineer their meaning. What are the pre/post-conditions on `midx`? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1723: > 1721: vpxor(dst, dst, dst, vlen_enc); > 1722: vallones(xtmp2, vlen_enc); > 1723: vpsubd(xtmp2, xtmp1, xtmp2 ,vlen_enc); Suggestion: vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc); src/hotspot/cpu/x86/x86.ad line 4111: > 4109: %} > 4110: > 4111: instruct vgather_subwordLE8B(vec dst, memory mem, rRegP idx, immI_0 > offset, rRegP tmp, rRegI rtmp) %{ Suggestion: instruct vgather_subwordLE8B(vec dst, memory mem, rRegP base_index, immI_0 offset, rRegP tmp, rRegI rtmp) %{ For consistency src/hotspot/cpu/x86/x86.ad line 4120: > 4118: BasicType elem_bt = Matcher::vector_element_basic_type(this); > 4119: __ lea($tmp$$Register, $mem$$Address); > 4120: __ vgather8b(elem_bt, $dst$$XMMRegister, $tmp$$Register, > $idx$$Register, $rtmp$$Register, vlen_enc); The `LE8B` and `Matcher::vector_length_in_bytes(n) <= 8` suggest we can perform this with 4 bytes as well. Is that correct? Would that not lead to issues, when we are then reading `base_index` at bytes 4...7, which possibly have garbage, and then use that to gather? Do we have tests for that? src/hotspot/cpu/x86/x86.ad line 4143: > 4141: %} > 4142: > 4143: instruct vgather_subwordLE8B_off(vec dst, memory mem, rRegP idx, rRegI > offset, rRegP tmp, rRegI rtmp, rFlagsReg cr) %{ Suggestion: instruct vgather_subwordLE8B_offset(vec dst, memory mem, rRegP idx, rRegI offset, rRegP tmp, rRegI rtmp, rFlagsReg cr) %{ src/hotspot/cpu/x86/x86.ad line 4147: > 4145: match(Set dst (LoadVectorGather mem (Binary idx offset))); > 4146: effect(TEMP tmp, TEMP rtmp, KILL cr); > 4147: format %{ "vector_gatherLE8 $dst, $mem, $idx, $offset\t! using $tmp > and $rtmp as TEMP" %} Suggestion: format %{ "vector_gatherLE8_offset $dst, $mem, $idx, $offset\t! using $tmp and $rtmp as TEMP" %} src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3071: > 3069: .fromArray(lsp, indexMap, mapOffset + i) > 3070: .add(offset); > 3071: vix = VectorIntrinsics.checkIndex(vix, a.length); are you using the `vix` after this assignment? src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ShortVector.java line 3809: > 3807: > 3808: // Check indices are within array bounds. > 3809: // FIXME: Check index under mask controlling. Did you mean to leave a FIXME? If so, please reference a JIRA bug number where you intend to fix it. ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1889647285 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495342140 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495355989 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495351192 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495342349 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495341931 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495361896 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495400579 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495366111 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495405212 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495414923 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495415770 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495418305 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495424410 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495426048 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495426552 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495434432 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495440107