On Mon, 26 Aug 2024 22:17:55 GMT, Sandhya Viswanathan <sviswanat...@openjdk.org> wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Review comments resolutions. > > src/hotspot/cpu/x86/assembler_x86.cpp line 10229: > >> 10227: InstructionMark im(this); >> 10228: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit >> || VM_Version::supports_avx512vl()), ""); >> 10229: InstructionAttr attributes(vector_len, /* vex_w */ true,/* >> legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true); > > vex_w could be false here. Encoding specification mentions W bit gets ignored, so no functional issues, will make it false to comply with our convention. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6698: > >> 6696: // Unsigned values ranges comprise of only +ve numbers, thus there >> exist only an upper bound saturation. >> 6697: // overflow = ((UMAX - MAX(SRC1 & SRC2)) <u MIN(SRC1, SRC2)) >>> 31 >> == 1 >> 6698: // Res = Signed Add INP1, INP2 > > The >>> 31 is not coded so comment could be improved to match the code. > Comment has SRC1/INP1 term mixed. > Also, could overflow not be implemented based on much simpler Java scalar > algo: > Overflow = Res <u (INP1 | INP2) > This is much straight forward, also evex supports unsigned comparison. Java scalar algo was empirically proved to hold good, I also verified with Alive2 solver which proved its semantic equivalence to HD section 2-13 based vector implementation. Here is the link to Alive2 solver which operates on LLVM IR inputs. [https://alive2.llvm.org/ce/z/XDQ7dY](https://alive2.llvm.org/ce/z/XDQ7dY) > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6749: > >> 6747: vpor(xtmp2, xtmp2, src2, vlen_enc); >> 6748: // Compute mask for muxing T1 with T3 using SRC1. >> 6749: vpsign_extend_dq(etype, xtmp4, src1, vlen_enc); > > I don't think we need to do the sign extension. The blend instruction uses > most significant bit to do the blend. Original vector is has double / quad word lanes which are being blended using byte level mask, sign extension will ensure that sign bit is propagated to MSB bits of each constituent byte mask corresponding double / quad word source lane. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6939: > >> 6937: >> 6938: // Compose saturating min/max vector using first input polarity mask. >> 6939: vpsign_extend_dq(etype, xtmp4, src1, vlen_enc); > > Sign extend to lower bits not needed as blend uses msbit only. Original vector is has double / quad word lanes which are being blended using byte level mask, sign extension will ensure that sign bit is propagated to MSB bits of each constituent byte mask corresponding double / quad word source lane. > src/hotspot/cpu/x86/x86.ad line 1953: > >> 1951: if (UseAVX < 1 || size_in_bits < 128 || (size_in_bits == 512 && >> !VM_Version::supports_avx512bw())) { >> 1952: return false; >> 1953: } > > UseAVX < 1 could be written as UseAVX == 0. Could we not do register version > for size_in_bit < 128? I get your point, but constraints ensure we only address cases with vector size >= 128 bit in this patch.. > src/hotspot/cpu/x86/x86.ad line 10635: > >> 10633: %} >> 10634: >> 10635: instruct saturating_unsigned_add_reg_avx(vec dst, vec src1, vec src2, >> vec xtmp1, vec xtmp2, vec xtmp3, vec xtmp4) > > Should the temp here and all the places related to !avx512vl() be legVec > instead of vec? Predicate already has AVX512VL check and so does dynamic register classes associated with its operands. > src/hotspot/cpu/x86/x86.ad line 10656: > >> 10654: match(Set dst (SaturatingSubVI src1 src2)); >> 10655: match(Set dst (SaturatingSubVL src1 src2)); >> 10656: effect(TEMP ktmp); > > This needs TEMP dst as well. There is no use of either of the source operands after assignment to dst in the macro assembly routine. > src/java.base/share/classes/java/lang/Byte.java line 647: > >> 645: */ >> 646: public static byte subSaturating(byte a, byte b) { >> 647: byte res = (byte)(a - b); > > Could we not do subSaturating as an int operation on similar lines as > addSaturating? Yes, core libs also have {add/subtract}Exact API which instead of saturating over / underflowing values throws ArithmeticException. Streamlining overflow checks for saturating long operations on the same lines to address Joe's concerns on new constants. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740820063 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819360 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740823487 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740823011 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819742 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819629 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740822270 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740821085