On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi All, >> >> As per the discussion on panama-dev mailing list[1], patch adds the support >> following new vector operators. >> >> >> . SUADD : Saturating unsigned addition. >> . SADD : Saturating signed addition. >> . SUSUB : Saturating unsigned subtraction. >> . SSUB : Saturating signed subtraction. >> . UMAX : Unsigned max >> . UMIN : Unsigned min. >> >> >> New vector operators are applicable to only integral types since their >> values wraparound in over/underflowing scenarios after setting appropriate >> status flags. For floating point types, as per IEEE 754 specs there are >> multiple schemes to handler underflow, one of them is gradual underflow >> which transitions the value to subnormal range. Similarly, overflow >> implicitly saturates the floating-point value to an Infinite value. >> >> As the name suggests, these are saturating operations, i.e. the result of >> the computation is strictly capped by lower and upper bounds of the result >> type and is not wrapped around in underflowing or overflowing scenarios. >> >> Summary of changes: >> - Java side implementation of new vector operators. >> - Add new scalar saturating APIs for each of the above saturating vector >> operator in corresponding primitive box classes, fallback implementation of >> vector operators is based over it. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for new vector operators and their >> predicated counterparts. >> - Extends existing VectorAPI Jtreg test suite to cover new operations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> PS: Intrinsification and auto-vectorization of new core-lib API will be >> addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > Review comments resolutions. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6674: > 6672: // Res = Mask ? Zero : Res > 6673: evmovdqu(etype, ktmp, dst, dst, false, vlen_enc); > 6674: } We could directly do masked evpsubd/evpsubq here with merge as false. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6698: > 6696: // Unsigned values ranges comprise of only +ve numbers, thus there > exist only an upper bound saturation. > 6697: // overflow = ((UMAX - MAX(SRC1 & SRC2)) <u MIN(SRC1, SRC2)) >>> 31 > == 1 > 6698: // Res = Signed Add INP1, INP2 The >>> 31 is not coded so comment could be improved to match the code. Comment has SRC1/INP1 term mixed. Also, could overflow not be implemented based on much simpler Java scalar algo: Overflow = Res <u (INP1 | INP2) This is much straight forward, also evex supports unsigned comparison. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6716: > 6714: // > 6715: // Adaptation of unsigned addition overflow detection from hacker's > delight > 6716: // section 2-13 : overflow = ((a & b) | ((a | b) & ~(s))) >>> 31 == 1 Not clear what is s here? I think it is s = a + b. Could you please update the comments to indicate this. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6738: > 6736: XMMRegister > xtmp1, XMMRegister xtmp2, XMMRegister xtmp3, > 6737: XMMRegister > xtmp4, int vlen_enc) { > 6738: // Res = Signed Add INP1, INP2 Wondering if we could implement overflow here also based on much simpler Java scalar algo: Overflow = Res <u (INP1 | INP2) which is same as: Res <s ((INP1 + MIN) | (INP2 + MIN)) src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6745: > 6743: vpcmpeqd(xtmp3, xtmp3, xtmp3, vlen_enc); > 6744: // T2 = ~Res > 6745: vpxor(xtmp2, xtmp3, dst, vlen_enc); Did you mean this to be T3 = ~Res src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6749: > 6747: vpor(xtmp2, xtmp2, src2, vlen_enc); > 6748: // Compute mask for muxing T1 with T3 using SRC1. > 6749: vpsign_extend_dq(etype, xtmp4, src1, vlen_enc); I don't think we need to do the sign extension. The blend instruction uses most significant bit to do the blend. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6932: > 6930: > 6931: // Sign-extend to compute overflow detection mask. > 6932: vpsign_extend_dq(etype, xtmp3, xtmp2, vlen_enc); Sign extend to lower bits not needed as blend uses msbit only. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6939: > 6937: > 6938: // Compose saturating min/max vector using first input polarity mask. > 6939: vpsign_extend_dq(etype, xtmp4, src1, vlen_enc); Sign extend to lower bits not needed as blend uses msbit only. src/hotspot/cpu/x86/x86.ad line 10656: > 10654: match(Set dst (SaturatingSubVI src1 src2)); > 10655: match(Set dst (SaturatingSubVL src1 src2)); > 10656: effect(TEMP ktmp); This needs TEMP dst as well. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737116841 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737272705 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737306541 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737307396 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737325898 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737338765 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737467234 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737467902 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737489758