On Mon, 26 Aug 2024 22:17:55 GMT, Sandhya Viswanathan 
<sviswanat...@openjdk.org> wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Review comments resolutions.
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 10229:
> 
>> 10227:   InstructionMark im(this);
>> 10228:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit 
>> || VM_Version::supports_avx512vl()), "");
>> 10229:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* 
>> legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);
> 
> vex_w could be false here.

Encoding specification mentions W bit gets ignored, so no functional issues, 
will make it false to comply with our convention.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6698:
> 
>> 6696:   // Unsigned values ranges comprise of only +ve numbers, thus there 
>> exist only an upper bound saturation.
>> 6697:   // overflow = ((UMAX - MAX(SRC1 & SRC2)) <u MIN(SRC1, SRC2)) >>> 31 
>> == 1
>> 6698:   // Res = Signed Add INP1, INP2
> 
> The >>> 31 is not coded so comment could be improved to match the code.
> Comment has SRC1/INP1 term mixed.
> Also, could overflow not be implemented based on much simpler Java scalar 
> algo:
> Overflow = Res <u  (INP1 | INP2)
> This is much straight forward, also evex supports unsigned comparison.

Java scalar algo was empirically proved to hold good, I also verified with 
Alive2 solver which proved its semantic equivalence to HD section 2-13 based 
vector implementation.
Here is the link to Alive2 solver which operates on LLVM IR inputs.
[https://alive2.llvm.org/ce/z/XDQ7dY](https://alive2.llvm.org/ce/z/XDQ7dY)

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6749:
> 
>> 6747:   vpor(xtmp2, xtmp2, src2, vlen_enc);
>> 6748:   // Compute mask for muxing T1 with T3 using SRC1.
>> 6749:   vpsign_extend_dq(etype, xtmp4, src1, vlen_enc);
> 
> I don't think we need to do the sign extension. The blend instruction uses 
> most significant bit to do the blend.

Original vector is has double / quad word lanes which are being blended using 
byte level mask, sign extension will ensure that sign bit is propagated to MSB 
bits of each constituent byte mask corresponding double / quad word source lane.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6939:
> 
>> 6937: 
>> 6938:   // Compose saturating min/max vector using first input polarity mask.
>> 6939:   vpsign_extend_dq(etype, xtmp4, src1, vlen_enc);
> 
> Sign extend to lower bits not needed as blend uses msbit only.

Original vector is has double / quad word lanes which are being blended using 
byte level mask, sign extension will ensure that sign bit is propagated to MSB 
bits of each constituent byte mask corresponding double / quad word source lane.

> src/hotspot/cpu/x86/x86.ad line 1953:
> 
>> 1951:        if (UseAVX < 1 || size_in_bits < 128 || (size_in_bits == 512 && 
>> !VM_Version::supports_avx512bw())) {
>> 1952:          return false;
>> 1953:        }
> 
> UseAVX < 1 could be written as UseAVX == 0. Could we not do register version 
> for size_in_bit < 128?

I get your point, but constraints ensure we only address cases with vector size 
>= 128 bit in this patch..

> src/hotspot/cpu/x86/x86.ad line 10635:
> 
>> 10633: %}
>> 10634: 
>> 10635: instruct saturating_unsigned_add_reg_avx(vec dst, vec src1, vec src2, 
>> vec xtmp1, vec xtmp2, vec xtmp3, vec xtmp4)
> 
> Should the temp here and all the places related to !avx512vl() be legVec 
> instead of vec?

Predicate already has AVX512VL check and so does dynamic register classes 
associated with its operands.

> src/hotspot/cpu/x86/x86.ad line 10656:
> 
>> 10654:   match(Set dst (SaturatingSubVI src1 src2));
>> 10655:   match(Set dst (SaturatingSubVL src1 src2));
>> 10656:   effect(TEMP ktmp);
> 
> This needs TEMP dst as well.

There is no use of either of the source operands after assignment to dst in the 
macro assembly routine.

> src/java.base/share/classes/java/lang/Byte.java line 647:
> 
>> 645:      */
>> 646:     public static byte subSaturating(byte a, byte b) {
>> 647:         byte res = (byte)(a - b);
> 
> Could we not do subSaturating as an int operation on similar lines as 
> addSaturating?

Yes, core libs also have {add/subtract}Exact API which instead of saturating 
over / underflowing values throws ArithmeticException. Streamlining overflow 
checks for saturating long operations on the same lines to address Joe's 
concerns on new constants.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740820063
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819360
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740823487
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740823011
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819742
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819629
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740822270
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740821085

Reply via email to