On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> Hi All,
>> 
>> As per the discussion on panama-dev mailing list[1], patch adds the support 
>> following new vector operators.
>> 
>> 
>>      . SUADD   : Saturating unsigned addition.
>>      . SADD    : Saturating signed addition. 
>>      . SUSUB   : Saturating unsigned subtraction.
>>      . SSUB    : Saturating signed subtraction.
>>      . UMAX    : Unsigned max
>>      . UMIN    : Unsigned min.
>>      
>> 
>> New vector operators are applicable to only integral types since their 
>> values wraparound in over/underflowing scenarios after setting appropriate 
>> status flags. For floating point types, as per IEEE 754 specs there are 
>> multiple schemes to handler underflow, one of them is gradual underflow 
>> which transitions the value to subnormal range. Similarly, overflow 
>> implicitly saturates the floating-point value to an Infinite value.
>> 
>> As the name suggests, these are saturating operations, i.e. the result of 
>> the computation is strictly capped by lower and upper bounds of the result 
>> type and is not wrapped around in underflowing or overflowing scenarios.
>> 
>> Summary of changes:
>> - Java side implementation of new vector operators.
>> - Add new scalar saturating APIs for each of the above saturating vector 
>> operator in corresponding primitive box classes, fallback implementation of 
>> vector operators is based over it.
>> - C2 compiler IR and inline expander changes.
>> - Optimized x86 backend implementation for new vector operators and their 
>> predicated counterparts.
>> - Extends existing VectorAPI Jtreg test suite to cover new operations.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> PS: Intrinsification and auto-vectorization of new core-lib API will be 
>> addressed separately in a follow-up patch.
>> 
>> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Review comments resolutions.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6674:

> 6672:   // Res = Mask ? Zero : Res
> 6673:   evmovdqu(etype, ktmp, dst, dst, false, vlen_enc);
> 6674: }

We could directly do masked evpsubd/evpsubq here with merge as false.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6698:

> 6696:   // Unsigned values ranges comprise of only +ve numbers, thus there 
> exist only an upper bound saturation.
> 6697:   // overflow = ((UMAX - MAX(SRC1 & SRC2)) <u MIN(SRC1, SRC2)) >>> 31 
> == 1
> 6698:   // Res = Signed Add INP1, INP2

The >>> 31 is not coded so comment could be improved to match the code.
Comment has SRC1/INP1 term mixed.
Also, could overflow not be implemented based on much simpler Java scalar algo:
Overflow = Res <u  (INP1 | INP2)
This is much straight forward, also evex supports unsigned comparison.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6716:

> 6714: //
> 6715: // Adaptation of unsigned addition overflow detection from hacker's 
> delight
> 6716: // section 2-13 : overflow = ((a & b) | ((a | b) & ~(s))) >>> 31 == 1

Not clear what is s here?  I think it is s = a + b. Could you please update the 
comments to indicate this.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6738:

> 6736:                                                        XMMRegister 
> xtmp1, XMMRegister xtmp2, XMMRegister xtmp3,
> 6737:                                                        XMMRegister 
> xtmp4, int vlen_enc) {
> 6738:   // Res = Signed Add INP1, INP2

Wondering if we could implement overflow here also based on much simpler Java 
scalar algo:
Overflow = Res <u (INP1 | INP2) which is same as:
                   Res <s ((INP1 + MIN) | (INP2 + MIN))

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6745:

> 6743:   vpcmpeqd(xtmp3, xtmp3, xtmp3, vlen_enc);
> 6744:   // T2 = ~Res
> 6745:   vpxor(xtmp2, xtmp3, dst, vlen_enc);

Did you mean this to be T3 = ~Res

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6749:

> 6747:   vpor(xtmp2, xtmp2, src2, vlen_enc);
> 6748:   // Compute mask for muxing T1 with T3 using SRC1.
> 6749:   vpsign_extend_dq(etype, xtmp4, src1, vlen_enc);

I don't think we need to do the sign extension. The blend instruction uses most 
significant bit to do the blend.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6932:

> 6930: 
> 6931:   // Sign-extend to compute overflow detection mask.
> 6932:   vpsign_extend_dq(etype, xtmp3, xtmp2, vlen_enc);

Sign extend to lower bits not needed as blend uses msbit only.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6939:

> 6937: 
> 6938:   // Compose saturating min/max vector using first input polarity mask.
> 6939:   vpsign_extend_dq(etype, xtmp4, src1, vlen_enc);

Sign extend to lower bits not needed as blend uses msbit only.

src/hotspot/cpu/x86/x86.ad line 10656:

> 10654:   match(Set dst (SaturatingSubVI src1 src2));
> 10655:   match(Set dst (SaturatingSubVL src1 src2));
> 10656:   effect(TEMP ktmp);

This needs TEMP dst as well.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737116841
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737272705
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737306541
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737307396
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737325898
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737338765
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737467234
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737467902
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737489758

Reply via email to