Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Andrew Dinn Fri, 02 Jun 2023 04:05:49 -0700

On Fri, 2 Jun 2023 09:58:59 GMT, Andrew Dinn <ad...@openjdk.org> wrote:


>> Yes, of course, you are right that  0<= U_2 < 6 at the point where that 
>> second multiply by 5 occurs (i.e. after the loop).
>> 
>> I believe it is safe to use the same optimization inside the loop for 
>> reasons given below. Of course it is a bit late to change this now and 
>> retest but if my reasoning is correct then we could consider updating this 
>> post release and, maybe, a backport.
>> 
>> The only thing that needs to be determined is what value could sit in U2 
>> when we enter the loop. That's the only important case because we already 
>> agreed that at the loop back edge that 0 <= U2 < 6.
>> 
>> The incoming value for U2 at loop entry is derived by the following 
>> subsequence of the instruction stream
>> 
>>       __ adcs(S_1, U_1, S_1);
>>       __ adc(S_2, U_2, zr);         // A.1
>>       __ add(S_2, S_2, 1);          // A.2
>>      . . .
>>        wide_mul(U_1, U_1HI, S_0, R_1);  wide_madd(U_1, U_1HI, S_1, R_0);  
>> wide_madd(U_1, U_1HI, S_2, RR_1); // B
>>      . . .
>>       __ andr(U_2, R_0, 3);         // C
>>       __ mul(U_2, S_2, U_2);        // D
>>       . . .
>> 
>>       __ adc(U_2, U_1HI, U_2);      // E
>> 
>> At A.1 we know that 0 <= U_2 <= 3 (since it was initialized by unpack26)
>> So, at A.2 we know that 0 <= S2 <= 5
>> 
>> At B we know that 0 <= RR_1 <= (2^60 - 2^2)  = FFFFFFF_FFFFFFFC (top 4 and 
>> bottom 2 bits of RR_1 are clear)
>> So 0 <= U1_HI < 5 * FFFFFFF_FFFFFFFC = 4FFFFFFF_FFFFFFEC
>> 
>> At C we know 0 <= U_2 <= 3
>> 
>> At D we know 0 <= U_2 <= 15
>> 
>> So at E we know that 0 <= U_2 <= 4FFFFFFF_FFFFFFEC + 15 + 1
>> 
>> So, the highest possible value for U_2 at loop entry is 50000000_00000002.
>> 
>> Clearly we can shift this down by two and add without any danger of 
>> overflowing
>> 
>>  50000000_00000002 >> 2 + 50000000_00000002 = 64000000_00000002
>
> Ah, no scratch that. I have made a wrong assumption at B. The value of U1_HI 
> is bounded by the sum of the 3 64 bit * 64 bit multiplies.

I think there is still a proof of validity to be salvaged though. We compute a 
128 bit product:

  U1_HI:U1 = S_0 * R_1 + S_1 * R_0 + S_2 * RR_1

We know that R_0 and R_1 have four top bits clear and S2 <= 5. So, I think we 
can guarantee that the top word of the 128 bit product is small enough to not 
to overflow when we do the shift. Even if we assume S_0, S_1 and RR_1 have the 
maximum possible value we have

S_0 * R_1 <= (2^64 - 1) * (2^60 - 1)
S_1 * R_0 <= (2^64 - 1) * (2^60 - 1)
S_2 * RR_1  <= 5 * (2^64 - 1)

So, the top word U1_HI is bounded by at most (2 * (2^60 - 1)) + 5. That leaves 
more than enough room to guarantee that the shift and add will not overflow.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1214230004

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Reply via email to