Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Andrew Dinn Fri, 02 Jun 2023 03:07:31 -0700

On Fri, 2 Jun 2023 09:51:57 GMT, Andrew Dinn <ad...@openjdk.org> wrote:


>>> This comment and the next one both need correcting. They mention U_0HI and 
>>> U_1HI and, as the previous comment says, those registers are dead.
>>>
>>> What actually happens here is best summarized as
>>>
>>> // U_2:U_1:U_0 += (U2 >> 2) * 5
>>>
>>> or, if we actually want to be clearer about the current encoding which does 
>>> it in several steps
>>>
>>> // rscratch1 = (U2 >> 2) // U2 = U2[1:0] // U_2:U_1:U_0 += rscratch1 // 
>>> U_2:U_1:U_0 += (rscratch1 << 2)
>>>
>>> i.e. any bits that are set from 130 upwards are masked off, treated as an 
>>> integer in their own right, multiplied by 5 and the result added back in at 
>>> the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].
>> 
>> OK.
>> 
>>> I'm not sure whether this provides an opportunity for you to optimize this 
>>> by doing the multiply by five earlier i.e. replace the code with this 
>>> version
>> 
>> I'm not sure either, which is why it's done in two separate steps. I think 
>> you may be right, but it's a bit late to be optimizing this version any 
>> further. That would require careful analysis and a redo of all the testing.
>> 
>>> The obvious concern is that the multiply of rscratch1 by 5 might overflow 
>>> 64 bits. Is that why you have implemented two add and carry steps?
>> 
>> Indeed.
>> 
>>> If so then why is it legitimate to do the multiply by 5 up front in the 
>>> final reduction that follows the loop?
>> 
>> I assume that you're referring to the multiply by 5 in
>> 
>> 
>> // Further reduce modulo 2^130 - 5                                           
>>                                                                              
>>                                                                              
>>                                                                              
>>                                          
>>     __ lsr(rscratch1, U_2, 2);
>>     __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2); // rscratch1 
>> = U_2 * 5                                                                    
>>                                                                              
>>                                                                              
>>                                          
>> 
>> 
>> `U_2`, at this point, has only a few lower set bits. This is because `U_2` 
>> was previously ANDed with 3, and subsequently twice was the target of 
>> adc(U_2, U_2, zr). So I think that `U_2 <= 6`.
>
> Yes, of course, you are right that  0<= U_2 < 6 at the point where that 
> second multiply by 5 occurs (i.e. after the loop).
> 
> I believe it is safe to use the same optimization inside the loop for reasons 
> given below. Of course it is a bit late to change this now and retest but if 
> my reasoning is correct then we could consider updating this post release 
> and, maybe, a backport.
> 
> The only thing that needs to be determined is what value could sit in U2 when 
> we enter the loop. That's the only important case because we already agreed 
> that at the loop back edge that 0 <= U2 < 6.
> 
> The incoming value for U2 at loop entry is derived by the following 
> subsequence of the instruction stream
> 
>       __ adcs(S_1, U_1, S_1);
>       __ adc(S_2, U_2, zr);         // A.1
>       __ add(S_2, S_2, 1);          // A.2
>      . . .
>        wide_mul(U_1, U_1HI, S_0, R_1);  wide_madd(U_1, U_1HI, S_1, R_0);  
> wide_madd(U_1, U_1HI, S_2, RR_1); // B
>      . . .
>       __ andr(U_2, R_0, 3);         // C
>       __ mul(U_2, S_2, U_2);        // D
>       . . .
> 
>       __ adc(U_2, U_1HI, U_2);      // E
> 
> At A.1 we know that 0 <= U_2 <= 3 (since it was initialized by unpack26)
> So, at A.2 we know that 0 <= S2 <= 5
> 
> At B we know that 0 <= RR_1 <= (2^60 - 2^2)  = FFFFFFF_FFFFFFFC (top 4 and 
> bottom 2 bits of RR_1 are clear)
> So 0 <= U1_HI < 5 * FFFFFFF_FFFFFFFC = 4FFFFFFF_FFFFFFEC
> 
> At C we know 0 <= U_2 <= 3
> 
> At D we know 0 <= U_2 <= 15
> 
> So at E we know that 0 <= U_2 <= 4FFFFFFF_FFFFFFEC + 15 + 1
> 
> So, the highest possible value for U_2 at loop entry is 50000000_00000002.
> 
> Clearly we can shift this down by two and add without any danger of 
> overflowing
> 
>  50000000_00000002 >> 2 + 50000000_00000002 = 64000000_00000002

Ah, no scratch that. I have made a wrong assumption at B. The value of U1_HI is 
bounded by the sum of the 3 multiplies.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1214174121

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Reply via email to