On Thu, 1 Jun 2023 16:06:40 GMT, Andrew Haley <a...@openjdk.org> wrote:

>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7135:
>> 
>>> 7133:       regs = (regs.remaining() + U_0HI + U_1HI).begin();
>>> 7134: 
>>> 7135:       // U_2:U_1:U_0 += (U_1HI >> 2)
>> 
>> This comment and the next one both need correcting. They mention U_0HI and 
>> U_1HI and, as the previous comment says, those registers are dead.
>> 
>> What actually happens here is best summarized as
>> 
>> // U_2:U_1:U_0 += (U2 >> 2) * 5
>> 
>> or, if we actually want to be clearer about the current encoding which does 
>> it in several steps
>> 
>> // rscratch1 = (U2 >> 2)
>> // U2 = U2[1:0]
>> // U_2:U_1:U_0 += rscratch1
>> // U_2:U_1:U_0 += (rscratch1 << 2)
>> 
>> i.e. any bits that are set from 130 upwards are masked off, treated as an 
>> integer in their own right, multiplied by 5 and the result added back in at 
>> the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].
>> 
>> I'm not sure whether this provides an opportunity for you to optimize this 
>> by doing the multiply by five earlier i.e. replace the code with this version
>> 
>>     // rscratch1 = (U2 >> 2) * 5
>>     __ lsr(rscratch1, U_2, 2);
>>     __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2);
>>     // U2 = U2[1:0]
>>     __ andr(U_2, U_2, (u8)3);
>>     // U2:U1:U0 += rscratch1
>>     __ adds(U_0, U_0, rscratch1);
>>     __ adcs(U_1, U_1, zr);
>>     __ adc(U_2, U_2, zr);
>> 
>> The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 
>> bits. Is that why you have implemented two add and carry steps? If so then 
>> why is it legitimate to do the multiply by 5 up front in the final reduction 
>> that follows the loop?
>
>> This comment and the next one both need correcting. They mention U_0HI and 
>> U_1HI and, as the previous comment says, those registers are dead.
>>
>> What actually happens here is best summarized as
>>
>> // U_2:U_1:U_0 += (U2 >> 2) * 5
>>
>> or, if we actually want to be clearer about the current encoding which does 
>> it in several steps
>>
>> // rscratch1 = (U2 >> 2)
>> // U2 = U2[1:0]
>> // U_2:U_1:U_0 += rscratch1
>> // U_2:U_1:U_0 += (rscratch1 << 2)
>>
>> i.e. any bits that are set from 130 upwards are masked off, treated as an 
>> integer in their own right, multiplied by 5 and the result added back in at 
>> the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].
> 
> OK.
> 
>> I'm not sure whether this provides an opportunity for you to optimize this 
>> by doing the multiply by five earlier i.e. replace the code with this version
> 
> I'm not sure either, which is why it's done in two separate steps. I think 
> you may be right, but it's a bit late to be optimizing this version any 
> further. That would require careful analysis and a redo of all the testing.
> 
>> The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 
>> bits. Is that why you have implemented two add and carry steps?
> 
> Indeed.
> 
>> If so then why is it legitimate to do the multiply by 5 up front in the 
>> final reduction that follows the loop?
> 
> I assume that you're referring to the multiply by 5 in
> 
> 
> // Further reduce modulo 2^130 - 5
>     __ lsr(rscratch1, U_2, 2);
>     __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2); // rscratch1 = U_2 * 5
> 
> `U_2`, at this point, has only a few low bits set. This is because `U_2` 
> was previously ANDed with 3, and was subsequently the target of 
> `adc(U_2, U_2, zr)` twice, each of which adds at most a carry. So I think 
> that `0 <= U_2 <= 5`.

Yes, of course, you are right that 0 <= U_2 < 6 at the point where that second 
multiply by 5 occurs (i.e. after the loop).
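That back-edge bound can be confirmed exhaustively over the possible inputs; here is a small illustrative Python sketch of the AND-then-two-adc reasoning (the names are just labels, not HotSpot code):

```python
# Illustrative check: after `andr(U_2, U_2, 3)` the register holds at
# most 3; each subsequent `adc(U_2, U_2, zr)` adds at most the incoming
# carry flag (0 or 1), so two of them give at most 3 + 1 + 1 = 5.
max_u2 = 0
for u2 in range(4):            # value after AND with 3
    for c1 in (0, 1):          # carry consumed by the first adc
        for c2 in (0, 1):      # carry consumed by the second adc
            max_u2 = max(max_u2, u2 + c1 + c2)
assert max_u2 == 5             # i.e. 0 <= U_2 < 6 at the loop back edge
```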

I believe it is safe to use the same optimization inside the loop, for reasons 
given below. Of course, it is a bit late to change this now and retest, but if 
my reasoning is correct then we could consider updating this post-release and, 
maybe, backporting it.
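For background, the folding trick itself is sound because 2^130 ≡ 5 (mod 2^130 - 5); an illustrative Python check of that identity, independent of the register-level encoding:

```python
# Illustrative check: bits at and above position 130 can be masked off,
# multiplied by 5 and added back in at the bottom, without changing the
# value modulo 2^130 - 5.
import random

P = (1 << 130) - 5

def fold(x):
    """One partial-reduction step: x -> x[129:0] + (x >> 130) * 5."""
    return (x & ((1 << 130) - 1)) + (x >> 130) * 5

for _ in range(1000):
    x = random.getrandbits(260)       # e.g. a double-width product
    assert fold(x) % P == x % P
```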

The only thing that needs to be determined is what value could sit in U_2 when 
we enter the loop. That's the only important case, because we already agreed 
that at the loop back edge 0 <= U_2 < 6.

The incoming value for U_2 at loop entry is derived by the following 
subsequence of the instruction stream:

      __ adcs(S_1, U_1, S_1);
      __ adc(S_2, U_2, zr);             // A.1
      __ add(S_2, S_2, 1);              // A.2
      . . .
      wide_mul(U_1, U_1HI, S_0, R_1);
      wide_madd(U_1, U_1HI, S_1, R_0);
      wide_madd(U_1, U_1HI, S_2, RR_1); // B
      . . .
      __ andr(U_2, R_0, 3);             // C
      __ mul(U_2, S_2, U_2);            // D
      . . .
      __ adc(U_2, U_1HI, U_2);          // E

At A.1 we know that 0 <= U_2 <= 3 (since it was initialized by unpack26), so 
after the adc 0 <= S_2 <= 4. So, at A.2 we know that 0 <= S_2 <= 5.

At B we know that 0 <= RR_1 <= (2^60 - 2^2) = FFFFFFF_FFFFFFFC (top 4 and 
bottom 2 bits of RR_1 are clear)
So 0 <= U_1HI < 5 * FFFFFFF_FFFFFFFC = 4FFFFFFF_FFFFFFEC

At C we know 0 <= U_2 <= 3

At D we know 0 <= U_2 <= 15 (decimal, i.e. F hex)

So at E we know that 0 <= U_2 <= 4FFFFFFF_FFFFFFEC + F + 1

So, the highest possible value for U_2 at loop entry is 4FFFFFFF_FFFFFFFC.

Clearly we can shift this down by two and multiply by 5 without any danger of 
overflowing 64 bits:

 (4FFFFFFF_FFFFFFFC >> 2) * 5 = 13FFFFFF_FFFFFFFF * 5 = 63FFFFFF_FFFFFFFB
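The chain of bounds above can be replayed in exact integer arithmetic; an illustrative Python sketch (constants taken from the analysis above, kept exact rather than register-width):

```python
# Illustrative replay of the loop-entry bound for U_2 in exact integers.
RR1_MAX = (1 << 60) - (1 << 2)        # top 4 and bottom 2 bits clear
U1HI_MAX = 5 * RR1_MAX                # bound on U_1HI at B
D_MAX = 3 * 5                         # U_2 <= 3 at C times S_2 <= 5, at D
U2_ENTRY_MAX = U1HI_MAX + D_MAX + 1   # adc at E adds at most one carry

# Shifting down by two and multiplying by five stays inside 64 bits,
# so the fused "multiply by 5 up front" form cannot overflow.
fused_max = (U2_ENTRY_MAX >> 2) * 5
assert fused_max < (1 << 64)
```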

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1214167215
