Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Andrew Haley Thu, 01 Jun 2023 09:09:50 -0700

On Thu, 1 Jun 2023 15:00:26 GMT, Andrew Haley <a...@openjdk.org> wrote:


> This comment and the next one both need correcting. They mention U_0HI and 
> U_1HI and, as the previous comment says, those registers are dead.
>
> What actually happens here is best summarized as
>
> // U_2:U_1:U_0 += (U2 >> 2) * 5
>
> or, if we actually want to be clearer about the current encoding which does 
> it in several steps
>
> // rscratch1 = (U2 >> 2) // U2 = U2[1:0] // U_2:U_1:U_0 += rscratch1 // 
> U_2:U_1:U_0 += (rscratch1 << 2)
>
> i.e. any bits that are set from 130 upwards are masked off, treated as an 
> integer in their own right, multiplied by 5 and the result added back in at 
> the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0].

OK.

> I'm not sure whether this provides an opportunity for you to optimize this by 
> doing the multiply by five earlier i.e. replace the code with this version

I'm not sure either, which is why it's done in two separate steps. I think you 
may be right, but it's a bit late to be optimizing this version any further. 
That would require careful analysis and a redo of all the testing.

> The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 
> bits. Is that why you have implemented two add and carry steps?

Indeed.

> If so then why is it legitimate to do the multiply by 5 up front in the final 
> reduction that follows the loop?

I assume that you're referring to the multiply by 5 in


// Further reduce modulo 2^130 - 5                                              
                                                                                
                                                                                
                                                                                
                             
    __ lsr(rscratch1, U_2, 2);
    __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2); // rscratch1 = 
U_2 * 5                                                                         
                                                                                
                                                                                
                              


`U_2`, at this point, has only a few lower set bits. This is because `U_2` was 
previously ANDed with 3, and subsequently was the target of adc(U_2, U_2, zr).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1213382344

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

Reply via email to