Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-06-02 Thread Andrew Dinn
On Fri, 2 Jun 2023 09:58:59 GMT, Andrew Dinn wrote: >> Yes, of course, you are right that 0 <= U_2 < 6 at the point where that >> second multiply by 5 occurs (i.e. after the loop). >> >> I believe it is safe to use the same optimization inside the loop for >> reasons given below. Of course it

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-06-02 Thread Andrew Dinn
On Fri, 2 Jun 2023 09:51:57 GMT, Andrew Dinn wrote: >>> This comment and the next one both need correcting. They mention U_0HI and >>> U_1HI and, as the previous comment says, those registers are dead. >>> >>> What actually happens here is best summarized as >>> >>> // U_2:U_1:U_0 += (U2 >> 2) *

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-06-02 Thread Andrew Dinn
On Thu, 1 Jun 2023 16:06:40 GMT, Andrew Haley wrote: >> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7135: >> >>> 7133: regs = (regs.remaining() + U_0HI + U_1HI).begin(); >>> 7134: >>> 7135: // U_2:U_1:U_0 += (U_1HI >> 2) >> >> This comment and the next one both need corr

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-06-01 Thread Andrew Haley
On Thu, 1 Jun 2023 15:00:26 GMT, Andrew Haley wrote: > This comment and the next one both need correcting. They mention U_0HI and > U_1HI and, as the previous comment says, those registers are dead. > > What actually happens here is best summarized as > > // U_2:U_1:U_0 += (U2 >> 2) * 5 > > or,

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-06-01 Thread Andrew Haley
On Thu, 1 Jun 2023 12:16:45 GMT, Andrew Dinn wrote: > This comment and the next one both need correcting. They mention U_0HI and > U_1HI and, as the previous comment says, those registers are dead. > > What actually happens here is best summarized as > > // U_2:U_1:U_0 += (U2 >> 2) * 5 > > or, i
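
For readers following the quoted comment `U_2:U_1:U_0 += (U2 >> 2) * 5`: Poly1305 works modulo 2^130 - 5, so 2^130 is congruent to 5 and any bits at or above position 130 can be folded back in by multiplying them by 5. The sketch below is only an illustration in Python, not the stub generator code. It assumes a three-limb accumulator of 64-bit words (inferred from the register names) and assumes that the top limb is masked to its low two bits when the fold is applied, which the truncated snippet does not show. The function name `partial_reduce` is hypothetical.

```python
# Poly1305 prime: 2**130 - 5, so 2**130 is congruent to 5 (mod P).
P = (1 << 130) - 5
MASK64 = (1 << 64) - 1


def partial_reduce(u0, u1, u2):
    """Fold the bits at or above 2**130 back into the low limbs.

    u0, u1, u2 are 64-bit limbs, least significant first (an assumption
    about the layout, based on the names U_0, U_1, U_2 in the thread).
    """
    acc = (u2 << 128) | (u1 << 64) | u0
    high = u2 >> 2                    # everything from bit 130 upwards
    low = acc & ((1 << 130) - 1)      # assumed step: U_2 &= 3
    folded = low + high * 5           # U_2:U_1:U_0 += (U_2 >> 2) * 5
    return folded & MASK64, (folded >> 64) & MASK64, folded >> 128


if __name__ == "__main__":
    limbs = (MASK64, MASK64, MASK64)
    before = (limbs[2] << 128) | (limbs[1] << 64) | limbs[0]
    r0, r1, r2 = partial_reduce(*limbs)
    after = (r2 << 128) | (r1 << 64) | r0
    # The value is unchanged modulo P, but the top limb is now tiny.
    print(after % P == before % P, r2)
```

The result is only partially reduced: it is congruent to the input modulo 2^130 - 5, and the top limb shrinks to a few bits, but the value may still exceed the modulus, so a final full reduction is still needed elsewhere.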

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-06-01 Thread Andrew Dinn
On Wed, 24 May 2023 16:17:14 GMT, Andrew Haley wrote: >> This provides a solid speedup of about 3-4x over the Java implementation. >> >> I have a vectorized version of this which uses a bunch of tricks to speed it >> up, but it's complex and can still be improved. We're getting close to ramp >

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-05-26 Thread Andrew Haley
On Wed, 24 May 2023 19:16:36 GMT, Claes Redestad wrote:

> Thanks for your patience in answering my questions and addressing my comments.

Thank you for asking questions that made the patch better, and even removed an instruction in what I thought was a tightly-written intrinsic! -

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-05-24 Thread Claes Redestad
On Wed, 24 May 2023 16:17:14 GMT, Andrew Haley wrote: >> This provides a solid speedup of about 3-4x over the Java implementation. >> >> I have a vectorized version of this which uses a bunch of tricks to speed it >> up, but it's complex and can still be improved. We're getting close to ramp >

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-05-24 Thread Andrew Haley
> This provides a solid speedup of about 3-4x over the Java implementation. > > I have a vectorized version of this which uses a bunch of tricks to speed it > up, but it's complex and can still be improved. We're getting close to ramp > down, so I'm submitting this simple intrinsic so that we ca