Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Andrew Haley
On Wed, 24 May 2023 13:39:10 GMT, Claes Redestad wrote:

>> See https://loup-vaillant.fr/tutorials/poly1305-design for more explanation
>
> Thanks for the link!
>
> So `r` refers to the value passed via `r_start` and it wasn't clear from the
> immediate context that `r_start` is already split i
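For reference, the design described in the linked tutorial (and RFC 8439) boils down to a per-block accumulator update h = (h + n) * r mod 2^130 - 5, where `r` is the clamped half of the key. Below is a minimal BigInteger sketch of that update for orientation only; it is not the PR's intrinsic, and the class and method names (`Poly1305Sketch`, `processBlock`) are made up here.

```java
import java.math.BigInteger;

// Reference-style sketch of the Poly1305 per-block step, not the intrinsic.
// P is the prime 2^130 - 5; r is the clamped half of the key; each full
// 16-byte block is read little-endian with an extra 0x01 "hibit" byte set.
public class Poly1305Sketch {
    private static final BigInteger P =
            BigInteger.ONE.shiftLeft(130).subtract(BigInteger.valueOf(5));

    static BigInteger processBlock(BigInteger h, BigInteger r, byte[] block16) {
        // Interpret the block as a little-endian integer and set bit 128.
        byte[] be = new byte[17];
        for (int i = 0; i < 16; i++) {
            be[16 - i] = block16[i];
        }
        be[0] = 1;                              // the appended 0x01 hibit
        BigInteger n = new BigInteger(1, be);
        return h.add(n).multiply(r).mod(P);     // h = (h + n) * r mod 2^130 - 5
    }
}
```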

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Claes Redestad
On Wed, 24 May 2023 11:08:31 GMT, Andrew Haley wrote:

>> No, it doesn't break the invariants.
>>
>> R is the randomly-chosen 128-bit key. It is generated from an initial
>> 128-bit-long string of random bits, then
>> `r &= 0x0ffffffc0ffffffc0ffffffc0fffffff`
>>
>> This 128-bit-long string is sp
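For context, the clamping step being described is the one from RFC 8439: the low 16 bytes of the key become `r`, with the top four bits of bytes 3, 7, 11 and 15 and the low two bits of bytes 4, 8 and 12 forced to zero, so that multi-word multiplication by `r` stays within comfortable bounds. A hedged Java sketch, assuming the input is the 16 little-endian bytes that `r_start` points at (the helper name `clampR` is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustration only: the RFC 8439 clamping step applied to a 128-bit r,
// held here as two 64-bit words.
public class Poly1305Clamp {
    // Returns r as {low 64 bits, high 64 bits} after clamping.
    static long[] clampR(byte[] rStart16) {
        ByteBuffer bb = ByteBuffer.wrap(rStart16).order(ByteOrder.LITTLE_ENDIAN);
        long r0 = bb.getLong();            // bytes 0..7
        long r1 = bb.getLong();            // bytes 8..15
        // r &= 0x0ffffffc0ffffffc0ffffffc0fffffff, split across two words:
        r0 &= 0x0ffffffc0fffffffL;         // low half of the 128-bit mask
        r1 &= 0x0ffffffc0ffffffcL;         // high half of the 128-bit mask
        return new long[] { r0, r1 };
    }
}
```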

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Andrew Haley
On Wed, 24 May 2023 10:18:39 GMT, Andrew Haley wrote:

>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7097:
>>
>>> 7095: // together partial products without any risk of needing to
>>> 7096: // propagate a carry out.
>>> 7097: wide_mul(U_0, U_0HI, S_0, R_0); wide_mad
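As a rough model of what a `wide_mul` / `wide_madd` pair does, here is a Java sketch (JDK 18+, for `Math.unsignedMultiplyHigh`) of a 64x64 -> 128-bit multiply and a multiply-accumulate into a 128-bit {hi, lo} pair. The class, field and method names are illustrative, not the PR's; the point of the quoted comment is that the operands are kept small enough that the high-word addition can never carry out.

```java
// Rough analog of a widening multiply and multiply-accumulate on a
// 128-bit accumulator held as two 64-bit words {hi:lo}.
public class WideMulSketch {
    static long lo, hi;                          // 128-bit accumulator

    static void wideMul(long a, long b) {
        lo = a * b;                              // low 64 bits of the product
        hi = Math.unsignedMultiplyHigh(a, b);    // high 64 bits of the product
    }

    static void wideMadd(long a, long b) {
        long pLo = a * b;
        long pHi = Math.unsignedMultiplyHigh(a, b);
        long newLo = lo + pLo;
        // Unsigned overflow of the low word produces a carry into the high word.
        long carry = Long.compareUnsigned(newLo, lo) < 0 ? 1 : 0;
        lo = newLo;
        hi = hi + pHi + carry;   // caller guarantees this addition cannot overflow
    }
}
```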

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Andrew Haley
On Wed, 24 May 2023 10:07:47 GMT, Claes Redestad wrote:

>> Andrew Haley has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Whitespace
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7097:
>
>> 7095: // together partial pro

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Claes Redestad
On Wed, 24 May 2023 09:25:06 GMT, Andrew Haley wrote:

>> This provides a solid speedup of about 3-4x over the Java implementation.
>>
>> I have a vectorized version of this which uses a bunch of tricks to speed it
>> up, but it's complex and can still be improved. We're getting close to ramp

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Andrew Haley
> This provides a solid speedup of about 3-4x over the Java implementation.
>
> I have a vectorized version of this which uses a bunch of tricks to speed it
> up, but it's complex and can still be improved. We're getting close to ramp
> down, so I'm submitting this simple intrinsic so that we ca