On Wed, 26 Oct 2022 20:45:57 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

>> (The first commit in this PR actually has the code without the check if 
>> anyone wants to measure.. well its also trivial to edit..)
>> 
>> I measured about 50% slowdown on 64 byte payloads. One could argue that 64 
>> bytes is not all that representative, but we don't get much out of assembler 
>> at that load either so it didn't seem worth it to figure out some sort of 
>> platform check. 
>> 
>> AVX512 needs at least 256 = 16 blocks.. there is overhead also 
>> pre-calculating powers of R that needs to be amortized. Assembler does fall 
>> back to 64-bit multiplies for <256, while the Java version will have to use 
>> the 32-bit multiplies. <256, purely scalar, non-vector, 64 vs 32 is not 
>> _that_ big an issue though; the algorithm is plenty happy with 26-bit limbs, 
>> and whatever the benefit of 64, it gets erased by the interface-matching 
>> code copying limbs in and out..
>> 
>> Right now, I measured 1k with `-XX:-UsePolyIntrinsics` to be about 10% 
>> slower. I think its acceptable, in order to get 18x?
>> 
>> Most/all of the slowdown comes from this need of copying limbs out/in.. I am 
>> looking at perhaps copying limbs out in the intrinsic instead. Not very 
>> 'pretty'.. limbs are hidden in a nested private class behind an interface.. 
>> I would be breaking what is a good design with neat encapsulation. (I 
>> accidentally forced-pushed that earlier, if you are curious; non-working). 
>> The current version of this code seems more robust in the long term?
>
> 10% is not a negligible impact.  I see your point about AVX512 reaping the 
> rewards of this change, but there are plenty of x86_64 systems without AVX512 
> that will be impacted, not to mention other platforms like aarch64 which (for 
> this change at least) will never see the benefits from the intrinsic.
> 
> I don't have any suggestions right at this moment for how this could be 
> streamlined at all to help reduce the pain for non-AVX512 systems.  Worth 
> looking into though.

One small thing maybe: It doesn't look like R in `processMultipleBlocks` and 
`rbytes` ever changes, so maybe there's no need to repeatedly 
serialize/deserialize them on every call to engineUpdate?  There is already an 
`r` that is attached to the object that is an IntegerModuloP.  Could that be 
used in `processMultipleBlocks` and perhaps a private byte[] for a serialized r 
is also a field in Poly1305 that can be passed into the intrinsic method rather 
than creating it every time?  It could be set in `setRSVals`.  Perhaps we can 
recover a little performance there?

-------------

PR: https://git.openjdk.org/jdk/pull/10582

Reply via email to