On Wed, 26 Oct 2022 20:45:57 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:
>> (The first commit in this PR actually has the code without the check if >> anyone wants to measure.. well its also trivial to edit..) >> >> I measured about 50% slowdown on 64 byte payloads. One could argue that 64 >> bytes is not all that representative, but we don't get much out of assembler >> at that load either so it didn't seem worth it to figure out some sort of >> platform check. >> >> AVX512 needs at least 256 = 16 blocks.. there is overhead also >> pre-calculating powers of R that needs to be amortized. Assembler does fall >> back to 64-bit multiplies for <256, while the Java version will have to use >> the 32-bit multiplies. <256, purely scalar, non-vector, 64 vs 32 is not >> _that_ big an issue though; the algorithm is plenty happy with 26-bit limbs, >> and whatever the benefit of 64, it gets erased by the interface-matching >> code copying limbs in and out.. >> >> Right now, I measured 1k with `-XX:-UsePolyIntrinsics` to be about 10% >> slower. I think its acceptable, in order to get 18x? >> >> Most/all of the slowdown comes from this need of copying limbs out/in.. I am >> looking at perhaps copying limbs out in the intrinsic instead. Not very >> 'pretty'.. limbs are hidden in a nested private class behind an interface.. >> I would be breaking what is a good design with neat encapsulation. (I >> accidentally forced-pushed that earlier, if you are curious; non-working). >> The current version of this code seems more robust in the long term? > > 10% is not a negligible impact. I see your point about AVX512 reaping the > rewards of this change, but there are plenty of x86_64 systems without AVX512 > that will be impacted, not to mention other platforms like aarch64 which (for > this change at least) will never see the benefits from the intrinsic. > > I don't have any suggestions right at this moment for how this could be > streamlined at all to help reduce the pain for non-AVX512 systems. Worth > looking into though. One small thing maybe: It doesn't look like R in `processMultipleBlocks` and `rbytes` ever changes, so maybe there's no need to repeatedly serialize/deserialize them on every call to engineUpdate? There is already an `r` that is attached to the object that is an IntegerModuloP. Could that be used in `processMultipleBlocks` and perhaps a private byte[] for a serialized r is also a field in Poly1305 that can be passed into the intrinsic method rather than creating it every time? It could be set in `setRSVals`. Perhaps we can recover a little performance there? ------------- PR: https://git.openjdk.org/jdk/pull/10582