On Wed, 26 Oct 2022 15:47:08 GMT, vpaprotsk <d...@openjdk.org> wrote:
>> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 171: >> >>> 169: } >>> 170: >>> 171: if (len >= 1024) { >> >> Out of curiosity, do you have any perf numbers for the impact of this change >> on systems that do not support AVX512? Does this help or hurt (or make a >> negligible impact) on poly1305 updates when the input is 1K or larger? > > (The first commit in this PR actually has the code without the check if > anyone wants to measure.. well its also trivial to edit..) > > I measured about 50% slowdown on 64 byte payloads. One could argue that 64 > bytes is not all that representative, but we don't get much out of assembler > at that load either so it didn't seem worth it to figure out some sort of > platform check. > > AVX512 needs at least 256 = 16 blocks.. there is overhead also > pre-calculating powers of R that needs to be amortized. Assembler does fall > back to 64-bit multiplies for <256, while the Java version will have to use > the 32-bit multiplies. <256, purely scalar, non-vector, 64 vs 32 is not > _that_ big an issue though; the algorithm is plenty happy with 26-bit limbs, > and whatever the benefit of 64, it gets erased by the interface-matching code > copying limbs in and out.. > > Right now, I measured 1k with `-XX:-UsePolyIntrinsics` to be about 10% > slower. I think its acceptable, in order to get 18x? > > Most/all of the slowdown comes from this need of copying limbs out/in.. I am > looking at perhaps copying limbs out in the intrinsic instead. Not very > 'pretty'.. limbs are hidden in a nested private class behind an interface.. I > would be breaking what is a good design with neat encapsulation. (I > accidentally forced-pushed that earlier, if you are curious; non-working). > The current version of this code seems more robust in the long term? 10% is not a negligible impact. I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic. I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems. Worth looking into though. ------------- PR: https://git.openjdk.org/jdk/pull/10582