On Tue, 23 Jun 2026 03:35:58 GMT, Shawn Emery <[email protected]> wrote:

>> Curve25519 polynomial arithmetic is performed with intrinsics implemented
>> in GPR related instructions for multiplication operations (method mult()).
>> Benchmark improvements include:
>>
>> X25519 decapsulation: +9%
>> X25519 encapsulation: +9%
>> X25519 key agreement: +7%
>> X25519 key-pair generation: +10%
>> X25519-MLKEM decapsulation: +7%
>> X25519-MLKEM encapsulation: +8%
>> X25519-MLKEM key-pair generation: +8%
>> EdDSA sign: +12%
>> EdDSA verify: +12%
>> EdDSA key-pair generation: +15%
>>
>> Note 1: The difference between Aarch64 vs. x86_64 intrinsics implementation
>> include the lack of square() intrinsics; usage caused a 3.3% performance
>> regression due to the efficiencies of the symmetric squaring shape in Java
>> vs. the inefficiencies of the leaf calls and the additional cycles required
>> for 64 bit multiplication in Aarch64.
>> Note 2: The GPR related instructions were optimal when compared to hybrid
>> (GPR related instructions for the first two iterations and Neon instructions
>> for the last two iterations) solution. This design produced a -4%/-1%
>> performance drop in KEM decapsulation/encapsulation compared to the GPR
>> related instructions where the overhead of performing the limb splits and
>> reconstruction did not compensate enough for the efficiencies of SIMD
>> parallelism.
>>
>> ---------
>> - [X] I confirm that I make this contribution in accordance with the
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Shawn Emery has updated the pull request with a new target base due to a
> merge or a rebase. The incremental webrev excludes the unrelated changes
> brought in by the merge/rebase. The pull request contains five additional
> commits since the last revision:
>
>  - Update based on shipilev's comments
>  - Merge with mainline
>  - Update based on adinn's comments
>  - Merge with master branch
>  - 8385304: X25519 should utilize aarch64 intrinsics

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 12926:

> 12924:
> 12925:   if (UseIntPoly25519Intrinsics) {
> 12926:     StubRoutines::_intpoly_mult_25519 = generate_intpoly_mult_25519();

I am looking at x86 code for this, and that architecture implements both _mult_ and _square_ intrinsics. First of all, this is inconsistent. But second, are we leaving the actual performance on the table here?

src/hotspot/cpu/aarch64/vm_version_aarch64.cpp line 660:

> 658:   }
> 659:
> 660:   if (FLAG_IS_DEFAULT(UseIntPoly25519Intrinsics)) {

One more thing: see if this flag belongs in `AOTCODECACHE_CONFIGS_AARCH64_DO`?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31409#discussion_r3458906949
PR Review Comment: https://git.openjdk.org/jdk/pull/31409#discussion_r3457602871
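
For readers following the mult/square question: the "symmetric squaring shape" mentioned in Note 1 of the PR description can be illustrated with a small sketch. This is not the JDK's IntegerPolynomial code and not the intrinsic; the class name, the 5-limb width, and the multiplication counter are assumptions made purely for illustration, and carries and modular reduction are left out. It only shows that squaring can compute each cross term once and double it, so it needs fewer limb multiplications than a general product.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the JDK.
public class LimbSquareSketch {
    // Counts limb multiplications so the two shapes can be compared.
    static int mulCount = 0;

    // General schoolbook product: n * n limb multiplications.
    static long[] mult(long[] a, long[] b) {
        long[] r = new long[a.length + b.length - 1];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                r[i + j] += a[i] * b[j];
                mulCount++;
            }
        }
        return r;
    }

    // Symmetric squaring: each cross term a[i] * a[j] with i < j is
    // computed once and doubled, giving n * (n + 1) / 2 multiplications.
    static long[] square(long[] a) {
        long[] r = new long[2 * a.length - 1];
        for (int i = 0; i < a.length; i++) {
            r[2 * i] += a[i] * a[i];
            mulCount++;
            for (int j = i + 1; j < a.length; j++) {
                r[i + j] += 2 * (a[i] * a[j]);
                mulCount++;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = {3, 1, 4, 1, 5}; // small assumed limb values, no overflow

        mulCount = 0;
        long[] viaMult = mult(a, a);
        int multCost = mulCount;

        mulCount = 0;
        long[] viaSquare = square(a);
        int squareCost = mulCount;

        System.out.println("same result: " + Arrays.equals(viaMult, viaSquare));
        System.out.println("limb multiplications: " + multCost + " vs " + squareCost);
    }
}
```

With five limbs the general product performs 25 limb multiplications and the symmetric square performs 15. Whether that saving survives the leaf-call overhead on a given architecture is the measurement question raised in the review, and this sketch says nothing about it.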
