On Tue, 23 Jun 2026 03:35:58 GMT, Shawn Emery <[email protected]> wrote:

>> Curve25519 polynomial arithmetic is performed with intrinsics implemented 
>> in GPR-related instructions for multiplication operations (method mult()). 
>> Benchmark improvements include:
>> 
>> X25519 decapsulation: +9%
>> X25519 encapsulation: +9%
>> X25519 key agreement: +7%
>> X25519 key-pair generation: +10%
>> X25519-MLKEM decapsulation: +7%
>> X25519-MLKEM encapsulation: +8%
>> X25519-MLKEM key-pair generation: +8%
>> EdDSA sign: +12%
>> EdDSA verify: +12%
>> EdDSA key-pair generation: +15%
>> 
>> Note 1: Unlike the x86_64 implementation, the AArch64 implementation does 
>> not include a square() intrinsic; using one caused a 3.3% performance 
>> regression due to the efficiencies of the symmetric squaring shape in Java 
>> vs. the inefficiencies of the leaf calls and the additional cycles required 
>> for 64-bit multiplication on AArch64.
>> Note 2: Using GPR-related instructions throughout outperformed a hybrid 
>> solution (GPR-related instructions for the first two iterations and Neon 
>> instructions for the last two iterations). The hybrid design was 4%/1% 
>> slower in KEM decapsulation/encapsulation than the GPR-only version, 
>> because the overhead of the limb splits and reconstruction outweighed the 
>> gains from SIMD parallelism.
>> 
>> ---------
>> - [X] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Shawn Emery has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains five additional 
> commits since the last revision:
> 
>  - Update based on shipilev's comments
>  - Merge with mainline
>  - Update based on adinn's comments
>  - Merge with master branch
>  - 8385304: X25519 should utilize aarch64 intrinsics

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 12926:

> 12924: 
> 12925:     if (UseIntPoly25519Intrinsics) {
> 12926:       StubRoutines::_intpoly_mult_25519 = 
> generate_intpoly_mult_25519();

I am looking at the x86 code for this, and that architecture implements both 
_mult_ and _square_ intrinsics. First of all, this is inconsistent. Second, are 
we leaving performance on the table here?
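
For context on the "symmetric squaring shape" mentioned in Note 1, here is a rough Java sketch of why a dedicated square can do less work than a general multiply. It is illustrative only and is not the JDK's field arithmetic: the class name `SquareShapeSketch`, the limb values, and the limb count are hypothetical, inputs are assumed non-empty and small enough not to overflow a `long`, and carry propagation and modular reduction are left out entirely.

```java
import java.util.Arrays;

public class SquareShapeSketch {

    // General schoolbook product of two limb arrays:
    // n * n limb multiplications for n limbs each.
    static long[] mult(long[] a, long[] b) {
        long[] r = new long[a.length + b.length - 1];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                r[i + j] += a[i] * b[j];
            }
        }
        return r;
    }

    // Symmetric squaring: a[i]*a[j] equals a[j]*a[i], so each cross term
    // is computed once and doubled. That is n * (n + 1) / 2 limb
    // multiplications instead of n * n.
    static long[] square(long[] a) {
        long[] r = new long[2 * a.length - 1];
        for (int i = 0; i < a.length; i++) {
            r[2 * i] += a[i] * a[i];            // diagonal term
            for (int j = i + 1; j < a.length; j++) {
                r[i + j] += 2 * (a[i] * a[j]);  // doubled cross term
            }
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = {3, 5, 7, 11, 13};           // hypothetical limb values
        // Both paths must produce the same unreduced column sums.
        System.out.println(Arrays.equals(mult(a, a), square(a)));
    }
}
```

Whether that saving survives the cost of a leaf call into a stub is exactly what the benchmark numbers have to answer; the sketch only shows where the saving would come from.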

src/hotspot/cpu/aarch64/vm_version_aarch64.cpp line 660:

> 658:   }
> 659: 
> 660:   if (FLAG_IS_DEFAULT(UseIntPoly25519Intrinsics)) {

One more thing: does this flag belong in `AOTCODECACHE_CONFIGS_AARCH64_DO`?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31409#discussion_r3458906949
PR Review Comment: https://git.openjdk.org/jdk/pull/31409#discussion_r3457602871
