Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Andrew Dinn Mon, 18 May 2026 08:03:55 -0700

On Fri, 15 May 2026 09:52:20 GMT, Ferenc Rakoczi <[email protected]> wrote:


>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult() 
>> method and IntegerPolynomial.conditionalAssign(). Since 64-bit 
>> multiplication is not supported on Neon and manually performing this 
>> operation with 32-bit limbs is slower than with GPRs, a hybrid Neon/GPR 
>> approach is used. Neon instructions are used to compute intermediate values 
>> used in the last two iterations of the main "loop", while the GPRs compute 
>> the first few iterations. At the method level this improves performance by 
>> ~9% and at the API level roughly 5%.
>> 
>> 
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Ferenc Rakoczi has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Accepting more suggestions from Andrew Dinn.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7758:

> 7756:       __ lsr(tmp, lo, montMulP256Shift2);
> 7757:       __ orr(hi, hi, tmp);
> 7758:       __ andr(lo, lo, mask);

Suggestion:

      // compute 104-bit (40 + 64) full product
      __ umulh(hi, a, b);
      __ mul(lo, a, b);
      // combine 40 + 12 bits into hi result
      __ lsl(hi, hi, montMulP256Shift1);
      __ lsr(tmp, lo, montMulP256Shift2);
      __ orr(hi, hi, tmp);
      // mask off 52 bits of lo result
      __ andr(lo, lo, mask);
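
For readers who want to check the arithmetic behind those comments, here is a minimal scalar sketch of the same split. It assumes both operands are 52-bit limbs, that montMulP256Shift1 is 12, montMulP256Shift2 is 52, and that mask is 2^52 - 1; those values are inferred from the "40 + 12" and "52 bits" comments, not taken from the patch. The names kBitsPerLimb, kShift1, kShift2, kLimbMask, LimbProduct and mul_limbs are hypothetical, and unsigned __int128 is a GCC/Clang extension standing in for the umulh/mul pair.

```cpp
#include <cassert>
#include <cstdint>

// Assumed constants, inferred from the review comments rather than the patch.
constexpr int kBitsPerLimb = 52;
constexpr int kShift1 = 64 - kBitsPerLimb;  // 12, stands in for montMulP256Shift1
constexpr int kShift2 = kBitsPerLimb;       // 52, stands in for montMulP256Shift2
constexpr uint64_t kLimbMask = (uint64_t{1} << kBitsPerLimb) - 1;

// Hypothetical result type: the 104-bit product split into two 52-bit limbs.
struct LimbProduct {
  uint64_t hi;  // bits 52..103 of a * b
  uint64_t lo;  // bits 0..51 of a * b
};

// Hypothetical helper mirroring the umulh/mul/lsl/lsr/orr/andr sequence.
inline LimbProduct mul_limbs(uint64_t a, uint64_t b) {
  // Both inputs must fit in one limb, so the product fits in 104 bits.
  assert(a <= kLimbMask && b <= kLimbMask);

  // compute 104-bit (40 + 64) full product
  unsigned __int128 full = static_cast<unsigned __int128>(a) * b;
  uint64_t hi = static_cast<uint64_t>(full >> 64);  // umulh: upper 40 bits
  uint64_t lo = static_cast<uint64_t>(full);        // mul: lower 64 bits

  // combine 40 + 12 bits into hi result
  hi <<= kShift1;                // lsl
  uint64_t tmp = lo >> kShift2;  // lsr: top 12 bits of lo
  hi |= tmp;                     // orr

  // mask off 52 bits of lo result
  lo &= kLimbMask;               // andr
  return {hi, lo};
}
```

Under these assumptions, hi * 2^52 + lo equals the full product a * b, which is the invariant the comments in the suggestion are meant to document.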

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3259892902
