Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v11]

Ferenc Rakoczi Tue, 23 Jun 2026 05:23:10 -0700

On Thu, 18 Jun 2026 16:18:29 GMT, Ferenc Rakoczi <[email protected]> wrote:


>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult() 
>> method and IntegerPolynomial.conditionalAssign(). Since 64-bit 
>> multiplication is not supported on Neon and manually performing this 
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr 
>> approach is used. Neon instructions are used to compute intermediate values 
>> used in the last two iterations of the main "loop", while the GPRs compute 
>> the first few iterations. At the method level this improves performance by 
>> ~9% and at the API level roughly 5%.
>> 
>> 
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Ferenc Rakoczi has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Unite x86 and aarch64 for UseIntPolyIntrinsics for AOTCache.

I think it is rather unfortunate that this method was added to this 
microbenchmark suite: its contribution to the run time of any real crypto 
operation is minimal, so it makes almost no difference even if it runs twice 
as fast. However, it is important that it runs in constant time, i.e. its 
running time is independent of the values in its input arrays and, more 
importantly, of whether the "set" argument is 0 or 1. The Java code was 
written that way, but there is no guarantee that the compiler will not turn it 
back into a branch instead of the XORs if it can figure out that only those 
two values are possible for "set". So the intrinsic here is more about 
guaranteeing execution that is independent of the "set" value than about any 
performance gain.
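
For illustration, here is a minimal sketch of the branch-free, XOR-based 
pattern I mean. It assumes "set" is always 0 or 1 and that both arrays have 
the same length. The class name is made up for this example and the code is 
not copied from the JDK source.

```java
// Hypothetical stand-alone sketch; not the actual JDK implementation.
class ConditionalAssignSketch {

    // Copies b into a when set == 1 and leaves a unchanged when set == 0,
    // without branching on set. Assumes set is 0 or 1 and
    // a.length == b.length.
    static void conditionalAssign(int set, long[] a, long[] b) {
        // set == 1 gives an all-ones mask, set == 0 gives an all-zero mask.
        long mask = -(long) set;
        for (int i = 0; i < a.length; i++) {
            // (a[i] ^ b[i]) is the bit difference between the two limbs;
            // the mask either keeps it or zeroes it.
            long diff = mask & (a[i] ^ b[i]);
            // XORing the difference into a[i] yields b[i]; XORing 0 is a no-op.
            a[i] ^= diff;
        }
    }

    public static void main(String[] args) {
        long[] a = {1L, 2L, 3L};
        long[] b = {7L, 8L, 9L};
        conditionalAssign(0, a, b); // a is left unchanged
        conditionalAssign(1, a, b); // a now holds the values of b
        System.out.println(java.util.Arrays.toString(a));
    }
}
```

Both values of "set" execute the same sequence of loads, XORs, ANDs and 
stores at the source level. That is exactly the property a compiler is free to 
break if it proves "set" can only be 0 or 1, which is why the intrinsic pins 
it down.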

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30941#issuecomment-4779091715
