On Thu, 5 Feb 2026 21:36:09 GMT, Ben Perez <[email protected]> wrote:

>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` 
>> method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit 
>> multiplication is not supported on NEON, and performing it manually with 
>> 32-bit limbs is slower than using GPRs, a hybrid NEON/GPR approach is used: 
>> NEON instructions compute the intermediate values used in the last two 
>> iterations of the main "loop", while the GPRs compute the first few 
>> iterations. At the method level this improves performance by ~9%, and at the 
>> API level by roughly 5%. 
>> 
>> Performance without intrinsic (Apple M1):
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256               thrpt   40   8439.881 ±  29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256               thrpt   40   7990.614 ±  30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256               thrpt   40   2677.737 ±   8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256               thrpt   40   2619.297 ±   9.737  ops/s
>> 
>> Benchmark                                         (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret                      ECDH          256              EC              thrpt   40  1905.369 ±  3.745  ops/s
>> 
>> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>> 
>> 
>> Performance with intrinsic (Apple M1):
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8...
>
> Ben Perez has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fixed indexing bug in vs_ldpq, simplified vector loads in generate_intpoly_assign()

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7964:

> 7962:     __ BIND(L_Length14);
> 7963:     {
> 7964:       Register a10 = r5;

It might be nice if these general-purpose register operations could be 
condensed using, e.g., a template type RSeq<N> and rs_xxx methods, as has been 
done with the vector register operations. Even better would be to implement 
RSeq and VSeq as subtypes of a common template type Seq<N, R>, with R bound to 
Register or FloatRegister as a type parameter.

I'm not suggesting that for this PR, but we should look into it via a 
follow-up PR. A rough sketch of what the common type might look like is below.
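
Just to make the shape of that follow-up concrete, something along these lines 
might work. This is an untested sketch only: it assumes the base/delta layout 
that the existing VSeq<N> already uses, plus the as_Register/as_FloatRegister 
helpers from register_aarch64.hpp; the Seq and RSeq names are hypothetical.


// Untested sketch: a common carrier for "base register + stride", with the
// existing VSeq and a new RSeq as thin subtypes. Only operator[] needs to
// know how to materialise the i-th register of the sequence.
template<int N, typename R>
class Seq {
  static_assert(N > 0, "sequence length must be positive");
protected:
  R   _base;   // first register of the sequence
  int _delta;  // register-number step between consecutive registers
public:
  Seq(R base, int delta = 1) : _base(base), _delta(delta) {}
  R base() const    { return _base; }
  int delta() const { return _delta; }
};

template<int N>
class VSeq : public Seq<N, FloatRegister> {
public:
  using Seq<N, FloatRegister>::Seq;
  FloatRegister operator[](int i) const {
    assert(0 <= i && i < N, "index out of bounds");
    return as_FloatRegister(this->_base->encoding() + i * this->_delta);
  }
};

template<int N>
class RSeq : public Seq<N, Register> {
public:
  using Seq<N, Register>::Seq;
  Register operator[](int i) const {
    assert(0 <= i && i < N, "index out of bounds");
    return as_Register(this->_base->encoding() + i * this->_delta);
  }
};


The stub generator could then walk the GPR limbs with the same sequence-based 
helper style (rs_xxx) that the vector code already uses for vs_xxx.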

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7988:

> 7986:       __ ld1(a_vec[0], __ T2D, aLimbs);
> 7987:       __ ldpq(a_vec[1], a_vec[2], Address(aLimbs, 16));
> 7988:       __ ldpq(a_vec[3], a_vec[4], Address(aLimbs, 48));

I notice that here and elsewhere you have a 5-vector sequence and hence are not 
using the vs_ldpq/stpq operations (because they only operate on even-length 
sequences). However, if you add a bit of extra 'apparatus' to 
register_aarch64.hpp you can then use the vs_ldpq/stpq operations.

Your code processes the first register individually via ld1/st1 and then the 
remaining registers using paired loads, i.e. it operates as if the latter were 
a VSeq<4>. So, in register_aarch64.hpp you can add these functions:


// first register of a sequence
template<int N>
FloatRegister vs_head(const VSeq<N>& v) {
  static_assert(N > 1, "sequence length must be greater than 1");
  return v.base();
}

// remaining N-1 registers as a sequence (takes VSeq<N> so N can be deduced
// at the call site)
template<int N>
VSeq<N-1> vs_tail(const VSeq<N>& v) {
  static_assert(N > 1, "sequence length must be greater than 1");
  return VSeq<N-1>(v.base() + v.delta(), v.delta());
}

With those methods available you should be able to do all of these VSeq<5> 
loads and stores using an ld1/st1 followed by a vs_ldpq_indexed or 
vs_stpq_indexed with a suitable start index and the same constant offset 
array, e.g. here you could use
Suggestion:

      int offsets[2] = { 0, 32 };
      __ ld1(vs_head(a_vec), __ T2D, aLimbs);
      vs_ldpq_indexed(vs_tail(a_vec), aLimbs, 16, offsets);
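
The stores can then mirror this. Assuming vs_stpq_indexed takes its arguments 
in the same shape as vs_ldpq_indexed (I haven't double-checked that), and 
reusing the a_vec/aLimbs names purely for illustration:

      int offsets[2] = { 0, 32 };
      __ st1(vs_head(a_vec), __ T2D, aLimbs);
      vs_stpq_indexed(vs_tail(a_vec), aLimbs, 16, offsets);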

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2782146418
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2782125440
