On Wed, 4 Feb 2026 20:52:15 GMT, Ben Perez <[email protected]> wrote:

>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` 
>> method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit 
>> multiplication is not supported on NEON, and manually performing this 
>> operation with 32-bit limbs is slower than with GPRs, a hybrid NEON/GPR 
>> approach is used: NEON instructions compute the intermediate values used 
>> in the last two iterations of the main "loop", while the GPRs compute the 
>> first few iterations. At the method level this improves performance by 
>> ~9%, and at the API level by roughly 5%. 
>> 
>> Performance without the intrinsic (Apple M1):
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>> 
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40   8439.881 ±  29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40   7990.614 ±  30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40   2677.737 ±   8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40   2619.297 ±   9.737  ops/s
>> 
>> Benchmark                                         (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
>> KeyAgreementBench.EC.generateSecret                      ECDH          256               EC              thrpt   40  1905.369 ±  3.745  ops/s
>> 
>> Benchmark                             (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>> 
>> 
>> Performance with the intrinsic (Apple M1):
>> 
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8...
>
> Ben Perez has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Created subroutine for 32 bit vector multiplication

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3181:

> 3179:   void umullv(FloatRegister Vd, SIMD_Arrangement Ta, FloatRegister Vn,
> 3180:                SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3181:     assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");

umullv -> umull{2}v in the assertion message (or consider moving the assertions 
into the calling function)

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3182:

> 3180:                SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3181:     assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");
> 3182:     assert(Ta == T4S ? (Tb == T4H && Ts == H) : (Tb == T2S && Ts == S), "umullv register arrangements must adhere to spec");

umullv -> umull{2}v in the assertion message (or consider moving the assertions 
into the calling function)

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3188:

> 3186:   void umull2v(FloatRegister Vd, SIMD_Arrangement Ta, FloatRegister Vn,
> 3187:                SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3188:     assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");

umullv -> umull2v in the assertion

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3189:

> 3187:                SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant Ts, int lane) {
> 3188:     assert(Ta == T4S || Ta == T2D, "umullv destination register must have arrangement T4S or T2D");
> 3189:     assert(Ta == T4S ? (Tb == T8H && Ts == H) : (Tb == T4S && Ts == S), "umullv register arrangements must adhere to spec");

umullv -> umull2v in the assertion
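
As background for these assertion comments, here is a scalar Java model 
(illustrative only, not the stub code) of what the two instructions compute 
in the 2S/2D case. It also shows why the expected Tb differs between the two 
assertions: UMULL widens the low 32-bit lanes of the source vector, while 
UMULL2 reads the high ones.

    // Model of UMULL Vd.2D, Vn.2S, Vm.S[lane] and UMULL2 Vd.2D, Vn.4S,
    // Vm.S[lane]: unsigned widening multiply of 32-bit lanes by one element.
    public class UmullModel {
        // umull: low two 32-bit lanes of n, each times m[lane], widened to 64 bits
        static long[] umull(int[] n, int[] m, int lane) {
            long k = Integer.toUnsignedLong(m[lane]);
            return new long[] { Integer.toUnsignedLong(n[0]) * k,
                                Integer.toUnsignedLong(n[1]) * k };
        }

        // umull2: same, but reading the high two 32-bit lanes of n
        static long[] umull2(int[] n, int[] m, int lane) {
            long k = Integer.toUnsignedLong(m[lane]);
            return new long[] { Integer.toUnsignedLong(n[2]) * k,
                                Integer.toUnsignedLong(n[3]) * k };
        }

        public static void main(String[] args) {
            int[] n = { -1, 2, 3, 4 };   // lane 0 holds 0xFFFFFFFF unsigned
            int[] m = { 7, 0, 0, 0 };
            // Prints [30064771065, 14]: 0xFFFFFFFF * 7 and 2 * 7, widened to 64 bits
            System.out.println(java.util.Arrays.toString(umull(n, m, 0)));
        }
    }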

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7193:

> 7191: 
> 7192:   // Multiply each 32-bit value in bs by the 32-bit values in as[lane_lo] and as[lane_lo + 2]
> 7193:   // and store in vs.

I think you could be a bit more specific in explaining what happens here: we 
compute the partial results of some 52 x 52 bit multiplications where the 
multiplicands are stored as 64-bit values. This function computes the partial 
results of 8 such multiplications, (b_0, b_1, b_2, b_3) * (a_3, a_4). In a 
single call, either the high or the low 32 bits of the b_i values are 
multiplied by either the high or the low 32 bits of the a_j values, so four 
calls with the appropriate parameters produce the 64-bit low32 * low32, 
low32 * high32, high32 * low32 and high32 * high32 values in the output 
register sequences.
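
To make that concrete, a hedged scalar sketch (variable names are mine, not 
the stub's) of the decomposition the comment could spell out: one 52 x 52 bit 
product assembled from the four 32-bit half products that four calls of the 
subroutine provide.

    public class Mul52Model {
        public static void main(String[] args) {
            long a = 0xABCDE12345678L;   // sample 52-bit limb in a 64-bit value
            long b = 0x987654321FEDCL;   // sample 52-bit limb in a 64-bit value

            long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;  // 32-bit halves
            long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;

            long ll = bLo * aLo;  // low32  * low32   -- one call each
            long lh = bLo * aHi;  // low32  * high32
            long hl = bHi * aLo;  // high32 * low32
            long hh = bHi * aHi;  // high32 * high32

            // Recombine: a*b = ll + ((lh + hl) << 32) + (hh << 64).
            long mid   = lh + hl;                      // < 2^53, cannot overflow
            long lo    = ll + (mid << 32);             // low 64 bits of product
            long carry = Long.compareUnsigned(lo, ll) < 0 ? 1 : 0;
            long hi    = hh + (mid >>> 32) + carry;    // high 64 bits of product

            // Check against Java's 128-bit reference (a, b are non-negative):
            System.out.println(hi == Math.multiplyHigh(a, b) && lo == a * b);
        }
    }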

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7820:

> 7818:     //   IntegerPolynomialP521:  19 = 8 + 8 + 2 + 1
> 7819:     //   P521OrderField:         19 = 8 + 8 + 2 + 1
> 7820:     // Special Cases 5, 10, 14, 16, 19

Add a comment in the Java code that the intrinsic can only be used for these 
lengths. I would also change the Java code to use an intermediate method that 
has an assert checking the allowed lengths and calls the @IntrinsicCandidate 
conditionalAssign() method (this is an easy change since there is only one 
caller in the current JVM code).
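
A minimal sketch of that shape (class name, signature, and the select 
direction are assumptions for illustration, not the actual IntegerPolynomial 
code; only the length set 5, 10, 14, 16, 19 is taken from the stub above):

    import jdk.internal.vm.annotation.IntrinsicCandidate;

    class ConditionalAssignSketch {
        // Intermediate method: asserts the limb counts the stub supports,
        // then forwards to the intrinsified method.
        static void conditionalAssign(int set, long[] a, long[] b) {
            assert a.length == b.length
                && (a.length == 5 || a.length == 10 || a.length == 14
                    || a.length == 16 || a.length == 19)
                : "unsupported limb count: " + a.length;
            conditionalAssign0(set, a, b);
        }

        @IntrinsicCandidate
        private static void conditionalAssign0(int set, long[] a, long[] b) {
            long mask = -(long) set;  // set is 0 or 1: all-zeros or all-ones
            for (int i = 0; i < a.length; i++) {
                // branch-free select; constant-time behavior presumably
                // matters for the Java fallback as well
                a[i] = (a[i] & ~mask) | (b[i] & mask);
            }
        }
    }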

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7849:

> 7847:     __ dup(mask_vec, __ T2D, mask_scalar);
> 7848: 
> 7849:     __ push(r19, sp); //needed for length = 5

If it is only needed for length == 5, just save and restore on that branch.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768475020
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768474771
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768454954
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768474590
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2769260685
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768711654
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768730689
