On Wed, 4 Feb 2026 20:52:15 GMT, Ben Perez <[email protected]> wrote:
>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()`
>> method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit
>> multiplication is not supported on Neon and manually performing this
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr
>> approach is used. Neon instructions are used to compute intermediate values
>> used in the last two iterations of the main "loop", while the GPRs compute
>> the first few iterations. At the method level this improves performance by
>> ~9% and at the API level roughly 5%.
>>
>> Performance no intrinsic (Apple M1):
>>
>> Benchmark (isMontBench) Mode Cnt Score
>> Error Units
>> PolynomialP256Bench.benchMultiply true thrpt 8 2427.562 ±
>> 24.923 ops/s
>> PolynomialP256Bench.benchMultiply false thrpt 8 1757.495 ±
>> 41.805 ops/s
>> PolynomialP256Bench.benchSquare true thrpt 8 2435.202 ±
>> 20.822 ops/s
>> PolynomialP256Bench.benchSquare false thrpt 8 2420.390 ±
>> 33.594 ops/s
>>
>> Benchmark (algorithm) (dataSize) (keyLength)
>> (provider) Mode Cnt Score Error Units
>> SignatureBench.ECDSA.sign SHA256withECDSA 1024 256
>> thrpt 40 8439.881 ± 29.838 ops/s
>> SignatureBench.ECDSA.sign SHA256withECDSA 16384 256
>> thrpt 40 7990.614 ± 30.998 ops/s
>> SignatureBench.ECDSA.verify SHA256withECDSA 1024 256
>> thrpt 40 2677.737 ± 8.400 ops/s
>> SignatureBench.ECDSA.verify SHA256withECDSA 16384 256
>> thrpt 40 2619.297 ± 9.737 ops/s
>>
>> Benchmark (algorithm) (keyLength)
>> (kpgAlgorithm) (provider) Mode Cnt Score Error Units
>> KeyAgreementBench.EC.generateSecret ECDH 256
>> EC thrpt 40 1905.369 ± 3.745 ops/s
>>
>> Benchmark (algorithm) (keyLength)
>> (kpgAlgorithm) (provider) Mode Cnt Score Error Units
>> KeyAgreementBench.EC.generateSecret ECDH 256
>> EC thrpt 40 1903.997 ± 4.092 ops/s
>>
>>
>> Performance with intrinsic (Apple M1):
>>
>> Benchmark (isMontBench) Mode Cnt Score
>> Error Units
>> PolynomialP256Bench.benchMultiply true thrpt 8 2676.599 ±
>> 24.722 ops/s
>> PolynomialP256Bench.benchMultiply false thrpt 8...
>
> Ben Perez has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Created subroutine for 32 bit vector multiplication
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3181:
> 3179: void umullv(FloatRegister Vd, SIMD_Arrangement Ta, FloatRegister Vn,
> 3180: SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant
> Ts, int lane) {
> 3181: assert(Ta == T4S || Ta == T2D, "umullv destination register must
> have arrangement T4S or T2D");
umullv -> umull{2}v in the assertion message (or consider moving the assertions
into the calling function)
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3182:
> 3180: SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant
> Ts, int lane) {
> 3181: assert(Ta == T4S || Ta == T2D, "umullv destination register must
> have arrangement T4S or T2D");
> 3182: assert(Ta == T4S ? (Tb == T4H && Ts == H) : (Tb == T2S && Ts == S),
> "umullv register arrangements must adhere to spec");
umullv -> umull{2}v in the assertion message (or consider moving the assertions
into the calling function)
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3188:
> 3186: void umull2v(FloatRegister Vd, SIMD_Arrangement Ta, FloatRegister Vn,
> 3187: SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant
> Ts, int lane) {
> 3188: assert(Ta == T4S || Ta == T2D, "umullv destination register must
> have arrangement T4S or T2D");
umullv -> umull2v in the assertion
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3189:
> 3187: SIMD_Arrangement Tb, FloatRegister Vm, SIMD_RegVariant
> Ts, int lane) {
> 3188: assert(Ta == T4S || Ta == T2D, "umullv destination register must
> have arrangement T4S or T2D");
> 3189: assert(Ta == T4S ? (Tb == T8H && Ts == H) : (Tb == T4S && Ts == S),
> "umullv register arrangements must adhere to spec");
umullv -> umull2v in the assertion
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7193:
> 7191:
> 7192: // Multiply each 32-bit value in bs by the 32-bit values in
> as[lane_lo] and as[lane_lo + 2]
> 7193: // and store in vs.
I think you could be a bit more specific in explaining what happens here: we
compute the partial results of
some 52 x 52 bit multiplications where the multiplicands are stored as 64-bit
values.
This function computes partial results of 8 such multiplication (b_0, b_1, b_2,
b_3) * (a_3, a_4).
In a call of this function, either the high or low 32 bits of the b_i values
are multiplied by either the high or low 32 bits of the b_j values, so four
calls with the appropriate parameters will produce the 64-bit low32 * low32,
low32 * high32, high32 * low 32 and high32 * high32 values in the output
register sequences.
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7820:
> 7818: // IntegerPolynomialP521: 19 = 8 + 8 + 2 + 1
> 7819: // P521OrderField: 19 = 8 + 8 + 2 + 1
> 7820: // Special Cases 5, 10, 14, 16, 19
Add a comment in the Java code that the intrinsic can only be used for these
lengths. I would also change the Java code to use an intermediate method that
has an assert checking the allowed lengths and calls the @IntrinsicCandidate
conditionalAssign() method (this is an easy change since there is only one
caller in the current JVM code).
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7849:
> 7847: __ dup(mask_vec, __ T2D, mask_scalar);
> 7848:
> 7849: __ push(r19, sp); //needed for length = 5
If it is only needed for length == 5, just save and restore on that branch.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768475020
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768474771
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768454954
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768474590
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2769260685
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768711654
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2768730689