On Thu, 5 Feb 2026 21:36:09 GMT, Ben Perez <[email protected]> wrote:
>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()`
>> method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit
>> multiplication is not supported on Neon, and performing it manually with
>> 32-bit limbs is slower than using the GPRs, a hybrid Neon/GPR approach is
>> used: Neon instructions compute the intermediate values used in the last
>> two iterations of the main "loop", while the GPRs compute the first few
>> iterations. At the method level this improves performance by ~9%, and at
>> the API level by roughly 5%.
>>
>> Performance without the intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>>
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score   Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  8439.881 ± 29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  7990.614 ± 30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  2677.737 ±  8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  2619.297 ±  9.737  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1905.369 ± 3.745  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>>
>>
>> Performance with the intrinsic (Apple M1):
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8...
>
> Ben Perez has updated the pull request incrementally with one additional
> commit since the last revision:
>
> fixed indexing bug in vs_ldpq, simplified vector loads in
> generate_intpoly_assign()
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7964:
> 7962: __ BIND(L_Length14);
> 7963: {
> 7964: Register a10 = r5;
It might be nice if these general-purpose register operations could be
condensed using e.g. a template type RSeq<N> and rs_xxx methods, as has been
done with the vector register operations. Even better, we could implement
RSeq and VSeq as subtypes of a common template type Seq<N, R>, with R bound
to Register or FloatRegister as a type parameter (see the sketch below).
I'm not suggesting that for this PR, but we should look into it via a
follow-up PR.
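For concreteness, here is a rough sketch of what that common template might
look like. It assumes the existing as_Register()/as_FloatRegister() helpers
and an encoding-plus-delta representation like the current VSeq; the
reg_maker trait and the exact signatures are invented purely for
illustration, not proposed as-is:

// hypothetical helper, not in register_aarch64.hpp: maps an encoding
// back to a value of the right register kind
template<typename R> struct reg_maker;
template<> struct reg_maker<Register> {
  static Register make(int enc) { return as_Register(enc); }
};
template<> struct reg_maker<FloatRegister> {
  static FloatRegister make(int enc) { return as_FloatRegister(enc); }
};

template<int N, typename R>
class Seq {
  static_assert(N >= 2, "sequence length must be at least 2");
  int _base;   // encoding of the first register in the sequence
  int _delta;  // encoding step between successive registers
public:
  Seq(R base_reg, int delta = 1)
    : _base(base_reg->encoding()), _delta(delta) { }
  R base() const { return reg_maker<R>::make(_base); }
  int delta() const { return _delta; }
  R operator[](int i) const {
    assert(0 <= i && i < N, "index out of range");
    return reg_maker<R>::make(_base + i * _delta);
  }
};

// RSeq and the existing VSeq would then just bind the register kind, and
// the rs_xxx/vs_xxx helpers could be written once against Seq<N, R>:
template<int N> using RSeq = Seq<N, Register>;
template<int N> using VSeq = Seq<N, FloatRegister>;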
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7988:
> 7986: __ ld1(a_vec[0], __ T2D, aLimbs);
> 7987: __ ldpq(a_vec[1], a_vec[2], Address(aLimbs, 16));
> 7988: __ ldpq(a_vec[3], a_vec[4], Address(aLimbs, 48));
I notice that here and elsewhere you have a 5-vector sequence and hence are not
using the vs_ldpq/stpq operations (because they only operate on even-length
sequences). However, if you add a bit of extra 'apparatus' to
register_aarch64.hpp you can then use the vs_ldpq/stpq operations.
Your code processes the first register individually via ld1/st1 and then the
remaining registers using a pair of loads, i.e. it operates as if the latter
were a VSeq<4>. So, in register_aarch64.hpp you can add these functions:
// head: the first register of the sequence
template<int N>
FloatRegister vs_head(const VSeq<N>& v) {
  static_assert(N > 1, "sequence length must be greater than 1");
  return v[0];
}

// tail: the subsequence starting at the second register (returning
// VSeq<N-1> lets N be deduced from the argument)
template<int N>
VSeq<N-1> vs_tail(const VSeq<N>& v) {
  static_assert(N > 2, "tail sequence length must be greater than 1");
  return VSeq<N-1>(v[1], v.delta());
}
With those methods available you should be able to do all these VSeq<5> loads
and stores using an ld1/st1 followed by a vs_ldpq_indexed or vs_stpq_indexed
with a suitable start index and the same constant offset array. For example,
here you could use:
Suggestion:
int offsets[2] = { 0, 32 };
__ ld1(vs_head(a_vec), __ T2D, aLimbs);
vs_ldpq_indexed(vs_tail(a_vec), aLimbs, 16, offsets);
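The matching VSeq<5> stores would then collapse the same way, assuming the
result sequence and its base address register are handled symmetrically
(r_vec and rLimbs below are placeholder names, not identifiers from the
patch):

__ st1(vs_head(r_vec), __ T2D, rLimbs);
vs_stpq_indexed(vs_tail(r_vec), rLimbs, 16, offsets);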
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2782146418
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2782125440