On Thu, 14 May 2026 12:43:21 GMT, Ferenc Rakoczi <[email protected]> wrote:
>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult()
>> method and IntegerPolynomial.conditionalAssign(). Since 64-bit
>> multiplication is not supported on Neon and manually performing this
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr
>> approach is used. Neon instructions are used to compute intermediate values
>> used in the last two iterations of the main "loop", while the GPRs compute
>> the first few iterations. At the method level this improves performance by
>> ~9% and at the API level roughly 5%.
>>
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Ferenc Rakoczi has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Added AOT Code Cache related code + some cosmetic changes

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7738:

> 7736:   // so four calls with the appropriate parameters will produce the 64-bit
> 7737:   // low32 * low32, low32 * high32, high32 * low32, high32 * high32
> 7738:   // values in the output register sequences.

A little more detail would make this method easier to understand and would help clarify what is happening in the code that calls it.

Suggestion:

  // Calls to this function accept either the low 32 bits or high 20 bits
  // of each b_i packed into bs in ascending order. a_0 and a_1 are packed
  // into successive 64-bit elements of as. lane selects the low 32 or high
  // 20 bits of each a_j value. So four calls with the appropriate parameters
  // will produce the 64-bit low32 * low32, low32 * high20, high20 * low32,
  // high20 * high20 values in the output register sequences vs. The
  // 64-bit partial products are returned in vs in ascending order:
  // vs[0] = (b_0*a_0, b_1*a_0) . . . vs[3] = (b_2*a_1, b_3*a_1)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3257421911
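
For anyone following the low32/high20 wording in the suggestion, here is a minimal scalar sketch of how four such partial products recombine into the full product of two limbs. It is not the stub's Neon code: `U128` and `mul52_via_partials` are hypothetical names, and the sketch assumes each limb fits in 52 bits (low 32 plus high 20), which is what the suggested comment implies.

```cpp
#include <cstdint>

// Hypothetical helper type: a 128-bit value held as two 64-bit halves.
struct U128 {
    uint64_t lo;
    uint64_t hi;
};

// Multiply two limbs of at most 52 bits using only 32x32 -> 64-bit
// products, mirroring the low32/high20 split in the suggested comment.
// Assumption: a and b are both below 2^52.
inline U128 mul52_via_partials(uint64_t a, uint64_t b) {
    const uint64_t MASK32 = 0xFFFFFFFFULL;
    uint64_t a_lo = a & MASK32, a_hi = a >> 32;  // a_hi < 2^20
    uint64_t b_lo = b & MASK32, b_hi = b >> 32;  // b_hi < 2^20

    uint64_t ll = a_lo * b_lo;  // low32  * low32,  weight 2^0
    uint64_t lh = a_lo * b_hi;  // low32  * high20, weight 2^32
    uint64_t hl = a_hi * b_lo;  // high20 * low32,  weight 2^32
    uint64_t hh = a_hi * b_hi;  // high20 * high20, weight 2^64

    // Each cross term is below 2^52, so their sum cannot overflow 64 bits.
    uint64_t mid = lh + hl;

    // a*b = ll + mid * 2^32 + hh * 2^64, split across two 64-bit words.
    uint64_t lo = ll + (mid << 32);
    uint64_t carry = (lo < ll) ? 1 : 0;  // carry out of the low word
    uint64_t hi = hh + (mid >> 32) + carry;
    return {lo, hi};
}
```

With a = b = 2^52 - 1 the exact product is 2^104 - 2^53 + 1, so the sketch returns lo = 0xFFE0000000000001 and hi = 0xFFFFFFFFFF. Because the high parts are only 20 bits wide, the two cross terms can be added without tracking a carry.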
