Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Andrew Dinn Tue, 19 May 2026 08:02:49 -0700

On Fri, 15 May 2026 09:52:20 GMT, Ferenc Rakoczi <[email protected]> wrote:


>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult() 
>> method and IntegerPolynomial.conditionalAssign(). Since 64-bit 
>> multiplication is not supported on Neon and manually performing this 
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr 
>> approach is used. Neon instructions are used to compute intermediate values 
>> used in the last two iterations of the main "loop", while the GPRs compute 
>> the first few iterations. At the method level this improves performance by 
>> ~9% and at the API level roughly 5%.
>> 
>> 
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Ferenc Rakoczi has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Accepting more suggestions from Andrew Dinn.

@ferakocz Still working my way through this. Here are a few more 
recommendations for comments and code cleanups. Also, I offered a correction to 
a comment where I misread the data layout for the Neon data written to the 
stack.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7869:

> 7867:     __ ldr(a_i, __ post(a, 8));
> 7868:     gpr_partial_mult_52(a_i, b_0, high, low, tmp, limb_mask);
> 7869:     __ andr(n, low, limb_mask);

It would be clearer to use `mov(n, low)` here or, at the very least `orr(n, zr, 
low)`. `low` has already been masked with `limb_mask`, which makes the `andr` 
look redundant when, in fact, the point is to save the initial value computed 
for `low`.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7873:

> 7871:     neon_partial_mult_64(B, b_highs, a_vals, 0);
> 7872: 
> 7873:     // Limb 0 cont

Suggestion:

    // Limb 0 modulus computation
    // n.b. modulus computation requires multiplying successive
    // limbs of the product by corresponding limbs of the p256
    // prime adding the result to the limb and folding this
    // partial result into a running 256-bit sum in c_i. Limbs
    // of c_i are stored via c_ptr once carries are included.
    // n.b. the mul + add is omitted for limb 2 since the
    // corresponding prime bits are zero.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7899:

> 7897:     gpr_partial_mult_52(a_i, b_2, high, low, tmp, limb_mask);
> 7898:     __ add(c_i, c_i, low);
> 7899:     __ str(c_i, Address(c_ptr, 8));

I don't really like the hard-coded numbers in these `c_ptr` address 
expressions. At the very least it would be better if they were be specified 
using a const expression like `nnn * BytesPerLong` to show that we are indexing 
an array of 64 bit data at index `nnn`.

A better idea would be to define a static helper method that also checks the 
index is in range


#define C_DATA_COUNT 5
static int c_slot_offset(index i) {
  assert(0 <= i && i < C_SLOT_COUNT, "invalid index for c_i data");
  return i * BytesPerLong;
}


You can then define the address as `Address(c_ptr, c_offset(1))` etc.

Also, once you have defined `C_DATA_COUNT` you can also define a constant 
C_DATA_SIZE for use when you extend the stack:


#define C_DATA_SIZE align_up(C_DATA_COUNT, 2) * BytesPerLong;

    . . .
    __ sub(sp, sp, C_DATA_SIZE);
    . . .


The same consideration applies with even more force as regards the hard-wired 
offsets used when accessing the Neon multiplication product data via `sp`.


    __ ldr(low_1, Address(sp));
    __ ldr(high_1, Address(sp, 16));


This also allows you to fold the details of the data layout into the function:
 

#define MUL_DATA_COUNT 8
static int mul_data_offset(index i, boolean want_high) {
  assert(0 <= i && i < MUL_DATA_COUNT, "invalid index for mul data");
  // data is stored as successive low/high pairs
  // (lo0, l1), (hi0, hi1), ... (hi2, hi3)
  int pair_start = (i >> 1) * 4;
  int high_select = (want_high ? 2 : 0)
  int pair_element = (i & 1);
  int idx = pair_start + high_select + pair_element;
  return (idx * BytesPerLong);
}

    . . .
    __ ldr(low_1, Address(sp, mul_data_offset(0, false));
    __ ldr(high_1, Address(sp, mul_data_offset(0, true)));
    . . .


Likewise you can define a constant MUL_DATA_SIZE for use when you extend the 
stack:


#define MUL_DATA_SIZE (MUL_DATA_COUNT * 2) * BytesPerLong;

    . . .
    __ sub(sp, sp, MUL_DATA_SIZE);
    . . .

-------------

PR Review: https://git.openjdk.org/jdk/pull/30941#pullrequestreview-4311333998
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3259942006
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3260445976
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3267260201

Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Reply via email to