Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Andrew Dinn Mon, 15 Jun 2026 05:31:32 -0700

On Fri, 22 May 2026 13:00:58 GMT, Ferenc Rakoczi <[email protected]> wrote:


>> Actually, I see now that my comment was not correct. The write interleaves 4 
>> successive pairs of low values with four pairs of successive high values 
>> i.e. (low0, low1) (high0, high1) . . .
>> Suggestion:
>> 
>>     // the write interleaves the 4 successive pairs of low and
>>     // high results: (l0, l1), (h0, h1), ... (l6, l7), (h6, h7)
>>     vs_st1_interleaved(A, D, mul_ptr);
>
> Yes. Accepted

I still have a problem with this.

With your definition of `vs_st1_interleaved` we end up with this weird 
organization of the multiplication products where we have two low pairs and 
then their corresponding high pairs repeated x 4. There is nothing in the code 
that explains this layout other than the comment here at the point of write 
(n.b. my 'over-engineered' suggestion was actually a way of making it explicit 
what was going on here).

That results in some rather unobvious indexes for the loads we see later which 
consume these x-products:


    __ ldr(low_1, Address(sp));
    __ ldr(high_1, Address(sp, 2 * BytesPerLong));
    __ ldr(low, Address(sp, BytesPerLong));
    __ ldr(high, Address(sp, 3 * BytesPerLong));
    . . .
    __ ldr(low_1, Address(sp, 4 * BytesPerLong));
    __ ldr(high_1, Address(sp, 6 * BytesPerLong));
    . . .
    __ ldr(low, Address(sp, 5 * BytesPerLong));
    __ ldr(high, Address(sp, 7 * BytesPerLong));
    . . .


Any maintainer looking at these `ldr` instructions (including me in about 6 
months time) will end up having to search back through the preceding code to 
connect the offsets with the comment at the point of call to 
`vs_st1_interleaved`. We can avoid that redundant effort either with comments 
or clearer code can avoid it. So, the choice is:

1. Comment each ldr to point back to the explanation of the saved cross-product 
data layout provided at the call site.
2. Change the write method to store the 64-bit elements interleaved in the more 
obvious order `(lo0, hi0, lo1, hi1, ...)`.

The second option seems far preferable to me.

We need to drop `vs_st1_interleaved` and replace it with he following 
overloaded variant of `vs_st2_post`:

  // store 2 x N-vector sequences interleaved into 2 * N quadword
  // memory locations via the address supplied in base using
  // post-increment addressing.
  template<int N>
  void vs_st2_post(const VSeq<N>& v1, const VSeq<N>& v2,
                   Assembler::SIMD_Arrangement T, Register base) {
    static_assert((N & (N - 1)) == 0, "sequence length must be even");
    for (int i = 0; i < N; i++) {
      __ st2(v1[i], v2[i], T, __ post(base, 32));
    }
  }


Then change the current call and comment:


    // Interleaves the 8 low products in A (l0, ..., l7)  and 8
    // high products in D (h1, ..., h7) into 8 quadwords on the
    // stack at mulptr: ((l0, h0), ..., (l7, h7))
    v2_st2_post(A, D, __ T2D, mul_ptr);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3413400993

Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Reply via email to