Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v8]

Andrew Haley Fri, 20 Feb 2026 09:18:53 -0800

On Fri, 20 Feb 2026 17:07:13 GMT, Andrew Haley <[email protected]> wrote:


>> That would be really useful! I tinkered with it a bit but would be nice to 
>> see what you had in mind
> 
> Like this:
> 
>   address generate_intpoly_montgomeryMult_P256() {
> 
>     __ align(CodeEntryAlignment);
>     StubId stub_id = StubId::stubgen_intpoly_montgomeryMult_P256_id;
>     StubCodeMark mark(this, stub_id);
>     address start = __ pc();
>     __ enter();
> 
>     static const int64_t modulus[] = {
>       0x000fffffffffffffL, 0x00000fffffffffffL,
>       0x0000001000000000L, 0x0000ffffffff0000L,
>       0L
>     };
> 
>     int shift1 = 12; // 64 - bits per limb
>     int shift2 = 52; // bits per limb
> 
>     // Registers that are used throughout entire routine
>     const Register a = c_rarg0;
>     const Register b = c_rarg1;
>     const Register result = c_rarg2;
> 
>     RegSet regs = RegSet::range(r0, r28) + rfp + lr - a - b - result;
>     FloatRegSet floatRegs = FloatRegSet::range(v0, v31)
>       - FloatRegSet::range(v8, v15)   // Callee-saved vectors
>       - FloatRegSet::range(v16, v31); // Manually-allocated vectors
> 
>     auto common_regs = regs.begin();
>     Register limb_mask = *common_regs++,
>       c_ptr = *common_regs++,
>       mod_0 = *common_regs++,
>       mod_1 = *common_regs++,
>       mod_3 = *common_regs++,
>       mod_4 = *common_regs++,
>       b_0 = *common_regs++,
>       b_1 = *common_regs++,
>       b_2 = *common_regs++,
>       b_3 = *common_regs++,
>       b_4 = *common_regs++;
>     regs = common_regs.remaining();
> 
>     auto common_vectors = floatRegs.begin();
>     FloatRegister limb_mask_vec = *common_vectors++,
>       b_lows = *common_vectors++,
>       b_highs = *common_vectors++,
>       a_vals = *common_vectors++;
> 
>     // Push callee saved registers on to the stack
>     RegSet callee_saved = RegSet::range(r19, r28);
>     __ push(callee_saved, sp);
> 
>     // Allocate space on the stack for carry values
>     __ sub(sp, sp, 48);
>     __ mov(c_ptr, sp);
> 
>     // Calculate limb mask
>     __ mov(limb_mask, -UCONST64(1) >> (64 - shift2));
>     __ dup(limb_mask_vec, __ T2D, limb_mask);
> 
>     // Load input arrays and modulus
>     {
>       auto r = regs.begin();
>       Register a_ptr = *r++, mod_ptr = *r++;
>       __ add(a_ptr, a, 24);
>       __ lea(mod_ptr, ExternalAddress((address)modulus));
>       __ ldr(b_0, Address(b));
>       __ ldr(b_1, Address(b, 8));
>       __ ldr(b_2, Address(b, 16));
>       __ ldr(b_3, Address(b, 24));
>       __ ldr(b_4, Address(b, 32));
>       __ ldr(mod_0, __ post(mod_ptr, 8));
>       __ ldr(mod_1, __ post(mod_ptr, 8));
>       __ ldr(mod_3, __ post(mod_ptr, 8));
>       __ ldr(mod_4, mod_ptr)...

Note that in a few places I've had to push back dead registers so that they can 
be reused. This is necessary because the live ranges of some registers 
partially overlap.

It's much better if you don't do that: instead, write a structured 
assembly-language program in which registers are allocated in scopes as needed, 
as I've done in the section which begins like this:


    // Load input arrays and modulus
    {
      auto r = regs.begin();
      Register a_ptr = *r++, mod_ptr = *r++;


Here, the registers that contain `a_ptr` and `mod_ptr` are taken from the outer 
block, and are free for reuse when the inner block exits.
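
To make the scoping idea concrete outside the stub generator, here is a 
minimal, self-contained sketch. It does not use HotSpot's RegSet; `RegPool`, 
`DemoResult` and `scoped_allocation_demo` are hypothetical names invented for 
this illustration, and registers are modelled as plain integers. The only 
point is that a register taken inside a block becomes available again once 
the block exits, with no explicit push-back.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a register set: bit i set means register i is free.
// This is NOT HotSpot's RegSet; it only models the allocation pattern.
class RegPool {
  uint32_t free_;
public:
  explicit RegPool(uint32_t free_bits) : free_(free_bits) {}

  // Take the lowest-numbered free register out of this pool.
  int take() {
    assert(free_ != 0 && "out of registers");
    int r = 0;
    while (((free_ >> r) & 1u) == 0) r++;
    free_ &= ~(1u << r);
    return r;
  }
};

struct DemoResult { int limb_mask, c_ptr, a_ptr, mod_ptr, tmp; };

DemoResult scoped_allocation_demo() {
  RegPool regs(0x00ffu);  // pretend registers 0..7 are available

  // Long-lived values are taken from the outer pool and stay allocated.
  int limb_mask = regs.take();
  int c_ptr = regs.take();

  int a_ptr, mod_ptr;
  {
    // Short-lived values are taken from a local copy of the pool.
    RegPool r = regs;
    a_ptr = r.take();
    mod_ptr = r.take();
    // ... code using a_ptr and mod_ptr would be emitted here ...
  } // the copy dies here; the outer pool never lost those registers

  int tmp;
  {
    RegPool r = regs;
    tmp = r.take();  // reuses the register that a_ptr occupied
  }
  return {limb_mask, c_ptr, a_ptr, mod_ptr, tmp};
}
```

In this sketch `tmp` ends up in the same register that `a_ptr` used, while 
`limb_mask` and `c_ptr` are never handed out twice. The block structure alone 
tracks the live ranges, so there is nothing to push back by hand.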

I hope the advantages of this style are clear: the program is easier to write 
and to maintain, and much less risky. Also, and most importantly for me, it's 
much easier to review!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2834240522
