On Fri, 22 May 2026 03:57:12 GMT, Volodymyr Paprotski <[email protected]>
wrote:
>> Shawn Emery has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Add square() intrinsics
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 32:
>
>> 30:
>> 31: void multiply_25519_scalar(const Register aLimbs, const Register bLimbs,
>> const Register rLimbs, MacroAssembler* _masm) {
>> 32: Register c0 = r9;
>
> typically, when I create a separate a helper like this, I like to separate
> math from registers.. i.e. deal with stack and register names in the parent
> and only pass logical names, (aLimbs, bLimbs, rLimbs, c[], tmp1, tmp2, tmp3,
> etc) into here..
>
> Up to you, dont think its a hard rule.
Good point, I've pulled this out of the helper method.
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 37:
>
>> 35: Register c3 = r12;
>> 36: Register c4 = r13;
>> 37: Register c[] = {c0, c1, c2, c3, c4};
>
> Any reason to declare this separately, instead of `{r9,r10, r11, r12, r13}`
No reason now after switching to the array variable. Done.
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 44:
>
>> 42:
>> 43: __ push(rbp);
>> 44: __ push(rbx);
>
> push_ppx/pop_ppx
Done.
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 51:
>
>> 49: __ push(rdx);
>> 50:
>> 51: __ xorq(c0, c0);
>
> for (int i=0; i<5; i++) __ xorq(c[i], c[i]);
Done.
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 82:
>
>> 80: }
>> 81:
>> 82: // Carry-add with reduction from high limb
>
> I tinkered with the asm a bit, but I cant make it work quickly; dont think
> its possible, but maybe you know something and can make the idea work in the
> future..
>
> Usually these multiplications are giving you a result in the range of 2P, you
> dont actually need a full reduction; Subtract P, then use the carry out from
> the subtraction as (i.e.
> [here](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/security/util/math/intpoly/MontgomeryIntegerPolynomialP256.java#L393-L417))
>
> But that becomes a very close register-allocation puzzle (5 for c[] 5 for
> c'[], 2xcarry, rLimbs, mask.. ~=at least 14 regs needed). You save some
> multiplies by add/subtract.. the biasing with CARRY_ADD masks any benefit
> even more. so this is likely minimal..
I'm not for sure that this would be more efficient in my case given that I'm
dealing with signed limbs so a conditional check for subtract P would not work
unless I first normalize the limbs to unsigned. These normalization operations
consisted of six instructions per row, which had doomed any efficiency that
AVX-512 had with the integer fused multiply-add instructions because they
operate on unsigned limbs.
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 85:
>
>> 83: Register carry = bArg;
>> 84: // Limb 3
>> 85: __ mov64(carry, 0x4000000000000);
>
> mask register is no longer needed; you can load to `0x4000000000000` to mask
> then just movq to carry (save slightly on code size?)
Done.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316079912
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316080373
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316081862
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316080973
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316082573
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316081510