On Fri, 22 May 2026 03:57:12 GMT, Volodymyr Paprotski <[email protected]> 
wrote:

>> Shawn Emery has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Add square() intrinsics
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 32:
> 
>> 30: 
>> 31: void multiply_25519_scalar(const Register aLimbs, const Register bLimbs, 
>> const Register rLimbs, MacroAssembler* _masm) {
>> 32:   Register c0    = r9;
> 
> typically, when I create a separate a helper like this, I like to separate 
> math from registers.. i.e. deal with stack and register names in the parent 
> and only pass logical names, (aLimbs, bLimbs, rLimbs, c[], tmp1, tmp2, tmp3, 
> etc)  into here.. 
> 
> Up to you, dont think its a hard rule.

Good point, I've pulled this out of the helper method.

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 37:
> 
>> 35:   Register c3    = r12;
>> 36:   Register c4    = r13;
>> 37:   Register c[]   = {c0, c1, c2, c3, c4};
> 
> Any reason to declare this separately, instead of `{r9,r10, r11, r12, r13}`

No reason now after switching to the array variable.  Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 44:
> 
>> 42: 
>> 43:   __ push(rbp);
>> 44:   __ push(rbx);
> 
> push_ppx/pop_ppx

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 51:
> 
>> 49:   __ push(rdx);
>> 50: 
>> 51:   __ xorq(c0, c0);
> 
> for (int i=0; i<5; i++) __ xorq(c[i], c[i]);

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 82:
> 
>> 80:   }
>> 81: 
>> 82:   // Carry-add with reduction from high limb
> 
> I tinkered with the asm a bit, but I cant make it work quickly; dont think 
> its possible, but maybe you know something and can make the idea work in the 
> future.. 
> 
> Usually these multiplications are giving you a result in the range of 2P, you 
> dont actually need a full reduction; Subtract P, then use the carry out from 
> the subtraction as (i.e. 
> [here](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/security/util/math/intpoly/MontgomeryIntegerPolynomialP256.java#L393-L417))
> 
> But that becomes a very close register-allocation puzzle (5 for c[] 5 for 
> c'[], 2xcarry, rLimbs, mask.. ~=at least 14 regs needed). You save some 
> multiplies by add/subtract.. the biasing with CARRY_ADD masks any benefit 
> even more. so this is likely minimal..

I'm not for sure that this would be more efficient in my case given that I'm 
dealing with signed limbs so a conditional check for subtract P would not work 
unless I first normalize the limbs to unsigned.  These normalization operations 
consisted of six instructions per row, which had doomed any efficiency that 
AVX-512 had with the integer fused multiply-add instructions because they 
operate on unsigned limbs.

> src/hotspot/cpu/x86/stubGenerator_x86_64_poly25519.cpp line 85:
> 
>> 83:   Register carry = bArg;
>> 84:   // Limb 3
>> 85:   __ mov64(carry, 0x4000000000000);
> 
> mask register is no longer needed; you can load to `0x4000000000000` to mask 
> then just movq to carry (save slightly on code size?)

Done.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316079912
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316080373
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316081862
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316080973
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316082573
PR Review Comment: https://git.openjdk.org/jdk/pull/31087#discussion_r3316081510

Reply via email to