On Tue, 20 May 2025 11:51:49 GMT, Ferenc Rakoczi <[email protected]> wrote:
>> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 250:
>>
>>> 248: static void montmul(int outputRegs[], int inputRegs1[], int
>>> inputRegs2[],
>>> 249: int scratchRegs1[], int scratchRegs2[], MacroAssembler
>>> *_masm) {
>>> 250: for (int i = 0; i < 4; i++) {
>>
>> In the intrinsic for montMul we are treating as if MONT_R_BITS is 16 and
>> MONT_Q_INV_MOD_R is 0xF301 whereas in the Java code MONT_R_BITS is 20 and
>> MONT_Q_INT_MOD_R is 0x8F301. Are these equivalent?
>
> As used in this case, they are equivalent. For z = montmul(a,b), z will be
> between -q and q and congruent to a * b * R^-1 mod q, where R > 2 * q, R is a
> power of 2, -R/2 * q <= a * b < R/2 * q. For the Java code, we use R = 2^20
> and for the intrinsic, R = 2^16. In our computations, b is always c * R mod
> q, so the montmul() really computes a * c mod q. In the Java code, we use
> 32-bit numbers for the computations, and we use R = 2^20 because that way the
> a * b numbers that occur during all computations stay in the required range
> (the inverse NTT computation is where they can grow the most), so we don't
> have to do Barrett reductions during that computation. For the intrinsics, we
> use R = 2^16, because this way we can do twice as much work in parallel, but
> we have to do Barrett reduction after levels 2 and 4 in the inverse NTT
> computation.
Thanks a lot for the explanation. It would be good to add it as a comment in
the stubGenerator_x86_64_kyber.cpp.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2098491524