On Tue, 20 May 2025 11:51:49 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:
>> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 250: >> >>> 248: static void montmul(int outputRegs[], int inputRegs1[], int >>> inputRegs2[], >>> 249: int scratchRegs1[], int scratchRegs2[], MacroAssembler >>> *_masm) { >>> 250: for (int i = 0; i < 4; i++) { >> >> In the intrinsic for montMul we are treating as if MONT_R_BITS is 16 and >> MONT_Q_INV_MOD_R is 0xF301 whereas in the Java code MONT_R_BITS is 20 and >> MONT_Q_INT_MOD_R is 0x8F301. Are these equivalent? > > As used in this case, they are equivalent. For z = montmul(a,b), z will be > between -q and q and congruent to a * b * R^-1 mod q, where R > 2 * q, R is a > power of 2, -R/2 * q <= a * b < R/2 * q. For the Java code, we use R = 2^20 > and for the intrinsic, R = 2^16. In our computations, b is always c * R mod > q, so the montmul() really computes a * c mod q. In the Java code, we use > 32-bit numbers for the computations, and we use R = 2^20 because that way the > a * b numbers that occur during all computations stay in the required range > (the inverse NTT computation is where they can grow the most), so we don't > have to do Barrett reductions during that computation. For the intrinsics, we > use R = 2^16, because this way we can do twice as much work in parallel, but > we have to do Barrett reduction after levels 2 and 4 in the inverse NTT > computation. Thanks a lot for the explanation. It would be good to add it as a comment in the stubGenerator_x86_64_kyber.cpp. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2098491524