Re: RFR: 8351412: Add AVX-512 intrinsics for ML-KEM [v6]

Ferenc Rakoczi Tue, 20 May 2025 04:58:00 -0700

On Fri, 16 May 2025 00:28:18 GMT, Sandhya Viswanathan 
<sviswanat...@openjdk.org> wrote:


>> Ferenc Rakoczi has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Response to review comment + loading constants with broadcast op.
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 250:
> 
>> 248: static void montmul(int outputRegs[], int inputRegs1[], int 
>> inputRegs2[],
>> 249:              int scratchRegs1[], int scratchRegs2[], MacroAssembler 
>> *_masm) {
>> 250:    for (int i = 0; i < 4; i++) {
> 
> In the intrinsic for montMul we are treating as if MONT_R_BITS is 16 and 
> MONT_Q_INV_MOD_R is 0xF301 whereas in the Java code MONT_R_BITS is 20 and 
> MONT_Q_INT_MOD_R is 0x8F301. Are these equivalent?

As used in this case, they are equivalent.  For z = montmul(a,b), z will be  
between -q and q and congruent to a * b * R^-1 mod q, where R > 2 * q, R is a 
power of 2, -R/2 * q <= a * b < R/2 * q. For the Java code, we use R = 2^20 and 
for the intrinsic, R = 2^16. In our computations, b is always c * R mod q, so 
the montmul() really  computes a * c mod q. In the Java code, we use 32-bit 
numbers for the computations, and we use R = 2^20 because that way the a * b 
numbers that occur during all computations stay in the required range (the 
inverse NTT computation is where they can grow the most), so we don't have to 
do Barrett reductions during that computation. For the intrinsics, we use R = 
2^16, because this way we can do twice as much work in parallel, but we have to 
do Barrett reduction after levels 2 and 4 in the inverse NTT computation.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2097757145

Re: RFR: 8351412: Add AVX-512 intrinsics for ML-KEM [v6]

Reply via email to