On Tue, 13 May 2025 17:53:50 GMT, Sandhya Viswanathan 
<sviswanat...@openjdk.org> wrote:

>> Ferenc Rakoczi has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Restoring copyright notice on ML_KEM.java
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 693:
> 
>> 691: // a (short[256]) = c_rarg1
>> 692: // b (short[256]) = c_rarg2
>> 693: // kyberConsts (short[40]) = c_rarg3
> 
> kyberConsts is not one of the arguments passed in.

Fixed.

> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 696:
> 
>> 694: address generate_kyberAddPoly_2_avx512(StubGenerator *stubgen,
>> 695:                                        MacroAssembler *_masm) {
>> 696: 
> 
> The Java code for "implKyberAddPoly(short[] result, short[] a, short[] b)" 
> does BarrettReduction but the intrinsic code here does not. Is that 
> intentional and how is the reduction handled?

Actually, the Java version is the one that is too cautious. There is a Barrett 
reduction after at most 4 consecutive uses of mlKemAddPoly(), so doing the 
reduction in implKyberAddPoly() is not necessary. Thanks for discovering this!
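
For intuition, a rough bound (assuming each input coefficient is already 
Barrett-reduced before the additions, i.e. |c| < q = 3329): even the sum of 
five such coefficients is at most 5 * 3328 = 16640, well within the 16-bit 
short range of +/-32767, so the additions cannot overflow before the next 
reduction.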

> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 742:
> 
>> 740: // b (short[256]) = c_rarg2
>> 741: // c (short[256]) = c_rarg3
>> 742: // kyberConsts (short[40]) = c_rarg4
> 
> kyberConsts is not one of the arguments passed in.

Fixed.

> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 799:
> 
>> 797: // parsedLength (int) = c_rarg3
>> 798: address generate_kyber12To16_avx512(StubGenerator *stubgen,
>> 799:                                     MacroAssembler *_masm) {
> 
> If AVX512_VBMI and AVX512_VBMI2 are available, it looks to me that the loop 
> body of this algorithm can be implemented using more efficient instructions 
> in 5 simple steps:
> 
> Step 1:
> Load 0-47, 48-95, 96-143, 144-191 condensed bytes into xmm0, xmm1, xmm2, xmm3 
> respectively using masked load.
> 
> Step 2:
> Use vpermb to arrange xmm0 such that bytes 1, 4, 7, ... are duplicated:
> xmm0 before: b47 b46 ... b1 b0 (where each b is a byte)
> xmm0 after:  b47 b46 b46 b45 ... b5 b4 b4 b3 b2 b1 b1 b0
> Repeat this for xmm1, xmm2, xmm3
> 
> Step 3:
> Use vpshldvw to shift every word (16 bits) in xmm0 by a per-word variable 
> amount:
> shift word 31 by 4, word 30 by 0, ..., word 3 by 4, word 2 by 0, word 1 by 4, 
> word 0 by 0
> Repeat this for xmm1, xmm2, xmm3
> 
> Step 4:
> Use vpand to "and" each word element in xmm0 with 0xfff.
> Repeat this for xmm1, xmm2, xmm3
> 
> Step 5:
> Store xmm0 into parsed
> Store xmm1 into parsed + 64
> Store xmm2 into parsed + 128
> Store xmm3 into parsed + 192
> 
> If you think there is not sufficient time, we could look into it after the 
> merge of this PR as well.

Yes, that way we could speed this function up a little (in isolation the gain 
might even be significant), but with the current intrinsics its contribution to 
the overall running time is about 1.5%, so it would not matter that much, while 
on the other hand not all AVX-512 capable processors have VBMI. So I would 
rather not do it in this PR; a rough sketch of the suggested approach is 
included below for future reference.
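
A minimal sketch of the 5-step idea above, written with C++ AVX-512 intrinsics 
rather than the MacroAssembler API used in stubGenerator_x86_64_kyber.cpp. The 
function name, the assumption that parsedLength is a multiple of 32, and the 
use of a plain per-word right shift (vpsrlvw) in place of vpshldvw are 
illustrative choices, not the actual intrinsic code:

    #include <immintrin.h>
    #include <cstdint>

    // Hypothetical helper illustrating the suggested VBMI/VBMI2 loop body:
    // unpack packed 12-bit values (condensed) into 16-bit shorts (parsed).
    static void kyber12To16_sketch(const uint8_t* condensed, int16_t* parsed,
                                   int parsedLength) {
      // Step 2 constant: output bytes {3k, 3k+1, 3k+1, 3k+2} for k = 0..15,
      // i.e. every middle byte of a 3-byte group is duplicated.
      alignas(64) uint8_t idx[64];
      for (int k = 0; k < 16; k++) {
        idx[4 * k]     = (uint8_t)(3 * k);
        idx[4 * k + 1] = (uint8_t)(3 * k + 1);
        idx[4 * k + 2] = (uint8_t)(3 * k + 1);
        idx[4 * k + 3] = (uint8_t)(3 * k + 2);
      }
      const __m512i perm   = _mm512_load_si512(idx);
      // Step 3 constant: shift even words by 0, odd words by 4.
      const __m512i shifts = _mm512_set1_epi32(0x00040000);
      const __m512i mask12 = _mm512_set1_epi16(0x0fff);

      for (int done = 0; done < parsedLength; done += 32) {
        // Step 1: masked load of the next 48 condensed bytes.
        __m512i v = _mm512_maskz_loadu_epi8(0x0000FFFFFFFFFFFFULL, condensed);
        // Step 2: vpermb byte shuffle (needs AVX512_VBMI).
        v = _mm512_permutexvar_epi8(perm, v);
        // Step 3: variable right shift per 16-bit word (vpsrlvw).
        v = _mm512_srlv_epi16(v, shifts);
        // Step 4: keep the low 12 bits of each word.
        v = _mm512_and_si512(v, mask12);
        // Step 5: store 32 parsed shorts.
        _mm512_storeu_si512(parsed, v);
        condensed += 48;
        parsed    += 32;
      }
    }

The byte permutation and the shift counts are loop invariants, so they would 
only need to be set up once before the loop.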

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2088738946
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2088738841
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2088738704
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2088738615
