On Tue, 13 May 2025 17:53:50 GMT, Sandhya Viswanathan
<[email protected]> wrote:
>> Ferenc Rakoczi has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>> Restoring copyright notice on ML_KEM.java
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 693:
>
>> 691: // a (short[256]) = c_rarg1
>> 692: // b (short[256]) = c_rarg2
>> 693: // kyberConsts (short[40]) = c_rarg3
>
> kyberConsts is not one of the arguments passed in.
Fixed.
> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 696:
>
>> 694: address generate_kyberAddPoly_2_avx512(StubGenerator *stubgen,
>> 695: MacroAssembler *_masm) {
>> 696:
>
> The Java code for "implKyberAddPoly(short[] result, short[] a, short[] b)"
> does BarrettReduction but the intrinsic code here does not. Is that
> intentional and how is the reduction handled?
Actually, it is the Java version that is overly cautious: a Barrett reduction
is performed after at most 4 consecutive uses of mlKemAddPoly(), so doing the
reduction in implKyberAddPoly() is not necessary. Thanks for catching this!
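For context, a minimal scalar sketch of Barrett reduction modulo q = 3329 is
shown below (Java, with illustrative names; this is not the actual ML_KEM.java
code). Assuming coefficients stay within a few multiples of q between
reductions, a 16-bit short has plenty of headroom for a handful of unreduced
additions:

    // Hedged sketch: scalar Barrett reduction modulo the ML-KEM prime q = 3329.
    // Names are illustrative; not the actual ML_KEM.java implementation.
    public final class BarrettSketch {
        private static final int KYBER_Q = 3329;
        private static final int BARRETT_MULTIPLIER = 20159; // round(2^26 / q)
        private static final int BARRETT_SHIFT = 26;

        // Maps a to a small (possibly negative) representative congruent to a mod q.
        static short barrettReduce(short a) {
            int t = (BARRETT_MULTIPLIER * a + (1 << (BARRETT_SHIFT - 1))) >> BARRETT_SHIFT;
            return (short) (a - t * KYBER_Q);
        }

        public static void main(String[] args) {
            // Even several unreduced additions stay well inside a 16-bit short:
            // 4 * 3300 = 13200, far below Short.MAX_VALUE (32767).
            short sum = (short) (3300 + 3300 + 3300 + 3300);
            System.out.println(barrettReduce(sum)); // -116, congruent to 13200 mod 3329
        }
    }
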
> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 742:
>
>> 740: // b (short[256]) = c_rarg2
>> 741: // c (short[256]) = c_rarg3
>> 742: // kyberConsts (short[40]) = c_rarg4
>
> kyberConsts is not one of the arguments passed in.
Fixed.
> src/hotspot/cpu/x86/stubGenerator_x86_64_kyber.cpp line 799:
>
>> 797: // parsedLength (int) = c_rarg3
>> 798: address generate_kyber12To16_avx512(StubGenerator *stubgen,
>> 799: MacroAssembler *_masm) {
>
> If AVX512_VBMI and AVX512_VBMI2 are available, it looks to me that the loop
> body of this algorithm can be implemented using more efficient instructions
> in 5 simple steps:
>
> Step 1:
> Load condensed bytes 0-47, 48-95, 96-143, and 144-191 into zmm0, zmm1, zmm2,
> and zmm3 respectively, using masked loads.
>
> Step 2:
> Use vpermb to arrange zmm0 such that bytes 1, 4, 7, ... are duplicated:
> zmm0 before: b47 b46 ... b1 b0 (where each b is a byte)
> zmm0 after:  b47 b46 b46 b45 ... b5 b4 b4 b3 b2 b1 b1 b0
> Repeat this for zmm1, zmm2, zmm3.
>
> Step 3:
> Use vpshldvw to shift every word (16 bits) in zmm0 by the appropriate
> variable amount:
> shift word 31 by 4, word 30 by 0, ..., word 3 by 4, word 2 by 0, word 1 by 4,
> word 0 by 0.
> Repeat this for zmm1, zmm2, zmm3.
>
> Step 4:
> Use vpand to mask each word element in zmm0 with 0xfff.
> Repeat this for zmm1, zmm2, zmm3.
>
> Step 5:
> Store zmm0 into parsed.
> Store zmm1 into parsed + 64.
> Store zmm2 into parsed + 128.
> Store zmm3 into parsed + 192.
>
> If you think there is not sufficient time, we could look into it after the
> merge of this PR as well.
Yes, that would speed this function up (and in isolation the gain might be
significant), but with the current intrinsics this function contributes only
about 1.5% of the overall running time, so it would not matter much; on the
other hand, not all AVX-512 capable processors have VBMI. So I would rather
not do it in this PR.
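For reference, the net effect of the five steps above on each 3-byte group is
the usual 12-bit-to-16-bit expansion that kyber12To16 performs. A minimal
scalar sketch in Java (illustrative names; not the actual ML_KEM.java or stub
code):

    // Hedged sketch: expand packed 12-bit values (3 bytes -> two shorts).
    // Names are illustrative; not the actual Java or intrinsic code.
    public final class Kyber12To16Sketch {
        static short[] twelveToSixteen(byte[] condensed, int parsedLength) {
            short[] parsed = new short[parsedLength];
            for (int i = 0, j = 0; i < parsedLength; i += 2, j += 3) {
                int b0 = condensed[j] & 0xff;
                int b1 = condensed[j + 1] & 0xff;
                int b2 = condensed[j + 2] & 0xff;
                parsed[i] = (short) (b0 | ((b1 & 0x0f) << 8));   // low 12 bits
                parsed[i + 1] = (short) ((b1 >> 4) | (b2 << 4)); // high 12 bits
            }
            return parsed;
        }

        public static void main(String[] args) {
            // 0xABC and 0xDEF packed into 3 bytes (12 bits each):
            byte[] packed = { (byte) 0xBC, (byte) 0xFA, (byte) 0xDE };
            short[] out = twelveToSixteen(packed, 2);
            System.out.printf("%03X %03X%n", out[0], out[1]); // prints "ABC DEF"
        }
    }
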
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2088738946
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2088738841
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2088738704
PR Review Comment: https://git.openjdk.org/jdk/pull/24953#discussion_r2088738615