On Wed, 2 Apr 2025 07:38:34 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:

>> By using the AVX-512 vector registers the speed of the computation of the
>> ML-DSA algorithms (key generation, document signing, signature verification)
>> can be approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Reacting to comment by Sandhya.

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 802:

> 800:     __ evpbroadcastd(zero, scratch, Assembler::AVX_512bit); // 0
> 801:     __ addl(scratch, 1);
> 802:     __ evpbroadcastd(one, scratch, Assembler::AVX_512bit); // 1

A better way to initialize the (0, 1, -1) vectors is:

    // load 0 into int vector
    vpxor(zero, zero, zero, Assembler::AVX_512bit);
    // load -1 into int vector
    vpternlogd(minusOne, 0xff, minusOne, minusOne, Assembler::AVX_512bit);
    // load 1 into int vector
    vpsubd(one, zero, minusOne, Assembler::AVX_512bit);

Here minusOne could be xmm31.

A broadcast from a general-purpose (r) register to an xmm register is more expensive.

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 982:

> 980:     __ evporq(xmm19, k0, xmm19, xmm23, false, Assembler::AVX_512bit);
> 981:
> 982:     __ evpsubd(xmm12, k0, zero, one, false, Assembler::AVX_512bit); // -1

The -1 initialization could be done outside the loop.

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1015:

> 1013:     __ addptr(lowPart, 4 * XMMBYTES);
> 1014:     __ cmpl(len, 0);
> 1015:     __ jcc(Assembler::notEqual, L_loop);

It looks to me that subl and cmpl could be merged, since subl already sets the zero flag that jcc tests:

    __ addptr(highPart, 4 * XMMBYTES);
    __ addptr(lowPart, 4 * XMMBYTES);
    __ subl(len, 4 * XMMBYTES);
    __ jcc(Assembler::notEqual, L_loop);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2032172061
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2032171059
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2031979828
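
For reference, the arithmetic behind the suggested (0, 1, -1) initialization can be checked with a small scalar model. This is only a sketch of one 32-bit lane in plain C++, not HotSpot stub code: `ternlog_lane`, `LaneConstants` and `make_constants` are hypothetical names made up for this illustration, and the model assumes the usual VPTERNLOGD definition where the three input bits at each position index into the 8-bit immediate.

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of VPTERNLOGD (illustrative helper, not a
// HotSpot API): at every bit position the bits of a, b and c form a 3-bit
// index that selects one bit of the truth table imm8.
constexpr uint32_t ternlog_lane(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8) {
    uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        uint32_t idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        result |= static_cast<uint32_t>((imm8 >> idx) & 1u) << bit;
    }
    return result;
}

// Hypothetical container for the three per-lane constants.
struct LaneConstants {
    int32_t zero;
    int32_t minus_one;
    int32_t one;
};

// Mirrors the three suggested instructions for a single lane. 'stale' stands
// for whatever the registers held beforehand; none of the results depend on it.
constexpr LaneConstants make_constants(uint32_t stale) {
    uint32_t zero = stale ^ stale;                                // vpxor: x ^ x == 0
    uint32_t minus_one = ternlog_lane(stale, stale, stale, 0xff); // table 0xff: every bit set
    uint32_t one = zero - minus_one;                              // vpsubd: 0 - (-1) == 1 (mod 2^32)
    return LaneConstants{static_cast<int32_t>(zero),
                         static_cast<int32_t>(minus_one),
                         static_cast<int32_t>(one)};
}

// Compile-time checks: the constants come out the same for any prior contents.
static_assert(make_constants(0xdeadbeefu).zero == 0, "zero lane");
static_assert(make_constants(0xdeadbeefu).minus_one == -1, "minus-one lane");
static_assert(make_constants(0xdeadbeefu).one == 1, "one lane");
```

The point of the model is that none of the three values depends on the previous register contents, so no general-purpose register or memory load is needed to seed them.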