On Sat, 5 Apr 2025 00:27:05 GMT, Sandhya Viswanathan <sviswanat...@openjdk.org> wrote:
>> Ferenc Rakoczi has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Reacting to comment by Sandhya. > > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 345: > >> 343: >> 344: store4Xmms(coeffs, 0, xmm0_3, _masm); >> 345: store4Xmms(coeffs, 4 * XMMBYTES, xmm4_7, _masm); > > This seems to be unnecessary store. Thanks for catching that. Changed. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 370: > >> 368: loadPerm(xmm16_19, perms, nttL4PermsIdx, _masm); >> 369: loadPerm(xmm12_15, perms, nttL4PermsIdx + 64, _masm); >> 370: load4Xmms(xmm24_27, zetas, 4 * 512, _masm); // for level 3 > > The comment // for level3 is not relevant here and could be removed. Ooops. Deleted the comment. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 802: > >> 800: __ evpbroadcastd(zero, scratch, Assembler::AVX_512bit); // 0 >> 801: __ addl(scratch, 1); >> 802: __ evpbroadcastd(one, scratch, Assembler::AVX_512bit); // 1 > > A better way to initialize (0, 1, -1) vectors is: > // load 0 into int vector > vpxor(zero, zero, zero, Assembler::AVX_512bit); > // load -1 into int vector > vpternlogd(minusOne, 0xff, minusOne, minusOne, Assembler::AVX_512bit); > // load 1 into int vector > vpsubd(one, zero, minusOne, Assembler::AVX_512bit); > > Where minusOne could be xmm31. > > A broadcast from r register to xmm register is more expensive. Changed. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 982: > >> 980: __ evporq(xmm19, k0, xmm19, xmm23, false, Assembler::AVX_512bit); >> 981: >> 982: __ evpsubd(xmm12, k0, zero, one, false, Assembler::AVX_512bit); // -1 > > The -1 initialization could be done outside the loop. Not really. All registers are used. > src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 1015: > >> 1013: __ addptr(lowPart, 4 * XMMBYTES); >> 1014: __ cmpl(len, 0); >> 1015: __ jcc(Assembler::notEqual, L_loop); > > It looks to me that subl and cmpl could be merged: > __ addptr(highPart, 4 * XMMBYTES); > __ addptr(lowPart, 4 * XMMBYTES); > __ subl(len, 4 * XMMBYTES); > __ jcc(Assembler::notEqual, L_loop); Changed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057184 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057342 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057700 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057565 PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2034057463