On Tue, 4 Nov 2025 16:38:49 GMT, Volodymyr Paprotski <[email protected]> 
wrote:

> - New AVX2 intrinsics are 1.6x-6.9x faster than Java baseline 
>    - `SignatureBench.MLDSA` is 1.2x-2.2x faster
>    - Note: there is no AVX2-SHA3 intrinsics yet (Being reviewed 
> https://github.com/vpaprotsk/jdk/pull/7)
> - AVX512 intrinsic improvements are 1.24x-1.5x faster then current version 
>   - `SignatureBench.MLDSA` is upto 5% faster, never slower
> 
> Note on intrinsic:
> - The emitted (existing) AVX512 assembler was not "significantly" changed; 
> mostly more efficient instruction selection and tighter register allocation, 
> which allowed removal of NTT loop and stack spill.
> - Code was refactored to allow reuse of same assembler (as possible) for 
> AVX512 and AVX2
> 
> Tests and benchmarks:
> - Added a fuzz test to ensure Java and intrinsic produces exactly same result
> - Added benchmark to measure the performance of intrinsic itself
> 
> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java 
> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java"
> make test TEST="test/jdk/sun/security/provider/acvp/Launcher.java 
> test/jdk/sun/security/provider/acvp/ML_DSA_Intrinsic_Test.java" 
> JTREG="JAVA_OPTIONS=-XX:UseAVX=2"
> make test 
> TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" 
> MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions 
> -XX:+UseDilithiumIntrinsics;FORK=1"
> make test 
> TEST="micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA" 
> MICRO="JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions 
> -XX:-UseDilithiumIntrinsics;FORK=1"

src/hotspot/cpu/x86/assembler_x86.cpp line 3865:

> 3863: void Assembler::vmovsldup(XMMRegister dst, XMMRegister src, int 
> vector_len) {
> 3864:     assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
> 3865:             (vector_len == AVX_256bit ? VM_Version::supports_avx2() :

Vector length 256 bit is supported by AVX=1.

src/hotspot/cpu/x86/assembler_x86.cpp line 3874:

> 3872: void Assembler::vmovshdup(XMMRegister dst, XMMRegister src, int 
> vector_len) {
> 3873:   assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
> 3874:           (vector_len == AVX_256bit ? VM_Version::supports_avx2() :

Vector length 256 bit is supported by AVX=1.

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 83:

> 81: // size 0 and 1 are used for initial and final shuffles respectivelly of
> 82: // dilithiumAlmostInverseNtt and dilithiumAlmostNtt.
> 83: // NOTE: For size 0 and 1, input1[] and input2[] are modified in-place

what is the size-in-bits when size is 0 and 1? What is the difference between 
size 0 and size1? The overloading of size makes it confusing.

src/hotspot/cpu/x86/stubGenerator_x86_64_dilithium.cpp line 137:

> 135:           for (int i = 0; i < regCnt; i++) {
> 136:             // 0b-1-2-3-1
> 137:             __ vshufps(output2[i], input1[i], input2[i], 0b11011101, 
> vector_len);

Did you mean this to be //0b-1-3-1-3?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2528279719
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2528288894
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2528416321
PR Review Comment: https://git.openjdk.org/jdk/pull/28136#discussion_r2528610634

Reply via email to