On Fri, 22 May 2026 02:46:29 GMT, Volodymyr Paprotski <[email protected]> wrote:
>> This PR:
>> - changes existing AVX512 SHA3 intrinsic to be more parallel
>> - adds an AVX2 SHA3 intrinsic
>> - change `SHA3Parallel.java` to NR=4 (to be able to exploit the AVX512
>> parallelism while keeping doubleKeccak for platforms where double
>> parallelism is preferable. I experimented with NR=8 as well, does also gain
>> a few percent, but I think NR=4 is sufficient tradeoff)
>>
>> Performance gains:
>> - `MessageDigestBench.digest`:
>>   - AVX2: **16%-39%**
>>   - AVX512: **24%-33%**
>> - `SignatureBench.MLDSA.sign`
>>   - AVX2: **6-12%**
>>   - AVX512: **11%-18%**
>> - `SignatureBench.MLDSA.verify`
>>   - AVX2: **2%-14%**
>>   - AVX512: **31%-40%**
>> - `KEMBench.MLKEM`
>>   - AVX2: **~5%**
>>   - AVX512: **14%-23%**
>> - `KEMBench.JSSE_*`
>>   - appears unaffected
>>
>> Note on intrinsics. (As noted in the code..) there are multiple entrypoints
>> wrapping the same intrinsic..
>> - `SHA3.implCompress`: single blockSize of user data xored with keccak
>> - `DigestBase.implCompressMultiBlock`: loop over user data and xor with
>> keccak
>> - `SHA3Parallel.doubleKeccak`: (still used for AVX2) no message data, just
>> two state vectors
>> - `SHA3Parallel.quadKeccak`: (AVX512 benefit) no message data, four state
>> vectors
>>
>> Note 1: `make test
>> TEST="micro:org.openjdk.bench.javax.crypto.full.MessageDigestBench
>> micro:org.openjdk.bench.javax.crypto.full.SignatureBench.MLDSA
>> micro:org.openjdk.bench.javax.crypto.full.KEMBench"`
>> Note 2: I have left more targeted fuzzing and benchmarks out of this PR, but
>> they are preserved at [on my
>> branch](https://github.com/vpaprotsk/jdk/compare/sha3-avx-quad...vpaprotsk:jdk:sha3-avx-quad-extras?expand=1).
>> If there is something you rather see pulled in.. (otherwise, can include a
>> diff in JBS for 'future reference')
>>
>> ---------
>> - [X] I confirm that I make this contribution in accordance with the
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Volodymyr Paprotski has updated the pull request incrementally with one
> additional commit since the last revision:
>
>   Comments from Aleksey Shipilev

The AVX2 benchmarks show mixed results (see attached) on my Intel Core i9-12900K Alder Lake 3.2GHz 24-Core w/32GB main memory:

ML-KEM decapsulation: -2% to 0% delta
ML-KEM encapsulation: -2% to 2% delta
ML-KEM key generation: -1% to 5% delta
ML-DSA sign of 1024 bytes: 0% to 5% delta
ML-DSA sign of 16384 bytes: -5% to 1% delta
ML-DSA verify of 1024 bytes: 6% to 12% delta
ML-DSA verify of 16384 bytes: -7% to 3% delta
ML-DSA key generation: 6% to 12% delta

As you can see from the numbers above and the attachment, the performance regressions i) are tied to data size for the sign/verify operations and ii) show up for ML-KEM's smaller key sizes.

For i), AVX2 has to do a number of shuffles (3 instructions) per round for the two 128-bit states, whereas the C2 inlining for rotations is probably already efficient in this area.

For ii), there is less work to do when expanding/generating the A matrix for the smaller key sizes.

Other possible causes of slowdown relative to AVX-512 are that AVX2 does not support a true quad Keccak and could pay a higher price for unused lanes.

[Intrinsics ML-KEM_ML-DSA Benchmarks - i9-8384353.pdf](https://github.com/user-attachments/files/28207824/Intrinsics.ML-KEM_ML-DSA.Benchmarks.-.i9-8384353.pdf)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31125#issuecomment-4532071028
