On Sun, 11 Jan 2026 09:33:43 GMT, Jatin Bhateja <[email protected]> wrote:
>> Just a note on LoopAlignment, there are multiple moving parts here, first
>> aligning starting addresses of loop to 64 ([recommendation from Zen5
>> optimization guide](https://docs.amd.com/v/u/en-US/58455_1.00) section
>> 2.8.3) ensure small loop bodies are not split-across the cache line, if that
>> happens then there is a cold entry penalty in the first iteration of loop,
>> where front-end will have to read multiple L1I cache lines, once its decoded
>> and uops are part of Op-cache (AMD) or DSB (Intel). There onwards uops
>> stream for successive loop iterations are issued from op-cache. Since
>> op-cache is shared b/w 2 HW threads in SMT configuration hence in case of
>> noisy neighbor scenarios or context-switches we may hit cold-entry penalty
>> during lifetime of loop.
>>
>> So its advisable to add alignment in this case for other labels before loops
>> we already have OptoLoopAlignment in place.
>
>> > Better to align loop sarting address to OptoLoopAlignment
>>
>> For parity, should I do this for the other labels in the file as well?
>>
>> > I will run the micro benchmark on AMD Turin and report back by early next
>> > week.
>>
>> That would be great, thank you for doing this!
>
> Here are the score on Turin.
>
> Baseline:
> Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score  Error  Units
> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62235.790         ops/s
> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38238.390         ops/s
> KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24725.512         ops/s
>
> Withopt:
> Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score  Error  Units
> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62483.697         ops/s
> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38464.272         ops/s
> KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24702.044         ops/s
>
> Baseline:
> Benchmark             (algorithm)  (provider)   Mode  Cnt      Score  Error  Units
> KEMBench.decapsulate   ML-KEM-512              thrpt    2  46416.479         ops/s
> KEMBench.decapsulate   ML-KEM-768              thrpt    2  28516.289         ops/s
> KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19250.020         ops/s
> KEMBench.encapsulate   ML-KEM-512              thrpt    2  60374.724         ops/s
> KEMBench.encapsulate   ML-KEM-768              thrpt    2  36226.100         ops/s
> KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23656.223         ops/s
>
> Withopt:
> Benchmark             (algorithm)  (provider)   Mode  Cnt      Score  Error  Units
> KEMBench.decapsulate   ML-KEM-512              thrpt    2  46730.153         ops/s
> KEMBench.decapsulate   ML-KEM-768              thrpt    2  28650.349         ops/s
> KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19390.927         ops/s
> KEMBench.encapsulate   ML-KEM-512              thrpt    2  60238.211         ops/s
> KEMBench.encapsulate   ML-KEM-768              thrpt    2  36454.138         ops/s
> KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23649.839         ops/s
>
> System was...

Thank you for sharing these results. It is disconcerting to see the drop in performance for i) key generation with ML-KEM-1024, ii) encapsulation with ML-KEM-512, and iii) encapsulation with ML-KEM-1024, though I don't know the standard error (SE) for these runs.
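
To put rough numbers on those drops, the relative change per benchmark can be computed straight from the quoted Turin scores. This is only a sketch: the scores are copied from the tables above, and the names `scores`, `pct_change`, `changes`, and `regressions` are made up for illustration.

```python
# Scores in ops/s copied from the quoted Turin tables: (baseline, withopt).
scores = {
    ("generateKeyPair", "ML-KEM-512"): (62235.790, 62483.697),
    ("generateKeyPair", "ML-KEM-768"): (38238.390, 38464.272),
    ("generateKeyPair", "ML-KEM-1024"): (24725.512, 24702.044),
    ("decapsulate", "ML-KEM-512"): (46416.479, 46730.153),
    ("decapsulate", "ML-KEM-768"): (28516.289, 28650.349),
    ("decapsulate", "ML-KEM-1024"): (19250.020, 19390.927),
    ("encapsulate", "ML-KEM-512"): (60374.724, 60238.211),
    ("encapsulate", "ML-KEM-768"): (36226.100, 36454.138),
    ("encapsulate", "ML-KEM-1024"): (23656.223, 23649.839),
}


def pct_change(baseline, withopt):
    """Relative change of withopt over baseline, in percent."""
    return (withopt - baseline) / baseline * 100.0


# Percent change for every (operation, parameter set) pair.
changes = {key: pct_change(*pair) for key, pair in scores.items()}

# Benchmarks whose throughput went down with the change applied.
regressions = {key for key, delta in changes.items() if delta < 0}

for (op, alg), delta in changes.items():
    print(f"{op:16s} {alg:12s} {delta:+.3f}%")
```

On these numbers every change, up or down, is below 1% in magnitude, and the three drops are each below 0.25%. With `Cnt` at 2 and no `Error` column reported, that alone does not say whether the differences are outside run-to-run noise.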

During my testing on an AMD EPYC 9J14 96-Core Processor, I consistently get noticeable performance increases for all ML-KEM operations: [Publish ML_KEM Benchmarks - Sheet1.pdf](https://github.com/user-attachments/files/24559070/Publish.ML_KEM.Benchmarks.-.Sheet1.pdf)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2681114748
