Re: RFR: 8360934: Add AVX-512 intrinsics for ML-KEM - enhancement on AVX512_VBMI [v4]

Shawn M Emery Sun, 11 Jan 2026 23:27:40 -0800

On Sun, 11 Jan 2026 09:33:43 GMT, Jatin Bhateja <[email protected]> wrote:


>> Just a note on LoopAlignment: there are multiple moving parts here. First,
>> aligning the starting address of a loop to 64 bytes ([recommendation from
>> the Zen5 optimization guide](https://docs.amd.com/v/u/en-US/58455_1.00),
>> section 2.8.3) ensures that small loop bodies are not split across a cache
>> line. If that happens, there is a cold-entry penalty in the first iteration
>> of the loop, where the front end has to read multiple L1I cache lines. Once
>> the loop is decoded and its uops are part of the op-cache (AMD) or DSB
>> (Intel), the uop stream for successive loop iterations is issued from the
>> op-cache. Since the op-cache is shared between the 2 HW threads in an SMT
>> configuration, in noisy-neighbor scenarios or on context switches we may
>> hit the cold-entry penalty again during the lifetime of the loop.
>> 
>> So it's advisable to add alignment in this case; for other labels before
>> loops we already have OptoLoopAlignment in place.
>
>> > Better to align the loop starting address to OptoLoopAlignment
>> 
>> For parity, should I do this for the other labels in the file as well?
>> 
>> > I will run the micro benchmark on AMD Turin and report back by early next 
>> > week.
>> 
>> That would be great, thank you for doing this!
> 
> Here are the scores on Turin.
> 
> 
> Baseline:
> Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score   Error  Units
> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62235.790          ops/s
> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38238.390          ops/s
> KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24725.512          ops/s
> 
> Withopt:
> Benchmark                                    (algorithm)  (keyLength)  (provider)   Mode  Cnt      Score   Error  Units
> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-512            0              thrpt    2  62483.697          ops/s
> KeyPairGeneratorBench.MLKEM.generateKeyPair   ML-KEM-768            0              thrpt    2  38464.272          ops/s
> KeyPairGeneratorBench.MLKEM.generateKeyPair  ML-KEM-1024            0              thrpt    2  24702.044          ops/s
> 
> 
> 
> Baseline:
> Benchmark             (algorithm)  (provider)   Mode  Cnt      Score   Error  Units
> KEMBench.decapsulate   ML-KEM-512              thrpt    2  46416.479          ops/s
> KEMBench.decapsulate   ML-KEM-768              thrpt    2  28516.289          ops/s
> KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19250.020          ops/s
> KEMBench.encapsulate   ML-KEM-512              thrpt    2  60374.724          ops/s
> KEMBench.encapsulate   ML-KEM-768              thrpt    2  36226.100          ops/s
> KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23656.223          ops/s
> 
> Withopt:
> Benchmark             (algorithm)  (provider)   Mode  Cnt      Score   Error  Units
> KEMBench.decapsulate   ML-KEM-512              thrpt    2  46730.153          ops/s
> KEMBench.decapsulate   ML-KEM-768              thrpt    2  28650.349          ops/s
> KEMBench.decapsulate  ML-KEM-1024              thrpt    2  19390.927          ops/s
> KEMBench.encapsulate   ML-KEM-512              thrpt    2  60238.211          ops/s
> KEMBench.encapsulate   ML-KEM-768              thrpt    2  36454.138          ops/s
> KEMBench.encapsulate  ML-KEM-1024              thrpt    2  23649.839          ops/s
> 
> 
> System was...

Thank you for sharing these results.  It is disconcerting to see the drop in
performance for i) key generation with ML-KEM-1024, ii) encapsulation with
ML-KEM-512, and iii) encapsulation with ML-KEM-1024, though I don't know the
standard error (SE) for these runs.  During my testing on an AMD EPYC 9J14
96-Core Processor I consistently get noticeable performance increases for all
ML-KEM operations:

[Publish ML_KEM Benchmarks - 
Sheet1.pdf](https://github.com/user-attachments/files/24559070/Publish.ML_KEM.Benchmarks.-.Sheet1.pdf)
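
For reference, the size of the three drops in the quoted Turin scores can be
worked out directly from the tables above.  The following is only a minimal
sketch: the class name `RelativeChange` and the method `percentChange` are
illustrative and not part of the PR or of JMH, and the scores are copied by
hand from the quoted Baseline and Withopt runs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelativeChange {
    // Percent change of the patched score relative to the baseline score.
    // Negative values mean the patched build was slower.
    static double percentChange(double baseline, double withOpt) {
        return (withOpt - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // {baseline, withopt} throughput in ops/s, copied from the quoted Turin runs.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("generateKeyPair ML-KEM-1024", new double[] {24725.512, 24702.044});
        scores.put("encapsulate ML-KEM-512", new double[] {60374.724, 60238.211});
        scores.put("encapsulate ML-KEM-1024", new double[] {23656.223, 23649.839});

        // Print one line per benchmark with the signed percent change.
        for (Map.Entry<String, double[]> e : scores.entrySet()) {
            double[] s = e.getValue();
            System.out.printf("%-30s %+.3f%%%n", e.getKey(), percentChange(s[0], s[1]));
        }
    }
}
```

Each of the three drops works out to well under one percent.  With Cnt = 2 and
an empty Error column in the quoted output, differences of that size probably
cannot be separated from run-to-run noise without more iterations.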

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28815#discussion_r2681114748
