On Sat, 22 Mar 2025 20:02:31 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:

>> By using the AVX-512 vector registers the speed of the computation of the 
>> ML-DSA algorithms (key generation, document signing, signature verification) 
>> can be approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Further readability improvements.
>  - Added asserts for array sizes

src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 342:

> 340: // Performs two keccak() computations in parallel. The steps of the
> 341: // two computations are executed interleaved.
> 342: static address generate_double_keccak(StubGenerator *stubgen, 
> MacroAssembler *_masm) {

This function seems ok. I didnt do as line-by-line 'exact' review as for the 
NTT intrinsics, but just put the new version into a diff next to the original 
function. Seems like a reasonable clean 'refactor' (hardcode the blocksize, add 
new input registers 10-14. Makes it really easy to spot vs 0-4 original 
registers..)

I didnt realize before that the 'top 3 limbs' are wasted. I guess it doesnt 
matter, there are registers to spare aplenty and it makes the entire algorithm 
cleaner and easier to follow.

I did also stare at the algorithm with the 'What about AVX2' question.. This 
function would pretty much need to be rewritten it looks like :/

Last two questions.. 
- how much performance is gained from doubling this function up?
- If thats worth it.. what if instead it was quadrupled the input? (I scanned 
the java code, it looked like NR was parametrized already to 2..). It looks 
like there are almost enough registers here to go to 4 (I think 3 would need to 
be freed up somehow.. alternatively, the upper 3 limbs are empty in all 
operations, perhaps it could be used instead.. at the expense of readability)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2017636762

Reply via email to