On Mon, 2 Mar 2026 03:47:46 GMT, xinyangwu <[email protected]> wrote:
>> ### Summary
>> This PR introduces a parallel intrinsic for AES/ECB operations to replace
>> the current per-block processing approach, reducing native call overhead and
>> improving throughput for multi-block operations.
>> ### Problem
>> Except on CPUs that support AVX512, the existing AES/ECB implementation
>> suffers from three major performance issues:
>> 1. Excessive stub call overhead: Each 16-byte block requires a separate
>> intrinsic call, resulting in high invocation frequency
>>
>> 2. Inefficient instruction-level parallelism: The serialized block
>> processing fails to fully utilize instruction-level parallelism
>>
>> 3. Redundant setup/teardown: Repeated initialization of encryption state for
>> each block
>> ### Changes
>> Added parallel AES intrinsic implementation
>> ### Testing
>> JMH benchmarks
>>
>> The change yields a **37.43%** throughput improvement (11518.846 /
>> 8381.499 ≈ 1.3743), which is about 27.2% less time per operation.
>>
>> On an Intel(R) Core(TM) i9-14900HX machine with the original implementation:
>>
>>
>> Benchmark     Mode  Cnt      Score    Error  Units
>> AesTest.test  avgt    5  11518.846 ± 68.621  ns/op
>>
>>
>> On the same machine with the optimized implementation:
>>
>>
>> Benchmark     Mode  Cnt     Score    Error  Units
>> AesTest.test  avgt    5  8381.499 ± 57.751  ns/op
>>
>>
>> All Tier-1 tests pass on linux-x64. This modification does not involve
>> changing the encryption or decryption logic.
>
> xinyangwu has updated the pull request with a new target base due to a merge
> or a rebase. The incremental webrev excludes the unrelated changes brought in
> by the merge/rebase. The pull request contains 11 additional commits since
> the last revision:
>
> - Remove trailing backslashes
> - Merge branch 'openjdk:master' into aes
> - refactor
> - Merge branch 'openjdk:master' into aes
> - 8376164: Optimize AES/ECB/PKCS5Padding implementation using full-message
> intrinsic stub and parallel RoundKey addition
> - Merge branch 'openjdk:master' into aes
> - 8376164: Optimize AES/ECB/PKCS5Padding implementation using full-message
> intrinsic stub and parallel RoundKey addition
> - Merge branch 'openjdk:master' into aes
> - Merge branch 'openjdk:master' into aes
> - Merge branch 'openjdk:master' into aes
> - ... and 1 more: https://git.openjdk.org/jdk/compare/41d1c23f...ef2effbc
src/hotspot/cpu/x86/stubGenerator_x86_64.hpp line 2:
> 1: /*
> 2: * Copyright (c) 2003, 2025, Oracle and/or its affiliates. All rights reserved.
The copyright year should be updated to 2026 on line 2.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2:
> 1: /*
> 2: * Copyright (c) 2019, 2025, Intel Corporation. All rights reserved.
Copyright year should be updated to 2026 on line 2 in
stubGenerator_x86_64_aes.cpp as well.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 1494:
> 1492: DoFour(pxor, xmm_key_tmp);
> 1493: for (int i = 1; i < rounds[k]; i++) {
> 1494: load_key(xmm_key_tmp, key, i * 0x10, xmm_key_shuf_mask);
We have 16 XMM registers and use only 6 of them. We could load 10 of the round
keys into the remaining 10 XMM registers before L_loop4 and keep them there, so
those 10 keys would not need to be reloaded on every loop iteration.
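To put a rough number on the saving, here is a hypothetical cost model in plain
C++ (not HotSpot code). It assumes that every round key, including key 0, is
currently loaded once per L_loop4 iteration and that 10 registers are free; the
function names are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical cost model, not HotSpot code: counts the round-key loads
// executed by a loop that processes four blocks per iteration.

// Assumption from the review comment: 16 XMM registers, 6 already in use.
constexpr int kSpareXmmRegisters = 10;

// Number of round keys for an AES key size: rounds + 1 (11, 13, or 15).
constexpr int round_keys(int key_bits) {
    return key_bits / 32 + 7;
}

// Assumed current shape: every round key is loaded inside the loop body.
constexpr std::size_t loads_reloading(int key_bits, std::size_t iterations) {
    return static_cast<std::size_t>(round_keys(key_bits)) * iterations;
}

// Suggested shape: up to 10 keys are loaded once before the loop and stay
// resident; only the keys that do not fit are reloaded per iteration.
constexpr std::size_t loads_hoisted(int key_bits, std::size_t iterations) {
    const int resident = std::min(round_keys(key_bits), kSpareXmmRegisters);
    const int reloaded = round_keys(key_bits) - resident;
    return static_cast<std::size_t>(resident) +
           static_cast<std::size_t>(reloaded) * iterations;
}
```

Under these assumptions, AES-128 would go from 11 key loads per iteration to 1,
and AES-256 from 15 to 5. Whether that shows up in the benchmark would still
need to be measured.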
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 1560:
> 1558: // rax - input length
> 1559: //
> 1560: address
> StubGenerator::generate_electronicCodeBook_decryptAESCrypt_Parallel() {
The encrypt and decrypt methods are very similar. Could we parameterize them
and use a single generate method for both?
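A minimal sketch of what I have in mind, using a mock assembler that only
records mnemonics. `MockAssembler`, `generate_ecb_rounds`, and the register
numbers are hypothetical and not the HotSpot API. The real stubs may differ in
more than the opcode (for example in round-key order), which would need its own
parameter.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the assembler: records mnemonics instead of
// emitting machine code. None of these names are HotSpot's.
struct MockAssembler {
    std::vector<std::string> emitted;

    void aesenc(int dst, int key)     { record("aesenc", dst, key); }
    void aesenclast(int dst, int key) { record("aesenclast", dst, key); }
    void aesdec(int dst, int key)     { record("aesdec", dst, key); }
    void aesdeclast(int dst, int key) { record("aesdeclast", dst, key); }

private:
    void record(const char* op, int dst, int key) {
        emitted.push_back(std::string(op) + " xmm" + std::to_string(dst) +
                          ", xmm" + std::to_string(key));
    }
};

using RoundOp = void (MockAssembler::*)(int, int);

// One generator for both directions: the flag only selects which
// instructions are emitted, the loop structure is shared.
void generate_ecb_rounds(MockAssembler& masm, bool is_encrypt, int rounds) {
    const RoundOp round = is_encrypt ? &MockAssembler::aesenc
                                     : &MockAssembler::aesdec;
    const RoundOp last  = is_encrypt ? &MockAssembler::aesenclast
                                     : &MockAssembler::aesdeclast;
    const int key_reg = 4;  // hypothetical register holding the round key

    for (int i = 1; i < rounds; i++) {
        for (int block = 0; block < 4; block++) {  // four blocks in flight
            (masm.*round)(block, key_reg);
        }
    }
    for (int block = 0; block < 4; block++) {
        (masm.*last)(block, key_reg);
    }
}
```

The two existing entry points could then become thin wrappers that pass `true`
or `false`, which keeps the stub names unchanged for callers.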
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/29385#discussion_r2892554297
PR Review Comment: https://git.openjdk.org/jdk/pull/29385#discussion_r2892560962
PR Review Comment: https://git.openjdk.org/jdk/pull/29385#discussion_r2893105681
PR Review Comment: https://git.openjdk.org/jdk/pull/29385#discussion_r2893115075