Dear Security Group members,

I recently submitted a PR that introduces a parallel intrinsic implementation for AES/ECB operations. It aims to replace the current per-block processing approach and improve performance for multi-block encryption and decryption.

This work is motivated by several performance limitations in the existing AES/ECB/PKCS5Padding implementation (apart from the existing AVX-512 path):

1. *Excessive stub call overhead* – each 16-byte block triggers a separate intrinsic call, leading to high invocation frequency.
2. *Limited instruction-level parallelism* – serialized block processing does not fully utilize available ILP.
3. *Redundant setup and teardown* – encryption state is repeatedly initialized for every block.

Summary of changes

- Added a parallel AES intrinsic implementation to process multiple blocks in a single native call.
- Reduced intrinsic invocation overhead.
- Improved utilization of instruction-level parallelism.

Performance results (JMH)

Test platform: Intel(R) Core(TM) i9-14900HX

OpenJDK 17 baseline:

    Benchmark     Mode  Cnt      Score     Error  Units
    AesTest.test  avgt    5  13334.163 ± 220.891  ns/op

With optimized implementation:

    Benchmark     Mode  Cnt      Score     Error  Units
    AesTest.test  avgt    5  10391.371 ±  94.966  ns/op

Taking the ratio of the average times (13334.163 / 10391.371 ≈ 1.283), this is approximately a *28.3% improvement in throughput*, or equivalently about a 22.1% reduction in average time per operation.

I would greatly appreciate your feedback on:

- The design of the parallel intrinsic approach
- Any potential correctness or portability concerns
- Suggestions for further optimization or alignment with HotSpot intrinsic conventions

JBS Issue: https://bugs.openjdk.org/browse/JDK-8376164
This issue tracks the performance improvement of AES/ECB operations by introducing a parallel intrinsic to reduce per-block overhead and enhance throughput.

I am happy to revise or extend the patch based on your guidance.

Thank you for your time and for maintaining such a great platform.

Best regards,
Xinyang Wu
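
P.S. For anyone who wants a quick functional check that does not depend on the patch itself, here is a minimal sketch using only the public javax.crypto API. It is not code from the PR and it is not the AesTest benchmark; the class and method names (EcbBlockIndependence, encryptBulk, encryptPerBlock, roundTripPadded, sampleBytes) are hypothetical and chosen only for illustration. It checks the property that makes multi-block processing valid for ECB: each 16-byte block is encrypted independently, so encrypting block by block must produce the same bytes as encrypting the whole buffer in one call. It also round-trips a non-aligned buffer through AES/ECB/PKCS5Padding.

```java
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper class, for illustration only (not part of the PR).
class EcbBlockIndependence {
    static final int BLOCK = 16; // AES block size in bytes

    // Encrypt the whole buffer in a single call.
    // data.length must be a multiple of 16 because no padding is applied.
    static byte[] encryptBulk(byte[] key, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return c.doFinal(data);
    }

    // Encrypt one 16-byte block per call and stitch the results together.
    static byte[] encryptPerBlock(byte[] key, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        byte[] out = new byte[data.length];
        for (int off = 0; off < data.length; off += BLOCK) {
            byte[] ct = c.doFinal(data, off, BLOCK);
            System.arraycopy(ct, 0, out, off, BLOCK);
        }
        return out;
    }

    // Encrypt and then decrypt with PKCS5Padding; any input length is allowed.
    static byte[] roundTripPadded(byte[] key, byte[] data) throws Exception {
        SecretKeySpec k = new SecretKeySpec(key, "AES");
        Cipher enc = Cipher.getInstance("AES/ECB/PKCS5Padding");
        enc.init(Cipher.ENCRYPT_MODE, k);
        Cipher dec = Cipher.getInstance("AES/ECB/PKCS5Padding");
        dec.init(Cipher.DECRYPT_MODE, k);
        return dec.doFinal(enc.doFinal(data));
    }

    // Deterministic test bytes so runs are reproducible.
    static byte[] sampleBytes(int n) {
        byte[] b = new byte[n];
        for (int i = 0; i < n; i++) {
            b[i] = (byte) (i * 31 + 7);
        }
        return b;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = sampleBytes(16);       // 128-bit key
        byte[] aligned = sampleBytes(1024); // 64 full blocks
        byte[] unaligned = sampleBytes(1000);

        boolean sameCiphertext =
            Arrays.equals(encryptBulk(key, aligned), encryptPerBlock(key, aligned));
        boolean roundTripOk =
            Arrays.equals(unaligned, roundTripPadded(key, unaligned));

        System.out.println("bulk == per-block: " + sameCiphertext);
        System.out.println("padded round trip: " + roundTripOk);
    }
}
```

Running the same check on a baseline build and on a build with the patch applied should give identical ciphertext in both cases, since the change is intended to affect only how blocks are scheduled, not the result.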
