Dear Security Group members,

I recently submitted a PR that introduces a parallel intrinsic
implementation for AES/ECB operations, aiming to replace the current
per-block processing approach and improve performance for multi-block
encryption/decryption.

This work is motivated by several performance limitations in the existing
AES/ECB/PKCS5Padding implementation (apart from the AVX-512 code path):

   1. *Excessive stub call overhead* – each 16-byte block triggers a separate
      intrinsic call, leading to high invocation frequency.

   2. *Limited instruction-level parallelism* – serialized block processing
      does not fully utilize available ILP.

   3. *Redundant setup and teardown* – encryption state is repeatedly
      initialized for every block.

Summary of changes

   - Added a parallel AES intrinsic implementation to process multiple blocks
     in a single native call.

   - Reduced intrinsic invocation overhead.

   - Improved utilization of instruction-level parallelism.
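
Processing several blocks per call is safe for ECB because each 16-byte
block is encrypted independently of the others. The following is a minimal
Java-level sketch of that property, not the HotSpot stub code from the PR;
the class and method names are hypothetical, and NoPadding is used so that
only whole blocks are compared.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.util.Arrays;

// Hypothetical illustration class; not part of the PR.
public class EcbBlockIndependence {

    // Encrypts the whole buffer with a single doFinal call.
    static byte[] encryptBulk(byte[] key, byte[] data) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return cipher.doFinal(data);
    }

    // Encrypts the buffer one 16-byte block at a time.
    // data.length is assumed to be a multiple of 16.
    static byte[] encryptPerBlock(byte[] key, byte[] data) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        byte[] out = new byte[data.length];
        for (int off = 0; off < data.length; off += 16) {
            // Each call handles exactly one block and writes it in place.
            cipher.doFinal(data, off, 16, out, off);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];          // all-zero demo key, illustration only
        byte[] data = new byte[64];         // four blocks of deterministic input
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        boolean same = Arrays.equals(encryptBulk(key, data),
                                     encryptPerBlock(key, data));
        System.out.println("bulk == per-block: " + same);
    }
}
```

Because the two results match block for block, a multi-block intrinsic can
interleave independent blocks without changing the output. This does not
hold for chained modes such as CBC encryption.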

Performance results (JMH)

Test platform: Intel(R) Core(TM) i9-14900HX

OpenJDK 17 baseline:

Benchmark     Mode  Cnt      Score     Error  Units
AesTest.test  avgt    5  13334.163 ± 220.891  ns/op

With optimized implementation:

Benchmark     Mode  Cnt      Score     Error  Units
AesTest.test  avgt    5  10391.371 ±  94.966  ns/op

This is a speedup of approximately 1.28x (13334.163 / 10391.371), i.e. about
*28.3% higher throughput*, or about 22.1% less time per operation.

I would greatly appreciate your feedback on:

   - The design of the parallel intrinsic approach

   - Any potential correctness or portability concerns

   - Suggestions for further optimization or alignment with HotSpot intrinsic
     conventions

JBS Issue: https://bugs.openjdk.org/browse/JDK-8376164

This issue tracks the performance improvement of AES/ECB operations by
introducing a parallel intrinsic to reduce per-block overhead and enhance
throughput.

I am happy to revise or extend the patch based on your guidance.
Thank you for your time and for maintaining such a great platform.

Best regards,
Xinyang Wu
