Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v11]

Andrew Dinn Tue, 23 Jun 2026 01:55:19 -0700

On Mon, 22 Jun 2026 21:02:35 GMT, Ferenc Rakoczi <[email protected]> wrote:


> Apparently, it is C2 generating optimal code.

Ok, so it seems like we don't need to generate the assign intrinsic on AArch64. 
@ferakocz Can you provide numbers to confirm that omitting generation of the 
assign intrinsic makes no difference to the crypto computation?

An interesting follow-up question is whether we need the intrinsic on x86 and, 
if so, why. It's a little more complicated there, since we have slightly 
different code depending on whether UseAVX > 2. However, C2 also knows about 
and responds to UseAVX, so one might expect it to generate good-to-optimal 
code for x86 as well.

I ran the micro benchmark on my x86 box (AMD Ryzen 9 7900, 12 cores), first 
with the assign intrinsic generation enabled and then with it disabled, and got 
the following results:

Intrinsic enabled:

>    PolynomialP256Bench.benchAssign           true  thrpt    8  9712.115 ± 321.576  ops/s

Intrinsic disabled:

>    PolynomialP256Bench.benchAssign           true  thrpt    8  18814.815 ± 7985.590  ops/s

I'm not sure why the variance is so high in the disabled case, but for this HW 
(where we have AVX2 and AVX512 support) it looks like a similar picture: the 
intrinsic does not improve on the code C2 generates. So . . .
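As a rough sanity check on those two results, here is a minimal sketch that 
turns each score ± error pair into an interval and tests whether the two 
intervals overlap. The class and method names are hypothetical, and it assumes 
the ± figure is the half-width of the error interval that the benchmark harness 
reports; it is not a substitute for a proper statistical comparison.

```java
// Hypothetical helper, for illustration only; not part of the PR.
class IntervalCheck {
    // Returns {low, high} for a reported "score ± error" pair.
    // Assumption: error is the half-width of the reported interval.
    static double[] interval(double score, double error) {
        return new double[] { score - error, score + error };
    }

    // Two closed intervals overlap if each one starts before the other ends.
    static boolean overlaps(double[] a, double[] b) {
        return a[0] <= b[1] && b[0] <= a[1];
    }

    public static void main(String[] args) {
        // Figures copied from the benchAssign results quoted above (ops/s).
        double[] enabled = interval(9712.115, 321.576);
        double[] disabled = interval(18814.815, 7985.590);

        System.out.printf("enabled:  [%.3f, %.3f]%n", enabled[0], enabled[1]);
        System.out.printf("disabled: [%.3f, %.3f]%n", disabled[0], disabled[1]);
        System.out.println("overlap: " + overlaps(enabled, disabled));
    }
}
```

If the intervals do not overlap, then even the pessimistic end of the disabled 
run sits above the optimistic end of the enabled run, which would support 
dropping the intrinsic despite the noise.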

@ferakocz 
1. If you can show that the assign intrinsic does not improve performance of 
the crypto tests, then modify this PR not to generate it and remove the 
generator method. That will allow us to push the multiply intrinsic.
2. As a separate step, can you (or whoever implemented the x86 intrinsic) check 
whether it gives any benefit? If not, then raise a JIRA and PR either to 
disable generation on x86 or, if we have also dropped it for aarch64, to remove 
it completely.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30941#issuecomment-4777433148
