On Mon, 22 Jun 2026 21:02:35 GMT, Ferenc Rakoczi <[email protected]> wrote:
> Apparently, it is C2 generating optimal code.

Ok, so it seems like we don't need to generate the assign intrinsic on AArch64.

@ferakocz Can you provide numbers to confirm that omitting generation of the assign intrinsic makes no difference to the crypto computation?

An interesting follow-up question is whether we need the intrinsic on x86 and, if so, why? It's a little more complicated since we have slightly different code depending on whether UseAVX > 2, but C2 also knows about and responds to UseAVX, so one might expect it to also generate good-to-optimal code for x86.

I ran the micro benchmark on my x86 box (AMD Ryzen 9 7900, 12 cores), first with the assign intrinsic generation enabled and then with it disabled, and got the following results:

Intrinsic enabled:

> PolynomialP256Bench.benchAssign  true  thrpt  8  9712.115 ± 321.576  ops/s

Intrinsic disabled:

> PolynomialP256Bench.benchAssign  true  thrpt  8  18814.815 ± 7985.590  ops/s

I'm not sure why the variance is so high in the disabled case, but for this HW (where we have AVX2 and AVX512 support) it looks like a similar picture.

So, . . .

@ferakocz

1. If you can show that the assign intrinsic does not improve performance of the crypto tests, then modify this PR not to generate it and remove the generator method. That will allow us to push the multiply intrinsic.
2. As a separate step, can you (or whoever implemented the x86 intrinsic) check whether it gives any benefit? If not, raise a JIRA and PR either to disable generation on x86 or, if we have also dropped it for AArch64, to remove it completely.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30941#issuecomment-4777433148
