Re: RFR: 8372153: AArch64: Performance regression in long reduction microbenchmarks after JDK-8340093 [v2]

Fei Gao Mon, 08 Jun 2026 01:18:30 -0700

On Fri, 29 May 2026 11:57:00 GMT, Emanuel Peter wrote:


>> Taking Neoverse V2 as an example, `SVE mla` and `mul` for `long` types have 
>> the same execution latency.
>> 
>> The main idea behind this change is to improve parallelism between the two 
>> inputs of `AddVL`. When `add_input` is `MulVL` or `MLA` and has no 
>> dependency on `MulVL` (the other input of `AddVL`), we can benefit from 
>> executing `add_input` and `MulVL` in parallel. (However, this parallelism 
>> depends on hardware scheduling behavior and available resources, so it does 
>> not apply uniformly across all microarchitectures. Therefore, 
>> `AvoidMLAChain` is applied selectively.)
>> 
>> However, if `add_input` is also an input to `MulVL`, then it cannot execute 
>> in parallel with `MulVL`. In this case, fusing `MulVL` with `AddVL` can 
>> reduce the overall execution latency.
>> 
>> I added a new benchmark, `longAddDotProductShared`, in 
>> `test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java`. This 
>> benchmark shows that without these exception lines, performance drops by 
>> approximately `11%` compared to the current implementation.
>
>> The main idea behind this change is to improve parallelism between the two 
>> inputs of AddVL. When add_input is MulVL or MLA and has no dependency on 
>> MulVL (the other input of AddVL), we can benefit from executing add_input 
>> and MulVL in parallel. (However, this parallelism depends on hardware 
>> scheduling behavior and available resources, so it does not apply uniformly 
>> across all microarchitectures. Therefore, AvoidMLAChain is applied 
>> selectively.)
> 
> I think such an explanation would be great to have somewhere around in the 
> code!

Done. Thanks!
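
For readers following the thread, the two dependency shapes described in the
quoted explanation can be sketched in scalar Java as below. This is only an
illustration under my own assumptions: the class and method names
(`MulAddShapes`, `independent`, `shared`) are hypothetical, the loops are not
the body of `longAddDotProductShared`, and whether C2 vectorizes them into
`MulVL`/`AddVL` depends on the platform and flags.

```java
public class MulAddShapes {
    // Independent shape: the two products share no inputs, so the product
    // feeding the add (add_input) and the other multiply can, in principle,
    // execute in parallel before the final add.
    static void independent(long[] a, long[] b, long[] c, long[] d, long[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = a[i] * b[i] + c[i] * d[i];
        }
    }

    // Shared shape: t is both the add input and an input of the second
    // multiply, so the second multiply has to wait for t. Here fusing the
    // second multiply with the add shortens the dependency chain.
    static void shared(long[] a, long[] b, long[] c, long[] r) {
        for (int i = 0; i < r.length; i++) {
            long t = a[i] * b[i];
            r[i] = t + t * c[i];
        }
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3}, b = {4, 5, 6}, c = {7, 8, 9}, d = {1, 1, 1};
        long[] r = new long[3];
        independent(a, b, c, d, r);
        System.out.println(java.util.Arrays.toString(r));
        shared(a, b, c, r);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

In the first loop the add combines two unrelated products; in the second the
add input feeds the multiply, which is the case the exception lines target.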

>> Yes. For a simple reduction loop like:
>> 
>> for (int i = 0; i < a.length; i++) {
>>     long val = a[i] * b[i];
>>     res += val;
>> }
>> 
>> the vectorized form generates a `phi` node with an `MLA` input in the loop, 
>> and we have tests covering this scenario in 
>> `test/hotspot/jtreg/compiler/vectorization/TestVmlaAArch64.java`.
>> 
>> Since this change only affects `VectorNode` handling, I have not observed 
>> any `if/else` branches in vectorized loops that trigger this pattern so far. 
>> It could theoretically happen in the future if autovectorization gains 
>> support for control flow inside loops, but I have not seen a concrete case 
>> yet.
>
> Right. I must have forgotten that all operations are vector operations.
> Well, I suppose we could hit such a pattern with the Vector API. Have you
> checked for that? Would it make sense to add a Vector API test that mirrors
> your auto-vectorization pattern, and also one that has if/else patterns?

Done.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30237#discussion_r3357210209
PR Review Comment: https://git.openjdk.org/jdk/pull/30237#discussion_r3357331852
