On Fri, 29 May 2026 11:57:00 GMT, Emanuel Peter <[email protected]> wrote:
>> Taking Neoverse V2 as an example, `SVE mla` and `mul` for `long` types have
>> the same execution latency.
>>
>> The main idea behind this change is to improve parallelism between the two
>> inputs of `AddVL`. When `add_input` is `MulVL` or `MLA` and has no
>> dependency on `MulVL` (the other input of `AddVL`), we can benefit from
>> executing `add_input` and `MulVL` in parallel. (However, this parallelism
>> depends on hardware scheduling behavior and available resources, so it does
>> not apply uniformly across all microarchitectures. Therefore,
>> `AvoidMLAChain` is applied selectively.)
>>
>> However, if `add_input` is also an input to `MulVL`, then it cannot execute
>> in parallel with `MulVL`. In this case, fusing `MulVL` with `AddVL` can
>> reduce the overall execution latency.
>>
>> I added a new benchmark, `longAddDotProductShared`, in
>> `test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java`. This
>> benchmark shows that without these exception lines, performance drops by
>> approximately `11%` compared to the current implementation.
>
>> The main idea behind this change is to improve parallelism between the two
>> inputs of AddVL. When add_input is MulVL or MLA and has no dependency on
>> MulVL (the other input of AddVL), we can benefit from executing add_input
>> and MulVL in parallel. (However, this parallelism depends on hardware
>> scheduling behavior and available resources, so it does not apply uniformly
>> across all microarchitectures. Therefore, AvoidMLAChain is applied
>> selectively.)
>
> I think such an explanation would be great to have somewhere around in the
> code!
Done. Thanks!
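
For readers of the thread, the two cases discussed above can be sketched in scalar Java roughly as follows. This is only an illustration: the class and method names are hypothetical, it is not the code of `longAddDotProductShared`, and whether C2 turns these exact loops into the `AddVL`/`MulVL` shapes described above is an assumption.

```java
// Hypothetical sketch; names are illustrative and not taken from the PR.
final class MlaPatternSketch {

    // Independent case: the two products share no inputs, so a core with
    // enough multiply resources may execute them in parallel before the add.
    static long[] independentProducts(long[] a, long[] b, long[] c, long[] d) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] * b[i] + c[i] * d[i];
        }
        return r;
    }

    // Shared case: the first product t is both the add input and an input
    // to the second multiply, so the second multiply must wait for t.
    // Fusing that multiply with the add (MLA) is the case where the
    // thread expects a latency benefit.
    static long[] sharedInput(long[] a, long[] b, long[] c) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            long t = a[i] * b[i];
            r[i] = t + t * c[i];
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3};
        long[] b = {4, 5, 6};
        long[] c = {7, 8, 9};
        long[] d = {1, 1, 1};
        System.out.println(java.util.Arrays.toString(independentProducts(a, b, c, d)));
        System.out.println(java.util.Arrays.toString(sharedInput(a, b, c)));
    }
}
```

In the first method nothing forces an ordering between the two multiplies; in the second, the dependency through `t` serializes them regardless of how many multiply units are available.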
>> Yes. For a simple reduction loop like:
>>
>>     for (int i = 0; i < a.length; i++) {
>>         long val = a[i] * b[i];
>>         res += val;
>>     }
>>
>> the vectorized form generates a `phi` node with an `MLA` input in the loop,
>> and we have tests covering this scenario in
>> `test/hotspot/jtreg/compiler/vectorization/TestVmlaAArch64.java`.
>>
>> Since this change only affects `VectorNode` handling, I have not observed
>> any `if/else` branches in vectorized loops that trigger this pattern so far.
>> It could theoretically happen in the future if autovectorization gains
>> support for control flow inside loops, but I have not seen a concrete case
>> yet.
>
> Right. I must have forgotten that all operations are vector operations.
> Well, I suppose we could hit such a pattern with the Vector API. Have you
> checked for that? Would it make sense to add a Vector API test that mirrors
> your auto-vectorization pattern, and also one that has if/else patterns?
Done.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/30237#discussion_r3357210209
PR Review Comment: https://git.openjdk.org/jdk/pull/30237#discussion_r3357331852