Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v2]

Vladimir Ivanov Tue, 12 Nov 2024 13:54:22 -0800

On Sun, 10 Nov 2024 07:40:30 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I 
>> don't see a compelling reason to do so. IMO matcher functionality is more 
>> than enough to cover `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` + 
>> `has_int_inputs()` predicate. For `MulUIVL` you additionally do input 
>> rewiring (using `forward_masked_input`), but (1)  `AndV src (Replicate 
>> 0xFFFFFFFF))` operands can be easily detected on matcher side (with an extra 
>> AD instruction); and (2) such optimization is limited because it is valid 
>> only for `0xFFFFFFFF` case while `has_uint_inputs() == true` for `C <=  
>> 0xFFFFFFFF`.
>> 
>> So, IMO `MulIVL` and `MulUIVL` nodes just add noise in Ideal graph without 
>> improving situation during matching.
>
>> In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I 
>> don't see a compelling reason to do so. IMO matcher functionality is more 
>> than enough to cover `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` + 
>> `has_int_inputs()` predicate. For `MulUIVL` you additionally do input 
>> rewiring (using `forward_masked_input`), but (1) `AndV src (Replicate 
>> 0xFFFFFFFF))` operands can be easily detected on matcher side (with an extra 
>> AD instruction); and (2) such optimization is limited because it is valid 
>> only for `0xFFFFFFFF` case while `has_uint_inputs() == true` for `C <= 
>> 0xFFFFFFFF`.
>> 
>> So, IMO `MulIVL` and `MulUIVL` nodes just add noise in Ideal graph without 
>> improving situation during matching.
> 
> Hi @iwanowww ,
> Problem occurs only if AndV gets shared; in such a case, matcher will not be 
> able to identify the constrained multiplication pattern and absorb the 
> masking pattern. Specialized IR overrules such limitations and shields the 
> pattern from downstream optimization passes, thereby removing any 
> non-determinism. In addition, it facilitates forwarding inputs to the 
> multiplier, the new IR is explicit in its semantics of considering only lower 
> doublewords of quadword lanes for multiplication, hence we can safely save 
> emitting redundant input masking instructions. We already have specialized IR 
> nodes like MulAddVS2VINode and I see these new IR nodes similar to it.

@jatin-bhateja in case when `AndV` is shared, it can't be eliminated unless all 
users absorb it. For such cases, matcher can perform adhoc node cloning, but in 
this particular case it looks like an overkill either way. IMO the pattern is 
too niche to focus on it (either to justify input forwarding or adhoc handling 
on matcher side). 

It's good you mentioned `MulAddVS2VI`. On one hand, VNNI operations are more 
complex (similar to FMA), so such complexity *may* be justified there. On the 
other hand, it doesn't look like VNNI support in C2 age well. It is tied to 
auto-vectorizer and, by now, Vector API doesn't benefit from it. So, instead of 
doubling down on `MulAddVS2VI` path, I'd prefer to leave it aside and 
reimplement it later in a more maintainable manner.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2471654154

Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v2]

Reply via email to