Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v2]

Jatin Bhateja Thu, 21 Nov 2024 01:06:41 -0800

On Tue, 12 Nov 2024 21:49:22 GMT, Vladimir Ivanov <vliva...@openjdk.org> wrote:


>>> In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I 
>>> don't see a compelling reason to do so. IMO matcher functionality is more 
>>> than enough to cover `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` + 
>>> `has_int_inputs()` predicate. For `MulUIVL` you additionally do input 
>>> rewiring (using `forward_masked_input`), but (1) `AndV src (Replicate 
>>> 0xFFFFFFFF))` operands can be easily detected on matcher side (with an 
>>> extra AD instruction); and (2) such optimization is limited because it is 
>>> valid only for `0xFFFFFFFF` case while `has_uint_inputs() == true` for `C 
>>> <= 0xFFFFFFFF`.
>>> 
>>> So, IMO `MulIVL` and `MulUIVL` nodes just add noise in Ideal graph without 
>>> improving situation during matching.
>> 
>> Hi @iwanowww ,
>> Problem occurs only if AndV gets shared; in such a case, matcher will not be 
>> able to identify the constrained multiplication pattern and absorb the 
>> masking pattern. Specialized IR overrules such limitations and shields the 
>> pattern from downstream optimization passes, thereby removing any 
>> non-determinism. In addition, it facilitates forwarding inputs to the 
>> multiplier, the new IR is explicit in its semantics of considering only 
>> lower doublewords of quadword lanes for multiplication, hence we can safely 
>> save emitting redundant input masking instructions. We already have 
>> specialized IR nodes like MulAddVS2VINode and I see these new IR nodes 
>> similar to it.
>
> @jatin-bhateja in case when `AndV` is shared, it can't be eliminated unless 
> all users absorb it. For such cases, matcher can perform adhoc node cloning, 
> but in this particular case it looks like an overkill either way. IMO the 
> pattern is too niche to focus on it (either to justify input forwarding or 
> adhoc handling on matcher side). 
> 
> It's good you mentioned `MulAddVS2VI`. On one hand, VNNI operations are more 
> complex (similar to FMA), so such complexity *may* be justified there. On the 
> other hand, it doesn't look like VNNI support in C2 age well. It is tied to 
> auto-vectorizer and, by now, Vector API doesn't benefit from it. So, instead 
> of doubling down on `MulAddVS2VI` path, I'd prefer to leave it aside and 
> reimplement it later in a more maintainable manner.

Hi @iwanowww , your closing comments addressed, kindly re-approve, the change 
is mainly in newly added test file.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2490437406

Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v2]

Reply via email to