On Sun, 10 Nov 2024 07:40:30 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I >> don't see a compelling reason to do so. IMO matcher functionality is more >> than enough to cover `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` + >> `has_int_inputs()` predicate. For `MulUIVL` you additionally do input >> rewiring (using `forward_masked_input`), but (1) `AndV src (Replicate >> 0xFFFFFFFF))` operands can be easily detected on matcher side (with an extra >> AD instruction); and (2) such optimization is limited because it is valid >> only for `0xFFFFFFFF` case while `has_uint_inputs() == true` for `C <= >> 0xFFFFFFFF`. >> >> So, IMO `MulIVL` and `MulUIVL` nodes just add noise in Ideal graph without >> improving situation during matching. > >> In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I >> don't see a compelling reason to do so. IMO matcher functionality is more >> than enough to cover `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` + >> `has_int_inputs()` predicate. For `MulUIVL` you additionally do input >> rewiring (using `forward_masked_input`), but (1) `AndV src (Replicate >> 0xFFFFFFFF))` operands can be easily detected on matcher side (with an extra >> AD instruction); and (2) such optimization is limited because it is valid >> only for `0xFFFFFFFF` case while `has_uint_inputs() == true` for `C <= >> 0xFFFFFFFF`. >> >> So, IMO `MulIVL` and `MulUIVL` nodes just add noise in Ideal graph without >> improving situation during matching. > > Hi @iwanowww , > Problem occurs only if AndV gets shared; in such a case, matcher will not be > able to identify the constrained multiplication pattern and absorb the > masking pattern. Specialized IR overrules such limitations and shields the > pattern from downstream optimization passes, thereby removing any > non-determinism. In addition, it facilitates forwarding inputs to the > multiplier, the new IR is explicit in its semantics of considering only lower > doublewords of quadword lanes for multiplication, hence we can safely save > emitting redundant input masking instructions. We already have specialized IR > nodes like MulAddVS2VINode and I see these new IR nodes similar to it. @jatin-bhateja in case when `AndV` is shared, it can't be eliminated unless all users absorb it. For such cases, matcher can perform adhoc node cloning, but in this particular case it looks like an overkill either way. IMO the pattern is too niche to focus on it (either to justify input forwarding or adhoc handling on matcher side). It's good you mentioned `MulAddVS2VI`. On one hand, VNNI operations are more complex (similar to FMA), so such complexity *may* be justified there. On the other hand, it doesn't look like VNNI support in C2 age well. It is tied to auto-vectorizer and, by now, Vector API doesn't benefit from it. So, instead of doubling down on `MulAddVS2VI` path, I'd prefer to leave it aside and reimplement it later in a more maintainable manner. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2471654154