On Tue, 3 Mar 2026 06:12:23 GMT, Xiaohong Gong <[email protected]> wrote:
> Duplicate `ptrue` (`MaskAll`) instructions are generated with different
> predicate registers on SVE when multiple `VectorMask.not()` operations exist.
> This increases the predicate register pressure and reduces performance,
> especially after the loop is unrolled.
>
> Root cause: the matcher clones `MaskAll` for each `not` pattern (i.e.
> `(XorVMask (MaskAll m1))`), but SVE has no match rule for that pattern alone,
> and the cloned `MaskAll` nodes are not shared with each other.
>
> Since SVE has rules for the `andNot` pattern:
>
>     match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1))));
>
> the `MaskAll` node should be cloned only when it is part of the `andNot`
> pattern instead.
>
> A second issue: `AndVMask`, `OrVMask`, and `XorVMask` are not in the
> matcher's commutative vector op list, so their operands are never swapped. As
> a result, the `andNot` rule does not match when the `XorVMask` operands
> appear in the opposite order (e.g. `(XorVMask (MaskAll m1) pm)`).
>
> This patch fixes both issues by 1) limiting when `MaskAll` is cloned and 2)
> adding the three binary mask bitwise IRs to the commutative op list.
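
To make the two patterns above concrete, here is a minimal scalar sketch in plain Java. It is not the Vector API and not C2 matcher code: masks are modeled as `boolean[]`, and the class and method names (`MaskModel`, `maskAll`, `andNotSwapped`, and so on) are hypothetical, chosen only for illustration. It shows that `not` is an XOR with an all-true mask, and that `andNot` yields the same result whichever side of the XOR the all-true mask is on, which is why the swapped operand order should also be matchable.

```java
import java.util.Arrays;

public class MaskModel {
    // Models MaskAll with an all-true value: every lane is set.
    static boolean[] maskAll(int lanes) {
        boolean[] m = new boolean[lanes];
        Arrays.fill(m, true);
        return m;
    }

    // Models XorVMask: lane-wise exclusive OR.
    static boolean[] xor(boolean[] a, boolean[] b) {
        boolean[] r = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] ^ b[i];
        }
        return r;
    }

    // Models AndVMask: lane-wise AND.
    static boolean[] and(boolean[] a, boolean[] b) {
        boolean[] r = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] && b[i];
        }
        return r;
    }

    // Shape (AndVMask pn (XorVMask pm (MaskAll m1))).
    static boolean[] andNot(boolean[] pn, boolean[] pm) {
        return and(pn, xor(pm, maskAll(pm.length)));
    }

    // Same computation with the XOR operands in the opposite order:
    // (AndVMask pn (XorVMask (MaskAll m1) pm)).
    static boolean[] andNotSwapped(boolean[] pn, boolean[] pm) {
        return and(pn, xor(maskAll(pm.length), pm));
    }

    public static void main(String[] args) {
        boolean[] pn = {true, true, false, false};
        boolean[] pm = {true, false, true, false};
        System.out.println(Arrays.toString(andNot(pn, pm)));
        System.out.println(Arrays.equals(andNot(pn, pm), andNotSwapped(pn, pm)));
    }
}
```

Both forms compute `pn & ~pm` lane by lane, so a rule that only recognizes one operand order misses an equivalent expression.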
>
> The following are the performance results of the newly added JMH benchmark,
> tested on V1 and Grace (V2) machines respectively:
>
> V1 (SVE machine with 256-bit vector length):
>
> Benchmark                                    Mode   Threads  Samples  Unit    size  Before     After      Gain
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  256   54465.231  74374.960  1.365
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  512   29156.881  39601.358  1.358
> MaskLogicOperationsBenchmark.byteMaskAndNot  thrpt  1        30       ops/ms  1024  15169.894  20272.379  1.336
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  256   15408.510  19808.722  1.285
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  512   7906.952   10297.837  1.302
> MaskLogicOperationsBenchmark.intMaskAndNot   thrpt  1        30       ops/ms  1024  3767.122   5097.853   1.353
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  256   7762.614   10534.290  1.357
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  512   3976.759   5123.445   1.288
> MaskLogicOperationsBenchmark.longMaskAndNot  thrpt  1        30       ops/ms  1024  1937.389   2573.394   1.328
> MaskLogicOperationsB...

Hi, could anyone please take a look at this PR? Thanks in advance!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/30013#issuecomment-4002374980
