On Fri, 6 Jun 2025 10:38:11 GMT, erifan <d...@openjdk.org> wrote:

>> This patch optimizes the following patterns:
>> For integer types:
>> 
>> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1))
>>     => (VectorMaskCmp src1 src2 ncond)
>> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1))
>>     => (VectorMaskCmp src1 src2 ncond)
>> 
>> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the 
>> negative comparison of cond.
>> 
>> For float and double types:
>> 
>> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
>>     => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1))
>>     => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>> 
>> cond can be eq or ne.
>> 
>> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option 
>> `-XX:UseSVE=2`:
>> 
>> Benchmark                    Unit    Before          Score Error     After   
>>         Score Error     Uplift
>> testCompareEQMaskNotByte     ops/s   7912127.225     2677.289518     
>> 10266136.26     8955.008548     1.29
>> testCompareEQMaskNotDouble   ops/s   884737.6799     446.963779      
>> 1179760.772     448.031844      1.33
>> testCompareEQMaskNotFloat    ops/s   1765045.787     682.332214      
>> 2359520.803     896.305743      1.33
>> testCompareEQMaskNotInt              ops/s   1787221.411     977.743935      
>> 2353952.519     960.069976      1.31
>> testCompareEQMaskNotLong     ops/s   895297.1974     673.44808       
>> 1178449.02      323.804205      1.31
>> testCompareEQMaskNotShort    ops/s   3339987.002     3415.2226       
>> 4712761.965     2110.862053     1.41
>> testCompareGEMaskNotByte     ops/s   7907615.16      4094.243652     
>> 10251646.9      9486.699831     1.29
>> testCompareGEMaskNotInt              ops/s   1683738.958     4233.813092     
>> 2352855.205     1251.952546     1.39
>> testCompareGEMaskNotLong     ops/s   854496.1561     8594.598885     
>> 1177811.493     521.1229        1.37
>> testCompareGEMaskNotShort    ops/s   3341860.309     1578.975338     
>> 4714008.434     1681.10365      1.41
>> testCompareGTMaskNotByte     ops/s   7910823.674     2993.367032     
>> 10245063.58     9774.75138      1.29
>> testCompareGTMaskNotInt              ops/s   1673393.928     3153.099431     
>> 2353654.521     1190.848583     1.4
>> testCompareGTMaskNotLong     ops/s   849405.9159     2432.858159     
>> 1177952.041     359.96413       1.38
>> testCompareGTMaskNotShort    ops/s   3339509.141     3339.976585     
>> 4711442.496     2673.364893     1.41
>> testCompareLEMaskNotByte     ops/s   7911340.004     3114.69191      
>> 10231626.5      27134.20035     1.29
>> testCompareLEMaskNotInt              ops/s   1675812.113     1340.969885     
>> 2353255.341     1452.4522       1.4
>> testCompareLEMaskNotLong     ops/s   848862.8036     6564.841731     
>> 1177763.623     539.290106      1.38
>> testCompareLEMaskNotShort    ops/s   3324951.54      2380.29473      
>> 4712116.251     1544.559684     1.41
>> testCompareLTMaskNotByte     ops/s   7910390.844     2630.861436     
>> 10239567.69     6487.441672     1.29
>> testCompareLTMaskNotInt              ops/s   16721...
>
> erifan has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   Support negating unsigned comparison for BoolTest::mask
>   
>   Added a static method `negate_mask(mask btm)` into BoolTest class to
>   negate both signed and unsigned comparison.

@erifan Thanks for the updates, I have some more comments :)

src/hotspot/share/opto/subnode.hpp line 333:

> 331:   mask negate( ) const { return mask(_test^4); }
> 332:   // Return the negative mask for the given mask, for both signed and 
> unsigned comparison.
> 333:   static mask negate_mask(mask btm) { return mask(btm^4); }

Suggestion:

  static mask negate_mask(mask btm) { return mask(btm ^ 4); }


https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md

> Use spaces around operators, especially comparisons and assignments. 
> (Relaxable for boolean expressions and high-precedence operators in classic 
> math-style formulas.)

src/hotspot/share/opto/vectornode.cpp line 2226:

> 2224: 
> 2225:   const TypeVect* vector_mask_cast_vt = nullptr;
> 2226:   // in1 should be single used, otherwise the optimization may be 
> unprofitable.

Suggestion:

  // in1 should only have a single use, otherwise the optimization may be 
unprofitable.

src/hotspot/share/opto/vectornode.cpp line 2227:

> 2225:   const TypeVect* vector_mask_cast_vt = nullptr;
> 2226:   // in1 should be single used, otherwise the optimization may be 
> unprofitable.
> 2227:   if (in1->Opcode() == Op_VectorMaskCast && in1->outcnt() == 1 && 
> in1->in(1)->Opcode() == Op_VectorMaskCmp) {

`in1->in(1)->Opcode() == Op_VectorMaskCmp`
Is this check here even necessary? Because we check it below again, right?
`in1->Opcode() != Op_VectorMaskCmp`

src/hotspot/share/opto/vectornode.cpp line 2237:

> 2235:       !VectorNode::is_all_ones_vector(in2)) {
> 2236:     return nullptr;
> 2237:   }

Similarly here: do you have tests for these conditions, that we do not optimize 
if any of these fail?

src/hotspot/share/opto/vectornode.cpp line 2239:

> 2237:   }
> 2238: 
> 2239:   BoolTest::mask neg_cond = BoolTest::negate_mask(((VectorMaskCmpNode*) 
> in1)->get_predicate());

Suggestion:

  BoolTest::mask neg_cond = 
BoolTest::negate_mask((in1->as_VectorMaskCmp())->get_predicate());

Does that compile? It would be prefereable.

src/hotspot/share/opto/vectornode.cpp line 2243:

> 2241:   const TypeVect* vt = in1->as_Vector()->vect_type();
> 2242:   Node* res = new VectorMaskCmpNode(neg_cond, in1->in(1), in1->in(2),
> 2243:                                       predicate_node, vt);

Suggestion:

  Node* res = new VectorMaskCmpNode(neg_cond, in1->in(1), in1->in(2),
                                    predicate_node, vt);

Alignment

test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 158:

> 156:         } else if (op == VectorOperators.UGT) {
> 157:             Asserts.assertEquals(compareUnsigned(a, b) <= 0, r);
> 158:         }

Please refactor it as a `switch`. And add a `default` case where you throw some 
`RuntimeException`. just to make sure we are not missing anything :)

test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 244:

> 242:         testCompareMaskNotByte(VectorOperators.EQ, m -> m.not());
> 243:         testCompareMaskNotByte(VectorOperators.EQ, m -> 
> m.xor(B_SPECIES.maskAll(true)));
> 244:     }

Could it happen that the verification is inlined in the test body?

Currently, the verification is probably inlined, but the code there is not 
vectorized. But what if one day the auto-vectorizer is smart enough and 
vectorizes it, and creates vectors that we currently check `count ...= 0`?

At least, you could ensure that the verification does not get inlined, with 
`@DontInline`.

What do you think?

test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 623:

> 621:         testCompareMaskNotFloat(VectorOperators.NE, fa, fninf, m -> 
> m.not());
> 622:         testCompareMaskNotFloat(VectorOperators.NE, fa, fninf, m -> 
> m.xor(F_SPECIES.maskAll(true)));
> 623:     }

Something makes me a little nervous about the correctness in these IR rules:

You are checking `IRNode.XOR_VL, "= 0"`. But you are comparing `floats`. Does 
that make sense?

Also: in the whole test, there is no single case where you expect the `XOR_V` 
to still be in the IR. I think it would be good to have one "control test" at 
least, where you test in a very similar pattern that this node is still there, 
and does not optimize. Maybe you can use a case where the construct has 
multiple uses, and is therefore not profitable to be optimized. What do you 
think?

test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 692:

> 690:         TestFramework testFramework = new TestFramework();
> 691:         testFramework.addFlags("--add-modules=jdk.incubator.vector");
> 692:         testFramework.setDefaultWarmup(10000);

The default is `2000` is that not enough?

Increasing it means the test runs slower, here probably about 5x.

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/24674#pullrequestreview-2915634768
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139189790
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139199315
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139201553
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139206813
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139216182
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139217776
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139227321
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139234183
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139239614
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2139243617

Reply via email to