Re: RFR: 8354242: VectorAPI: combine vector not operation with compare [v2]

erifan Thu, 24 Apr 2025 02:52:33 -0700

On Wed, 23 Apr 2025 12:09:51 GMT, Jatin Bhateja <[email protected]> wrote:


>> erifan has updated the pull request with a new target base due to a merge or 
>> a rebase. The incremental webrev excludes the unrelated changes brought in 
>> by the merge/rebase. The pull request contains two additional commits since 
>> the last revision:
>> 
>>  - Merge branch 'master' into JDK-8354242
>>  - 8354242: VectorAPI: combine vector not operation with compare
>>    
>>    This patch optimizes the following patterns:
>>    For integer types:
>>    ```
>>    (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1))
>>        => (VectorMaskCmp src1 src2 ncond)
>>    (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1))
>>        => (VectorMaskCmp src1 src2 ncond)
>>    ```
>>    cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the
>>    negative comparison of cond.
>>    
>>    For float and double types:
>>    ```
>>    (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
>>        => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>>    (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1))
>>        => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>>    ```
>>    cond can be eq or ne.
>>    
>>    Benchmarks on Nvidia Grace machine with 128-bit SVE2:
>>    With option `-XX:UseSVE=2`:
>>    ```
>>    Benchmark                 Unit    Before          Score Error     After   
>>         Score Error     Uplift
>>    testCompareEQMaskNotByte  ops/s   7912127.225     2677.289518     
>> 10266136.26     8955.008548     1.29
>>    testCompareEQMaskNotDouble        ops/s   884737.6799     446.963779      
>> 1179760.772     448.031844      1.33
>>    testCompareEQMaskNotFloat ops/s   1765045.787     682.332214      
>> 2359520.803     896.305743      1.33
>>    testCompareEQMaskNotInt           ops/s   1787221.411     977.743935      
>> 2353952.519     960.069976      1.31
>>    testCompareEQMaskNotLong  ops/s   895297.1974     673.44808       
>> 1178449.02      323.804205      1.31
>>    testCompareEQMaskNotShort ops/s   3339987.002     3415.2226       
>> 4712761.965     2110.862053     1.41
>>    testCompareGEMaskNotByte  ops/s   7907615.16      4094.243652     
>> 10251646.9      9486.699831     1.29
>>    testCompareGEMaskNotInt           ops/s   1683738.958     4233.813092     
>> 2352855.205     1251.952546     1.39
>>    testCompareGEMaskNotLong  ops/s   854496.1561     8594.598885     
>> 1177811.493     521.1229        1.37
>>    testCompareGEMaskNotShort ops/s   3341860.309     1578.975338     
>> 4714008.434     1681.10365      1.41
>>    testCompareGTMaskNotByte  ops/s   7910823.674     2993.367032     
>> 10245063.58     9774.75138      1.29
>>    testCompareGTMaskNotInt           ops/s   1673393.928     3153.099431     
>> 2353654.521     1190.848583     1.4
>>    testCompareGTMaskNotLong  ops/s   849405.9159     2432.858159     
>> 1177952.041     359.96413       1.38
>>    testCompareGTMaskNotShort ops/s   3339509.141     ...
>
> src/hotspot/share/opto/vectornode.cpp line 2234:
> 
>> 2232:   // XorV/XorVMask is commutative, swap 
>> VectorMaskCmp/Op_VectorMaskCast to in1.
>> 2233:   if (in1->Opcode() != Op_VectorMaskCmp && in1->Opcode() != 
>> Op_VectorMaskCast) {
>> 2234:     swap(in1, in2);
> 
> Swapping inputs like this without refreshing GVN bookkeeping is not safe. I 
> guess you wanted to use Node::swap_edges.

The edges are not swapped, but two variables in1 and in2

> src/hotspot/share/opto/vectornode.cpp line 2243:
> 
>> 2241:     in1 = in1->in(1);
>> 2242:   }
>> 2243:   if (in1->Opcode() != Op_VectorMaskCmp || in1->outcnt() > 1 ||
> 
> Checks on outcnt on line 2243 and 2238 can be removed. Idealization looks for 
> a specific graph palette and replaces it with a new node whose inputs are the 
> same as the inputs of the palette. GVN will do the retention job if any 
> intermediate node has users beyond the pattern being replaced.

Thanks for telling me this information. Another more important reason to check 
outcnt here is to prevent this optimization when the uses of VectorMaskCmp is 
greater than 1, because this optimization may not be worthwhile. For example:


  public static void testVectorMaskCmp() {
    IntVector bv = IntVector.fromArray(I_SPECIES, ib, 0);
    IntVector av = IntVector.fromArray(I_SPECIES, ia, 0);
    VectorMask<Integer> m1 = av.compare(VectorOperators.NE, bv);  // two uses
    VectorMask<Integer> m2 =m1.not();
    m1.intoArray(m, 0);
    av.lanewise(VectorOperators.ABS, m2).intoArray(ia, 0);
  }


If we do not check outcnt and still do this optimization, two VectorMaskCmp 
nodes will be generated, and finally two VectorMaskCmp instructions will be 
generated. This is unreasonable because VectorMaskCmp has much higher latency 
than xor instruction on aarch64.

> src/hotspot/share/opto/vectornode.cpp line 2265:
> 
>> 2263:     vmcmp = new VectorMaskCastNode(phase->transform(vmcmp), vmcast_vt);
>> 2264:   }
>> 2265:   return vmcmp;
> 
> It would be preferable if you could kindly re-factor the code such that we 
> only call VectorNode::Ideal once at return to comply with aesthetics of other 
> idealization routines.

Ok, I'll change it in the next commit.

> test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 38:
> 
>> 36:  * @summary test combining vector not operation with compare
>> 37:  * @modules jdk.incubator.vector
>> 38:  * @requires ((os.arch!="x86" & os.arch!="i386" & os.arch!="amd64" & 
>> os.arch!="x86_64") | vm.cpu.features ~= ".*avx.*")
> 
> You can remove this platform limitation and forward the constraints to @IR 
> rules using applyIfCPUFeatureOr

Since this is a platform independent optimization, I tend to use this 
`@requires` because it's simpler.  If we use `applyIfCPUFeatureOr`, we need to 
add the same restriction before each test. In addition, if a new architecture 
supports the vector node, this test may not cover it. What do you think?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2057901771
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2057976512
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2057982231
PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2057993569

Re: RFR: 8354242: VectorAPI: combine vector not operation with compare [v2]

Reply via email to