> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Monday, June 14, 2021 4:55 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; Marcus Shawcroft
> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on
> inverted operands
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi Richard,
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Monday, June 14, 2021 3:50 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> >> <richard.earns...@arm.com>; Marcus Shawcroft
> >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> >> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks
> >> on inverted operands
> >>
> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > Hi All,
> >> >
> >> > This RFC is trying to address the following inefficiency when
> >> > vectorizing conditional statements with SVE.
> >> >
> >> > Consider the case
> >> >
> >> > void f10(double * restrict z, double * restrict w, double * restrict x,
> >> >   double * restrict y, int n)
> >> > {
> >> >     for (int i = 0; i < n; i++) {
> >> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
> >> >     }
> >> > }
> >> >
> >> >
> >> > For which we currently generate at -O3:
> >> >
> >> > f10:
> >> >         cmp     w4, 0
> >> >         ble     .L1
> >> >         mov     x5, 0
> >> >         whilelo p1.d, wzr, w4
> >> >         ptrue   p3.b, all
> >> > .L3:
> >> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
> >> >         fcmgt   p2.d, p1/z, z1.d, #0.0
> >> >         fcmgt   p0.d, p3/z, z1.d, #0.0
> >> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
> >> >         bic     p0.b, p3/z, p1.b, p0.b
> >> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
> >> >         fsub    z0.d, p0/m, z0.d, z1.d
> >> >         movprfx z0.d, p2/m, z1.d
> >> >         fadd    z0.d, p2/m, z0.d, z2.d
> >> >         st1d    z0.d, p1, [x0, x5, lsl 3]
> >> >         incd    x5
> >> >         whilelo p1.d, w5, w4
> >> >         b.any   .L3
> >> > .L1:
> >> >         ret
> >> >
> >> > Notice that the condition for the else branch duplicates the
> >> > predicate of the then branch and then uses BIC to negate the result.
> >> >
> >> > The reason for this is that during instruction generation in the
> >> > vectorizer we emit
> >> >
> >> >   mask__41.11_66 = vect__4.10_64 > vect_cst__65;
> >> >   vec_mask_and_69 = mask__41.11_66 & loop_mask_63;
> >> >   vec_mask_and_71 = mask__41.11_66 & loop_mask_63;
> >> >   mask__43.16_73 = ~mask__41.11_66;
> >> >   vec_mask_and_76 = mask__43.16_73 & loop_mask_63;
> >> >   vec_mask_and_78 = mask__43.16_73 & loop_mask_63;
> >> >
> >> > which ultimately gets optimized to
> >> >
> >> >   mask__41.11_66 = vect__4.10_64 > { 0.0, ... };
> >> >   vec_mask_and_69 = loop_mask_63 & mask__41.11_66;
> >> >   mask__43.16_73 = ~mask__41.11_66;
> >> >   vec_mask_and_76 = loop_mask_63 & mask__43.16_73;
> >> >
> >> > Notice how the negate is on the operation and not the predicate
> >> > resulting from the operation.  When this is expanded this turns
> >> > into RTL where the negate is on the compare directly.  This means
> >> > the RTL is different from the one without the negate, so CSE is
> >> > unable to recognize that they are essentially the same operation.
> >> >
> >> > To fix this, my patch changes it so that the mask is negated
> >> > rather than the operation:
> >> >
> >> >   mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
> >> >   vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
> >> >   vec_mask_op_67 = ~vec_mask_and_58;
> >> >   vec_mask_and_65 = loop_mask_52 & vec_mask_op_67;
> >>
> >> But to me this looks like a pessimisation in gimple terms.  We've
> >> increased the length of the critical path: vec_mask_and_65 now needs
> >> a chain of 4 operations instead of 3.
> >
> > True, but it should reduce the number of RTL patterns.  I would have
> > thought RTL is more expensive to handle than gimple.
> 
> I think this is only a fair gimple optimisation if gimple does the
> isel itself (to a predicated compare and a predicated NOT).
> 
> >> We also need to be careful not to pessimise the case in which the
> >> comparison is an integer one.  At the moment we'll generate opposed
> >> conditions, which is the intended behaviour:
> >
> > This still happens with this patch at `-Ofast`, because that flips
> > the conditions, so the different representation doesn't harm it.
> 
> OK, that's good.
> 
> >>
> >> .L3:
> >>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
> >>         cmpgt   p2.d, p0/z, z1.d, #0
> >>         movprfx z2, z1
> >>         scvtf   z2.d, p3/m, z1.d
> >>         cmple   p1.d, p0/z, z1.d, #0
> >>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
> >>         ld1d    z1.d, p1/z, [x3, x5, lsl 3]
> >>         fadd    z0.d, p2/m, z0.d, z2.d
> >>         movprfx z0.d, p1/m, z1.d
> >>         fsub    z0.d, p1/m, z0.d, z2.d
> >>         st1d    z0.d, p0, [x0, x5, lsl 3]
> >>         add     x5, x5, x6
> >>         whilelo p0.d, w5, w4
> >>         b.any   .L3
> >>
> >> Could we handle the fcmp case using a 3->2 define_split instead:
> >> convert
> >>
> >>    (set res (and (not (fcmp X Y)) Z)) ->
> >>      (set res (fcmp X Y))
> >>      (set res (and (not res) Z))
> >>
> >
> > This was the other approach I mentioned.  It works, and gives you
> > the neg, but only in the case where the compare is single use.
> 
> But in the original example we duplicate the comparison through a
> 2->2 combine, which leaves the original comparison as a single use.
> Isn't that enough?
> 
> > e.g. in
> >
> > void f11(double * restrict z, double * restrict w, double * restrict x,
> >          double * restrict y, int n)
> > {
> >     for (int i = 0; i < n; i++) {
> >         z[i] = (w[i] > 0) ? w[i] : y[i] - w[i];
> >     }
> > }
> >
> > You have some of the same problem. It generates
> >
> >         ld1d    z0.d, p0/z, [x1, x2, lsl 3]
> >         fcmgt   p2.d, p3/z, z0.d, #0.0
> >         bic     p1.b, p3/z, p0.b, p2.b
> >         ld1d    z1.d, p1/z, [x3, x2, lsl 3]
> >         fsub    z1.d, p1/m, z1.d, z0.d
> >         sel     z0.d, p2, z0.d, z1.d
> >         st1d    z0.d, p0, [x0, x2, lsl 3]
> >         incd    x2
> >         whilelo p0.d, w2, w4
> >
> > which has two problems.  fcmgt doesn't need to be predicated on p3,
> > which is ptrue all; it can/should be p0.
> >
> > With that fixed the splitter won't match, because p2 is needed in
> > the sel, so it's not single use, and combine won't try to build the
> > RTL so it can be split.
> 
> I think having the vectoriser avoid the dual use between the
> IFN_MASK_LOAD/STORE and the VEC_COND_EXPR is fair game, since that is
> the only pass that has the context to prove that including the loop mask in
> the VEC_COND_EXPR condition is correct.  We already try to do that to some
> extent:
> 

Sorry, I have been looking at this for the past couple of days and I
just don't know how this is supposed to work.

In the above example the problem is not just the use of p2 in the
VEC_COND_EXPR.  If the VEC_COND_EXPR is changed to use p1, then p1 has
three uses, and combine still won't try the combination.

But the general case

void f10(double * restrict z, double * restrict w, double * restrict x,
         double * restrict y, int n)
{
    for (int i = 0; i < n; i++) {
        z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
    }
}
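
(compiled at -O3 with SVE enabled, i.e. something like
 gcc -O3 -march=armv8.2-a+sve; exact options assumed)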

produces

f10:
        cmp     w4, 0
        ble     .L1
        mov     x5, 0
        whilelo p1.d, wzr, w4
        ptrue   p3.b, all
        .p2align 5,,15
.L3:
        ld1d    z1.d, p1/z, [x1, x5, lsl 3]
        fcmgt   p2.d, p1/z, z1.d, #0.0
        fcmgt   p0.d, p3/z, z1.d, #0.0
        ld1d    z2.d, p2/z, [x2, x5, lsl 3]
        bic     p0.b, p3/z, p1.b, p0.b
        ld1d    z0.d, p0/z, [x3, x5, lsl 3]
        fsub    z0.d, p0/m, z0.d, z1.d
        movprfx z0.d, p2/m, z1.d
        fadd    z0.d, p2/m, z0.d, z2.d
        st1d    z0.d, p1, [x0, x5, lsl 3]
        incd    x5
        whilelo p1.d, w5, w4
        b.any   .L3

where the VEC_COND_EXPR has been elided.

The problem is that the comparison for the inverse case is
unpredicated, which causes the compiler to predicate it on ptrue when
it is generated from the load.  The resulting gimple is:

  <bb 3> [local count: 105119322]:
  bnd.9_49 = (unsigned int) n_14(D);
  max_mask_79 = .WHILE_ULT (0, bnd.9_49, { 0, ... });

  <bb 4> [local count: 630715932]:
  # ivtmp_76 = PHI <ivtmp_77(4), 0(3)>
  # loop_mask_52 = PHI <next_mask_80(4), max_mask_79(3)>
  _27 = &MEM <vector([2,2]) double> [(double *)w_15(D) + ivtmp_76 * 8];
  vect__4.12_53 = .MASK_LOAD (_27, 64B, loop_mask_52);
  mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
  vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
  _48 = &MEM <vector([2,2]) double> [(double *)x_18(D) + ivtmp_76 * 8];
  vect__6.16_59 = .MASK_LOAD (_48, 64B, vec_mask_and_58);
  mask__43.18_62 = ~mask__41.13_55;
  vec_mask_and_65 = loop_mask_52 & mask__43.18_62;
  _25 = &MEM <vector([2,2]) double> [(double *)y_16(D) + ivtmp_76 * 8];
  vect__8.21_66 = .MASK_LOAD (_25, 64B, vec_mask_and_65);
  vect_iftmp.22_68 = .COND_SUB (vec_mask_and_65, vect__8.21_66,
                                vect__4.12_53, vect__8.21_66);
  _50 = .COND_ADD (vec_mask_and_58, vect__4.12_53, vect__6.16_59,
                   vect_iftmp.22_68);
  _1 = &MEM <vector([2,2]) double> [(double *)z_20(D) + ivtmp_76 * 8];
  .MASK_STORE (_1, 64B, loop_mask_52, _50);
  ivtmp_77 = ivtmp_76 + POLY_INT_CST [2, 2];
  _2 = (unsigned int) ivtmp_77;
  next_mask_80 = .WHILE_ULT (_2, bnd.9_49, { 0, ... });

where the mask vec_mask_and_65 is applied to the negate and not to the
compare, because of how the expansion is done.
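
For contrast, with my patch the negate is applied to the already
masked compare (the sequence quoted further up):

  mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
  vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
  vec_mask_op_67 = ~vec_mask_and_58;
  vec_mask_and_65 = loop_mask_52 & vec_mask_op_67;

which computes the same final mask, since

  loop_mask & ~(loop_mask & cond)
    == loop_mask & (~loop_mask | ~cond)
    == loop_mask & ~cond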

>   /* See whether another part of the vectorized code applies a loop
>      mask to the condition, or to its inverse.  */
> 
> but it would need extending to handle this case.

This code is fine as far as I can tell, but there's nothing you can do
here.  The mask it needs is ~original, so it does not find an inverse
mask it can use, because it has to honor floating-point exceptions.

And indeed `-fno-trapping-math` or `-Ofast` generates the optimal
sequence, but when honoring traps it can't re-use or invert the
existing mask, which leaves the operation unpredicated.

So are you asking that it look inside unary operators and try to CSE
the operand they point to?  In that case, isn't it about the same as
what I had before, just with the vectorizer doing the CSE itself?

If so, maybe it's better to do lookups into
loop_vinfo->scalar_cond_masked_set in prepare_load_store_mask, so that
it just applies to everything?
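
Very roughly, I'm imagining something like the sketch below.  This is
untested, and the scalar_cond/ncopies parameters are hypothetical (the
function doesn't have access to the scalar condition today); only the
final AND matches what prepare_load_store_mask currently does:

static tree
prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
			 gimple_stmt_iterator *gsi, loop_vec_info loop_vinfo,
			 tree scalar_cond, unsigned int ncopies)
{
  if (!loop_mask)
    return vec_mask;

  if (scalar_cond)
    {
      /* Same idea as the existing check in vectorizable_condition:
	 look the condition up in scalar_cond_masked_set so that all
	 users agree on a single masked (or inverse-masked) form.  */
      scalar_cond_masked_key cond (scalar_cond, ncopies);
      if (loop_vinfo->scalar_cond_masked_set.contains (cond))
	{
	  /* ... reuse the canonical masked compare here ... */
	}
    }

  tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
  gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
					  vec_mask, loop_mask);
  gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
  return and_res;
}

That way every masked load/store would go through the same lookup,
rather than only the VEC_COND_EXPR path.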

Thanks,
Tamar

> 
> Thanks,
> Richard
