> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Wednesday, June 30, 2021 6:55 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Monday, June 14, 2021 4:55 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> >> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
> >>
> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > Hi Richard,
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> >> Sent: Monday, June 14, 2021 3:50 PM
> >> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> >> >> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
> >> >>
> >> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> >> > Hi All,
> >> >> >
> >> >> > This RFC is trying to address the following inefficiency when vectorizing conditional statements with SVE.
> >> >> >
> >> >> > Consider the case
> >> >> >
> >> >> > void f10(double * restrict z, double * restrict w, double * restrict x,
> >> >> >          double * restrict y, int n)
> >> >> > {
> >> >> >     for (int i = 0; i < n; i++) {
> >> >> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
> >> >> >     }
> >> >> > }
> >> >> >
> >> >> > For which we currently generate at -O3:
> >> >> >
> >> >> > f10:
> >> >> >         cmp     w4, 0
> >> >> >         ble     .L1
> >> >> >         mov     x5, 0
> >> >> >         whilelo p1.d, wzr, w4
> >> >> >         ptrue   p3.b, all
> >> >> > .L3:
> >> >> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
> >> >> >         fcmgt   p2.d, p1/z, z1.d, #0.0
> >> >> >         fcmgt   p0.d, p3/z, z1.d, #0.0
> >> >> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
> >> >> >         bic     p0.b, p3/z, p1.b, p0.b
> >> >> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
> >> >> >         fsub    z0.d, p0/m, z0.d, z1.d
> >> >> >         movprfx z0.d, p2/m, z1.d
> >> >> >         fadd    z0.d, p2/m, z0.d, z2.d
> >> >> >         st1d    z0.d, p1, [x0, x5, lsl 3]
> >> >> >         incd    x5
> >> >> >         whilelo p1.d, w5, w4
> >> >> >         b.any   .L3
> >> >> > .L1:
> >> >> >         ret
> >> >> >
> >> >> > Notice that the condition for the else branch duplicates the same predicate as the then branch and then uses BIC to negate the results.
> >> >> >
> >> >> > The reason for this is that during instruction generation in the vectorizer we emit
> >> >> >
> >> >> >   mask__41.11_66 = vect__4.10_64 > vect_cst__65;
> >> >> >   vec_mask_and_69 = mask__41.11_66 & loop_mask_63;
> >> >> >   vec_mask_and_71 = mask__41.11_66 & loop_mask_63;
> >> >> >   mask__43.16_73 = ~mask__41.11_66;
> >> >> >   vec_mask_and_76 = mask__43.16_73 & loop_mask_63;
> >> >> >   vec_mask_and_78 = mask__43.16_73 & loop_mask_63;
> >> >> >
> >> >> > which ultimately gets optimized to
> >> >> >
> >> >> >   mask__41.11_66 = vect__4.10_64 > { 0.0, ... };
> >> >> >   vec_mask_and_69 = loop_mask_63 & mask__41.11_66;
> >> >> >   mask__43.16_73 = ~mask__41.11_66;
> >> >> >   vec_mask_and_76 = loop_mask_63 & mask__43.16_73;
> >> >> >
> >> >> > Notice how the negate is on the operation and not the predicate resulting from the operation.  When this is expanded this turns into RTL where the negate is on the compare directly.  This means the RTL is different from the one without the negate and so CSE is unable to recognize that they are essentially the same operation.
> >> >> >
> >> >> > To fix this my patch changes it so you negate the mask rather than the operation:
> >> >> >
> >> >> >   mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
> >> >> >   vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
> >> >> >   vec_mask_op_67 = ~vec_mask_and_58;
> >> >> >   vec_mask_and_65 = loop_mask_52 & vec_mask_op_67;
> >> >>
> >> >> But to me this looks like a pessimisation in gimple terms.  We've increased the length of the critical path: vec_mask_and_65 now needs a chain of 4 operations instead of 3.
> >> >
> >> > True, but it should reduce the number of RTL patterns.  I would have thought RTL is more expensive to handle than gimple.
> >>
> >> I think this is only a fair gimple optimisation if gimple does the isel itself (to a predicated compare and a predicated NOT).
> >>
> >> >> We also need to be careful not to pessimise the case in which the comparison is an integer one.  At the moment we'll generate opposed conditions, which is the intended behaviour:
> >> >
> >> > This still happens with this patch at `-Ofast` because that flips the conditions, so the different representation doesn't harm it.
> >>
> >> OK, that's good.
> >>
> >> >>
> >> >> .L3:
> >> >>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
> >> >>         cmpgt   p2.d, p0/z, z1.d, #0
> >> >>         movprfx z2, z1
> >> >>         scvtf   z2.d, p3/m, z1.d
> >> >>         cmple   p1.d, p0/z, z1.d, #0
> >> >>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
> >> >>         ld1d    z1.d, p1/z, [x3, x5, lsl 3]
> >> >>         fadd    z0.d, p2/m, z0.d, z2.d
> >> >>         movprfx z0.d, p1/m, z1.d
> >> >>         fsub    z0.d, p1/m, z0.d, z2.d
> >> >>         st1d    z0.d, p0, [x0, x5, lsl 3]
> >> >>         add     x5, x5, x6
> >> >>         whilelo p0.d, w5, w4
> >> >>         b.any   .L3
> >> >>
> >> >> Could we handle the fcmp case using a 3->2 define_split instead: convert
> >> >>
> >> >>   (set res (and (not (fcmp X Y)) Z)) ->
> >> >>   (set res (fcmp X Y))
> >> >>   (set res (and (not res) Z))
> >> >>
> >> >
> >> > This was the other approach I mentioned.  It works, and gives you the neg, but only in the case where the compare is single use.
> >>
> >> But in the original example we duplicate the comparison through a 2->2 combine, which leaves the original comparison as a single use.  Isn't that enough?
> >>
> >> > e.g. in
> >> >
> >> > void f11(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n) {
> >> >     for (int i = 0; i < n; i++) {
> >> >         z[i] = (w[i] > 0) ? w[i] : y[i] - w[i];
> >> >     }
> >> > }
> >> >
> >> > You have some of the same problem.  It generates
> >> >
> >> >         ld1d    z0.d, p0/z, [x1, x2, lsl 3]
> >> >         fcmgt   p2.d, p3/z, z0.d, #0.0
> >> >         bic     p1.b, p3/z, p0.b, p2.b
> >> >         ld1d    z1.d, p1/z, [x3, x2, lsl 3]
> >> >         fsub    z1.d, p1/m, z1.d, z0.d
> >> >         sel     z0.d, p2, z0.d, z1.d
> >> >         st1d    z0.d, p0, [x0, x2, lsl 3]
> >> >         incd    x2
> >> >         whilelo p0.d, w2, w4
> >> >
> >> > which has two problems.  fcmgt doesn't need to be predicated on p3, which is ptrue all; it can/should be p0.
> >> >
> >> > With that fixed the splitter won't match because p2 is needed in the sel, so it's not single use and so combine won't try to build the RTL so it can be split.
> >>
> >> I think having the vectoriser avoid the dual use between the IFN_MASK_LOAD/STORE and the VEC_COND_EXPR is fair game, since that is the only pass that has the context to prove that including the loop mask in the VEC_COND_EXPR condition is correct.  We already try to do that to some extent:
> >>
> >
> > Sorry, I have been looking at this these past couple of days and I just don't know how this is supposed to work.
> >
> > In the above example the problem is not just the use of p2 in the VEC_COND_EXPR.  If the VEC_COND_EXPR is changed to use p1 then p1 now has 3 uses, which makes combine still not try the combination.
> >
> > But the general case
> >
> > void f10(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n) {
> >     for (int i = 0; i < n; i++) {
> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
> >     }
> > }
> >
> > produces
> >
> > f10:
> >         cmp     w4, 0
> >         ble     .L1
> >         mov     x5, 0
> >         whilelo p1.d, wzr, w4
> >         ptrue   p3.b, all
> >         .p2align 5,,15
> > .L3:
> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
> >         fcmgt   p2.d, p1/z, z1.d, #0.0
> >         fcmgt   p0.d, p3/z, z1.d, #0.0
> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
> >         bic     p0.b, p3/z, p1.b, p0.b
> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
> >         fsub    z0.d, p0/m, z0.d, z1.d
> >         movprfx z0.d, p2/m, z1.d
> >         fadd    z0.d, p2/m, z0.d, z2.d
> >         st1d    z0.d, p1, [x0, x5, lsl 3]
> >         incd    x5
> >         whilelo p1.d, w5, w4
> >         b.any   .L3
> >
> > where the VEC_COND_EXPR has been elided.
> >
> > The problem is that the comparison for the inverse case is unpredicated.
>
> Yeah, for the original f10 example I was suggesting that we use combine instead of changing the vectoriser.  It turns out that the 3->2 split thing I mentioned above won't work though, because we match the BIC first.
> But I think something like:
>
> (define_insn_and_split "*inverted_fcmgt"
>   [(set (match_operand:<VPRED> 0 "register_operand" "=Upa")
>         (and:<VPRED>
>           (and:<VPRED>
>             (not:<VPRED>
>               (unspec:<VPRED>
>                 [(match_operand:<VPRED> 1 "register_operand" "Upl")
>                  (const_int SVE_KNOWN_PTRUE)
>                  (match_operand:SVE_FULL_F 2 "register_operand" "w")
>                  (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "wDz")]
>                 SVE_COND_FP_CMP_I0))
>             (match_operand:<VPRED> 4 "register_operand" "w"))
>           (match_dup 1)))]
>   "TARGET_SVE && !reg_overlap_mentioned_p (operands[4], operands[0])"
>   "#"
>   "&& 1"
>   [(set (match_dup 0)
>         (unspec:<VPRED>
>           [(match_dup 4)
>            (const_int SVE_MAYBE_NOT_PTRUE)
>            (match_dup 2)
>            (match_dup 3)]
>           SVE_COND_FP_CMP_I0))
>    (set (match_dup 0)
>         (and:<VPRED> (not:<VPRED> (match_dup 0)) (match_dup 4)))]
> )
>
> is a legitimate optimisation in its own right because it gets rid of the PTRUE operand (at split time).  This would be a good thing to do wherever the FCMxx and BIC come from.
Ah I see... I was trying with

;; Make sure that inversions of masked comparisons are always on the mask
;; instead of on the operation.
(define_insn_and_split "*mask_inv_combine"
  [(match_scratch:<VPRED> 5 "=Upa, Upa")
   (set (match_operand:<VPRED> 0 "register_operand" "=Upa, Upa")
        (and:<VPRED>
          (and:<VPRED>
            (not:<VPRED>
              (unspec:<VPRED>
                [(match_operand:<VPRED> 1)
                 (const_int SVE_KNOWN_PTRUE)
                 (match_operand:SVE_FULL_F 2 "register_operand" "w, w")
                 (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "Dz, w")]
                SVE_COND_FP_CMP_I0))
            (match_operand:<VPRED> 4 "register_operand" "Upa, Upa"))
          (match_dup:<VPRED> 1)))]
  "TARGET_SVE"
  "#"
  "&& 1"
  [(set (match_dup 0)
        (and:<VPRED>
          (unspec:<VPRED>
            [(match_dup 1)
             (const_int SVE_KNOWN_PTRUE)
             (match_dup 2)
             (match_dup 3)]
            SVE_COND_FP_CMP_I0)
          (match_dup 4)))
   (set (match_dup 0)
        (and:<VPRED> (not:<VPRED> (match_dup 0)) (match_dup 4)))]
)

But the difference seems to be that I need to use (const_int SVE_MAYBE_NOT_PTRUE) after the split.  Could you perhaps point me to what this means in RTL?  The predicate itself is operand 1, so I just want to know what exactly this does to the semantics.

>
> The snag is that we don't then CSE the comparisons between split1 and RA.  The post-RA optimisers might then leave a move.  E.g. for plain -O3 the above pattern gives the expected:
>
>         cmp     w4, 0
>         ble     .L1
>         mov     x5, 0
>         cntd    x6
>         whilelo p0.d, wzr, w4
>         .p2align 3,,7
> .L3:
>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
>         fcmgt   p2.d, p0/z, z1.d, #0.0
>         not     p1.b, p0/z, p2.b
>         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
>         ld1d    z0.d, p1/z, [x3, x5, lsl 3]
>         fsub    z0.d, p1/m, z0.d, z1.d
>         movprfx z0.d, p2/m, z1.d
>         fadd    z0.d, p2/m, z0.d, z2.d
>         st1d    z0.d, p0, [x0, x5, lsl 3]
>         add     x5, x5, x6
>         whilelo p0.d, w5, w4
>         b.any   .L3
> .L1:
>         ret
>
> but -O3 -fno-trapping-math has a predicate move:
>
>         cmp     w4, 0
>         ble     .L1
>         mov     x5, 0
>         cntd    x6
>         whilelo p0.d, wzr, w4
>         .p2align 3,,7
> .L3:
>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
>         fcmgt   p1.d, p0/z, z1.d, #0.0
>         mov     p2.b, p1.b
>         not     p1.b, p0/z, p1.b
>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
>         ld1d    z2.d, p1/z, [x3, x5, lsl 3]
>         fadd    z0.d, z1.d, z0.d
>         movprfx z0.d, p1/m, z2.d
>         fsub    z0.d, p1/m, z0.d, z1.d
>         st1d    z0.d, p0, [x0, x5, lsl 3]
>         add     x5, x5, x6
>         whilelo p0.d, w5, w4
>         b.any   .L3
> .L1:
>         ret
>
> It would be good to fix this in RTL “somehow”.
>
> But for f11 we still get:
>
> f11:
> .LFB1:
>         .cfi_startproc
>         cmp     w4, 0
>         ble     .L6
>         mov     x2, 0
>         cntd    x5
>         whilelo p1.d, wzr, w4
>         ptrue   p3.b, all
>         .p2align 3,,7
> .L8:
>         ld1d    z0.d, p1/z, [x1, x2, lsl 3]
>         fcmgt   p0.d, p1/z, z0.d, #0.0
>         fcmgt   p2.d, p3/z, z0.d, #0.0
>         not     p0.b, p1/z, p0.b
>         ld1d    z1.d, p0/z, [x3, x2, lsl 3]
>         fsub    z1.d, p0/m, z1.d, z0.d
>         sel     z0.d, p2, z0.d, z1.d
>         st1d    z0.d, p1, [x0, x2, lsl 3]
>         add     x2, x2, x5
>         whilelo p1.d, w2, w4
>         b.any   .L8
> .L6:
>         ret
>
> This is where the VEC_COND_EXPR code I mentioned should come in.  At the moment, before generating:
>
>   VEC_COND_EXPR<C, E1, E2>
>
> we check whether there are any other operations predicated on C & LOOP_MASK.  If so, we use:
>
>   VEC_COND_EXPR<C & LOOP_MASK, E1, E2>
>
> instead.  This avoids both C and C & LOOP_MASK being live at the same time.
>
> The change I was suggesting was that we should also check whether there are any operations predicated on ~C & LOOP_MASK.  If so, we should generate:
>
>   VEC_COND_EXPR<~C & LOOP_MASK, E2, E1>
>
> Again, the justification is that we don't have C and ~C & LOOP_MASK live at the same time.  If C itself has the form ~C' then swapping the operands also saves a negation.
>
> > This code is fine as far as I can tell.
> > But there's nothing you can do here.  The mask it needs is ~original, so it does not find an inverse mask to use because it has to honor floating point exceptions.
> >
> > And indeed `-fno-trapping-math` or `-Ofast` generates the most optimal sequence, but when honoring traps it can't re-use the inverted existing mask, which leaves the operation unpredicated.
> >
> > So is what you're requesting that it looks inside unary operators and tries to CSE the thing they point to?
>
> [Hopefully answered this above]

Yes it does, along with the split above.  I had originally thought you meant the VEC_COND_EXPR should take care of both problems.

> > In which case isn't it about the same as what I had before, just that the vectorizer did the CSE itself?
>
> IMO the VEC_COND_EXPR optimisation is more selective.  It only changes the gimple if we know that it will have an effect (in terms of reducing the number of SSA_NAMEs that have multiple uses).  Also, unlike the original patch, there's no double-masking involved: ~C & LOOP_MASK will only apply LOOP_MASK once.
>
> > If that's the case maybe it's better to do lookups into loop_vinfo->scalar_cond_masked_set in prepare_load_store_mask?
> > So that it just applies to everything?
>
> scalar_cond_masked_set only exists to convert a mask that ignores LOOP_MASK into one that is ANDed with LOOP_MASK.  For loads and stores we always AND with LOOP_MASK, so a lookup isn't necessary.

Fair enough, thanks for the explanation!  I'll keep this in mind for the other patches.

Thanks,
Tamar

> Thanks,
> Richard