> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Wednesday, June 30, 2021 6:55 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Monday, June 14, 2021 4:55 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> >> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
> >>
> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > Hi Richard,
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> >> Sent: Monday, June 14, 2021 3:50 PM
> >> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> >> >> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
> >> >>
> >> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> >> > Hi All,
> >> >> >
> >> >> > This RFC is trying to address the following inefficiency when vectorizing conditional statements with SVE.
> >> >> >
> >> >> > Consider the case
> >> >> >
> >> >> > void f10(double * restrict z, double * restrict w, double * restrict x,
> >> >> >          double * restrict y, int n)
> >> >> > {
> >> >> >     for (int i = 0; i < n; i++) {
> >> >> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
> >> >> >     }
> >> >> > }
> >> >> >
> >> >> > For which we currently generate at -O3:
> >> >> >
> >> >> > f10:
> >> >> >         cmp     w4, 0
> >> >> >         ble     .L1
> >> >> >         mov     x5, 0
> >> >> >         whilelo p1.d, wzr, w4
> >> >> >         ptrue   p3.b, all
> >> >> > .L3:
> >> >> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
> >> >> >         fcmgt   p2.d, p1/z, z1.d, #0.0
> >> >> >         fcmgt   p0.d, p3/z, z1.d, #0.0
> >> >> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
> >> >> >         bic     p0.b, p3/z, p1.b, p0.b
> >> >> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
> >> >> >         fsub    z0.d, p0/m, z0.d, z1.d
> >> >> >         movprfx z0.d, p2/m, z1.d
> >> >> >         fadd    z0.d, p2/m, z0.d, z2.d
> >> >> >         st1d    z0.d, p1, [x0, x5, lsl 3]
> >> >> >         incd    x5
> >> >> >         whilelo p1.d, w5, w4
> >> >> >         b.any   .L3
> >> >> > .L1:
> >> >> >         ret
> >> >> >
> >> >> > Notice that the condition for the else branch duplicates the same predicate as the then branch and then uses BIC to negate the results.
> >> >> >
> >> >> > The reason for this is that during instruction generation in the vectorizer we emit
> >> >> >
> >> >> >   mask__41.11_66 = vect__4.10_64 > vect_cst__65;
> >> >> >   vec_mask_and_69 = mask__41.11_66 & loop_mask_63;
> >> >> >   vec_mask_and_71 = mask__41.11_66 & loop_mask_63;
> >> >> >   mask__43.16_73 = ~mask__41.11_66;
> >> >> >   vec_mask_and_76 = mask__43.16_73 & loop_mask_63;
> >> >> >   vec_mask_and_78 = mask__43.16_73 & loop_mask_63;
> >> >> >
> >> >> > which ultimately gets optimized to
> >> >> >
> >> >> >   mask__41.11_66 = vect__4.10_64 > { 0.0, ... };
> >> >> >   vec_mask_and_69 = loop_mask_63 & mask__41.11_66;
> >> >> >   mask__43.16_73 = ~mask__41.11_66;
> >> >> >   vec_mask_and_76 = loop_mask_63 & mask__43.16_73;
> >> >> >
> >> >> > Notice how the negate is on the operation and not the predicate resulting from the operation.  When this is expanded this turns into RTL where the negate is on the compare directly.  This means the RTL is different from the one without the negate and so CSE is unable to recognize that they are essentially the same operation.
> >> >> >
> >> >> > To fix this my patch changes it so you negate the mask rather than the operation:
> >> >> >
> >> >> >   mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
> >> >> >   vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
> >> >> >   vec_mask_op_67 = ~vec_mask_and_58;
> >> >> >   vec_mask_and_65 = loop_mask_52 & vec_mask_op_67;
> >> >>
> >> >> But to me this looks like a pessimisation in gimple terms.  We've increased the length of the critical path: vec_mask_and_65 now needs a chain of 4 operations instead of 3.
> >> >
> >> > True, but it should reduce the number of RTL patterns.  I would have thought RTL is more expensive to handle than gimple.
> >>
> >> I think this is only a fair gimple optimisation if gimple does the isel itself (to a predicated compare and a predicated NOT).
> >>
> >> >> We also need to be careful not to pessimise the case in which the comparison is an integer one.  At the moment we'll generate opposed conditions, which is the intended behaviour:
> >> >
> >> > This still happens with this patch at `-Ofast` because that flips the conditions, so the different representation doesn't harm it.
> >>
> >> OK, that's good.
> >>
> >> >>
> >> >> .L3:
> >> >>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
> >> >>         cmpgt   p2.d, p0/z, z1.d, #0
> >> >>         movprfx z2, z1
> >> >>         scvtf   z2.d, p3/m, z1.d
> >> >>         cmple   p1.d, p0/z, z1.d, #0
> >> >>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
> >> >>         ld1d    z1.d, p1/z, [x3, x5, lsl 3]
> >> >>         fadd    z0.d, p2/m, z0.d, z2.d
> >> >>         movprfx z0.d, p1/m, z1.d
> >> >>         fsub    z0.d, p1/m, z0.d, z2.d
> >> >>         st1d    z0.d, p0, [x0, x5, lsl 3]
> >> >>         add     x5, x5, x6
> >> >>         whilelo p0.d, w5, w4
> >> >>         b.any   .L3
> >> >>
> >> >> Could we handle the fcmp case using a 3->2 define_split instead: convert
> >> >>
> >> >>   (set res (and (not (fcmp X Y)) Z)) ->
> >> >>   (set res (fcmp X Y))
> >> >>   (set res (and (not res) Z))
> >> >>
> >> >
> >> > This was the other approach I mentioned.  It works, and gives you the neg, but only in the case where the compare is single use.
> >>
> >> But in the original example we duplicate the comparison through a 2->2 combine, which leaves the original comparison as a single use.  Isn't that enough?
> >>
> >> > e.g. in
> >> >
> >> > void f11(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n) {
> >> >     for (int i = 0; i < n; i++) {
> >> >         z[i] = (w[i] > 0) ? w[i] : y[i] - w[i];
> >> >     }
> >> > }
> >> >
> >> > You have some of the same problem.  It generates
> >> >
> >> >         ld1d    z0.d, p0/z, [x1, x2, lsl 3]
> >> >         fcmgt   p2.d, p3/z, z0.d, #0.0
> >> >         bic     p1.b, p3/z, p0.b, p2.b
> >> >         ld1d    z1.d, p1/z, [x3, x2, lsl 3]
> >> >         fsub    z1.d, p1/m, z1.d, z0.d
> >> >         sel     z0.d, p2, z0.d, z1.d
> >> >         st1d    z0.d, p0, [x0, x2, lsl 3]
> >> >         incd    x2
> >> >         whilelo p0.d, w2, w4
> >> >
> >> > which has two problems.  fcmgt doesn't need to be predicated on p3, which is ptrue all; it can/should be p0.
> >> >
> >> > With that fixed the splitter won't match because p2 is needed in the sel, so it's not single use and so combine won't try to build the RTL so it can be split.
> >>
> >> I think having the vectoriser avoid the dual use between the IFN_MASK_LOAD/STORE and the VEC_COND_EXPR is fair game, since that is the only pass that has the context to prove that including the loop mask in the VEC_COND_EXPR condition is correct.  We already try to do that to some extent:
> >>
> >
> > Sorry, I have been looking at this these past couple of days and I just don't know how this is supposed to work.
> >
> > In the above example the problem is not just the use of p2 in the VEC_COND_EXPR.  If the VEC_COND_EXPR is changed to use p1 then p1 now has 3 uses, which makes combine still not try the combination.
> >
> > But the general case
> >
> > void f10(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n) {
> >     for (int i = 0; i < n; i++) {
> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
> >     }
> > }
> >
> > produces
> >
> > f10:
> >         cmp     w4, 0
> >         ble     .L1
> >         mov     x5, 0
> >         whilelo p1.d, wzr, w4
> >         ptrue   p3.b, all
> >         .p2align 5,,15
> > .L3:
> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
> >         fcmgt   p2.d, p1/z, z1.d, #0.0
> >         fcmgt   p0.d, p3/z, z1.d, #0.0
> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
> >         bic     p0.b, p3/z, p1.b, p0.b
> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
> >         fsub    z0.d, p0/m, z0.d, z1.d
> >         movprfx z0.d, p2/m, z1.d
> >         fadd    z0.d, p2/m, z0.d, z2.d
> >         st1d    z0.d, p1, [x0, x5, lsl 3]
> >         incd    x5
> >         whilelo p1.d, w5, w4
> >         b.any   .L3
> >
> > where the VEC_COND_EXPR has been elided.
> >
> > The problem is that the comparison for the inverse case is unpredicated.
>
> Yeah, for the original f10 example I was suggesting that we use combine instead of changing the vectoriser.  It turns out that the 3->2 split thing I mentioned above won't work though, because we match the BIC first.
> But I think something like:
>
> (define_insn_and_split "*inverted_fcmgt"
>   [(set (match_operand:<VPRED> 0 "register_operand" "=Upa")
>         (and:<VPRED>
>           (and:<VPRED>
>             (not:<VPRED>
>               (unspec:<VPRED>
>                 [(match_operand:<VPRED> 1 "register_operand" "Upl")
>                  (const_int SVE_KNOWN_PTRUE)
>                  (match_operand:SVE_FULL_F 2 "register_operand" "w")
>                  (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "wDz")]
>                 SVE_COND_FP_CMP_I0))
>             (match_operand:<VPRED> 4 "register_operand" "w"))
>           (match_dup 1)))]
>   "TARGET_SVE && !reg_overlap_mentioned_p (operands[4], operands[0])"
>   "#"
>   "&& 1"
>   [(set (match_dup 0)
>         (unspec:<VPRED>
>           [(match_dup 4)
>            (const_int SVE_MAYBE_NOT_PTRUE)
>            (match_dup 2)
>            (match_dup 3)]
>           SVE_COND_FP_CMP_I0))
>    (set (match_dup 0)
>         (and:<VPRED> (not:<VPRED> (match_dup 0)) (match_dup 4)))]
> )
>
> is a legitimate optimisation in its own right because it gets rid of the PTRUE operand (at split time).  This would be a good thing to do wherever the FCMxx and BIC come from.
Ah I see... I was trying with

;; Make sure that inversions of masked comparisons are always on the mask
;; instead of on the operation.
(define_insn_and_split "*mask_inv_combine"
  [(match_scratch:<VPRED> 5 "=Upa, Upa")
   (set (match_operand:<VPRED> 0 "register_operand" "=Upa, Upa")
        (and:<VPRED>
          (and:<VPRED>
            (not:<VPRED>
              (unspec:<VPRED>
                [(match_operand:<VPRED> 1)
                 (const_int SVE_KNOWN_PTRUE)
                 (match_operand:SVE_FULL_F 2 "register_operand" "w, w")
                 (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "Dz, w")]
                SVE_COND_FP_CMP_I0))
            (match_operand:<VPRED> 4 "register_operand" "Upa, Upa"))
          (match_dup:<VPRED> 1)))]
  "TARGET_SVE"
  "#"
  "&& 1"
  [(set (match_dup 0)
        (and:<VPRED>
          (unspec:<VPRED>
            [(match_dup 1)
             (const_int SVE_KNOWN_PTRUE)
             (match_dup 2)
             (match_dup 3)]
            SVE_COND_FP_CMP_I0)
          (match_dup 4)))
   (set (match_dup 0)
        (and:<VPRED> (not:<VPRED> (match_dup 0)) (match_dup 4)))]
)

But the difference seems to be that I need to use (const_int SVE_MAYBE_NOT_PTRUE) after the split.  Could you perhaps point me to what this means in RTL?  The predicate itself is operand 1, so I just want to know what exactly this does to the semantics.

>
> The snag is that we don't then CSE the comparisons between split1 and RA.  The post-RA optimisers might then leave a move.  E.g. for plain -O3 the above pattern gives the expected:
>
>         cmp     w4, 0
>         ble     .L1
>         mov     x5, 0
>         cntd    x6
>         whilelo p0.d, wzr, w4
>         .p2align 3,,7
> .L3:
>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
>         fcmgt   p2.d, p0/z, z1.d, #0.0
>         not     p1.b, p0/z, p2.b
>         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
>         ld1d    z0.d, p1/z, [x3, x5, lsl 3]
>         fsub    z0.d, p1/m, z0.d, z1.d
>         movprfx z0.d, p2/m, z1.d
>         fadd    z0.d, p2/m, z0.d, z2.d
>         st1d    z0.d, p0, [x0, x5, lsl 3]
>         add     x5, x5, x6
>         whilelo p0.d, w5, w4
>         b.any   .L3
> .L1:
>         ret
>
> but -O3 -fno-trapping-math has a predicate move:
>
>         cmp     w4, 0
>         ble     .L1
>         mov     x5, 0
>         cntd    x6
>         whilelo p0.d, wzr, w4
>         .p2align 3,,7
> .L3:
>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
>         fcmgt   p1.d, p0/z, z1.d, #0.0
>         mov     p2.b, p1.b
>         not     p1.b, p0/z, p1.b
>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
>         ld1d    z2.d, p1/z, [x3, x5, lsl 3]
>         fadd    z0.d, z1.d, z0.d
>         movprfx z0.d, p1/m, z2.d
>         fsub    z0.d, p1/m, z0.d, z1.d
>         st1d    z0.d, p0, [x0, x5, lsl 3]
>         add     x5, x5, x6
>         whilelo p0.d, w5, w4
>         b.any   .L3
> .L1:
>         ret
>
> It would be good to fix this in RTL “somehow”.
>
> But for f11 we still get:
>
> f11:
> .LFB1:
>         .cfi_startproc
>         cmp     w4, 0
>         ble     .L6
>         mov     x2, 0
>         cntd    x5
>         whilelo p1.d, wzr, w4
>         ptrue   p3.b, all
>         .p2align 3,,7
> .L8:
>         ld1d    z0.d, p1/z, [x1, x2, lsl 3]
>         fcmgt   p0.d, p1/z, z0.d, #0.0
>         fcmgt   p2.d, p3/z, z0.d, #0.0
>         not     p0.b, p1/z, p0.b
>         ld1d    z1.d, p0/z, [x3, x2, lsl 3]
>         fsub    z1.d, p0/m, z1.d, z0.d
>         sel     z0.d, p2, z0.d, z1.d
>         st1d    z0.d, p1, [x0, x2, lsl 3]
>         add     x2, x2, x5
>         whilelo p1.d, w2, w4
>         b.any   .L8
> .L6:
>         ret
>
> This is where the VEC_COND_EXPR code I mentioned should come in.  At the moment, before generating:
>
>   VEC_COND_EXPR<C, E1, E2>
>
> we check whether there are any other operations predicated on C & LOOP_MASK.  If so, we use:
>
>   VEC_COND_EXPR<C & LOOP_MASK, E1, E2>
>
> instead.  This avoids both C and C & LOOP_MASK being live at the same time.
>
> The change I was suggesting was that we should also check whether there are any operations predicated on ~C & LOOP_MASK.  If so, we should generate:
>
>   VEC_COND_EXPR<~C & LOOP_MASK, E2, E1>
>
> Again, the justification is that we don't have C and ~C & LOOP_MASK live at the same time.  If C itself has the form ~C' then swapping the operands also saves a negation.
>
> > This code is fine as far as I can tell.
> > But there's nothing you can do here.  The mask it needs is ~original, so it does not find an inverse mask to use because it has to honor floating point exceptions.
> >
> > And indeed `-fno-trapping-math` or `-Ofast` generates the most optimal sequence, but when honoring traps it can't re-use the inverted existing mask, which leaves the operation unpredicated.
> >
> > So is what you're requesting that it looks inside unary operators and tries to CSE the thing they point to?
>
> [Hopefully answered this above]

Yes it does, along with the split above.  I had originally thought you meant the VEC_COND_EXPR should take care of both problems.

> > In which case isn't it about the same as what I had before, just that the vectorizer did the CSE itself?
>
> IMO the VEC_COND_EXPR optimisation is more selective.  It only changes the gimple if we know that it will have an effect (in terms of reducing the number of SSA_NAMEs that have multiple uses).  Also, unlike the original patch, there's no double-masking involved: ~C & LOOP_MASK will only apply LOOP_MASK once.
>
> > If that's the case maybe it's better to do lookups into loop_vinfo->scalar_cond_masked_set in prepare_load_store_mask?
> > So that it just applies to everything?
>
> scalar_cond_masked_set only exists to convert a mask that ignores LOOP_MASK into one that is ANDed with LOOP_MASK.  For loads and stores we always AND with LOOP_MASK, so a lookup isn't necessary.

Fair enough, thanks for the explanation!  I'll keep this in mind for the other patches.

Thanks,
Tamar

> Thanks,
> Richard