> -----Original Message----- > From: Richard Sandiford <richard.sandif...@arm.com> > Sent: Wednesday, May 15, 2024 10:31 PM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: Richard Biener <richard.guent...@gmail.com>; gcc-patches@gcc.gnu.org; nd > <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus > Shawcroft <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org > Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain > operations. > > Tamar Christina <tamar.christ...@arm.com> writes: > >> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina > >> >> <tamar.christ...@arm.com> wrote: > >> >> > > >> >> > Hi All, > >> >> > > >> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that > state > >> >> > that for predicated operations that also produce a predicate it is > >> >> > preferred > >> >> > that the codegen should use a different register for the destination > >> >> > than > that > >> >> > of the input predicate in order to avoid a performance overhead. > >> >> > > >> >> > This of course has the problem that it increases register pressure > >> >> > and so > >> should > >> >> > be done with care. Additionally not all micro-architectures have this > >> >> > consideration and so it shouldn't be done as a default thing. > >> >> > > >> >> > The patch series adds support for doing conditional early clobbers > >> >> > through > a > >> >> > combination of new alternatives and attributes to control their > >> >> > availability. > >> >> > >> >> You could have two alternatives, one with early clobber and one with > >> >> a matching constraint where you'd disparage the matching constraint one? > >> >> > >> > > >> > Yeah, that's what I do, though there's no need to disparage the non-early > clobber > >> > alternative as the early clobber alternative will naturally get a > >> > penalty if it > needs a > >> > reload. > >> > >> But I think Richard's suggestion was to disparage the one with a matching > >> constraint (not the earlyclobber), to reflect the increased cost of > >> reusing the register. > >> > >> We did take that approach for gathers, e.g.: > >> > >> [&w, Z, w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s] > >> [?w, Z, 0, Ui1, Ui1, Upl] ^ > >> > >> The (supposed) advantage is that, if register pressure is so tight > >> that using matching registers is the only alternative, we still > >> have the opportunity to do that, as a last resort. > >> > >> Providing only an earlyclobber version means that using the same > >> register is prohibited outright. If no other register is free, the RA > >> would need to spill something else to free up a temporary register. > >> And it might then do the equivalent of (pseudo-code): > >> > >> not p1.b, ..., p0.b > >> mov p0.d, p1.d > >> > >> after spilling what would otherwise have occupied p1. In that > >> situation it would be better use: > >> > >> not p0.b, ..., p0.b > >> > >> and not introduce the spill of p1. > > > > I think I understood what Richi meant, but I thought it was already working > > that > way. > > The suggestion was to use matching constraints (like "0") though, > whereas the patch doesn't. I think your argument is that you don't > need to use matching constraints. But that's different from the > suggestion (and from how we handle gathers). > > I was going to say in response to patch 3 (but got distracted, sorry): > I don't think we should have: > > &Upa, Upa, ... > Upa, Upa, ... > > (taken from the pure logic ops) enabled at the same time. Even though > it works for the testcases, I don't think it has well-defined semantics. > > The problem is that, taken on its own, the second alternative says that > matching operands are free. And fundamentally, I don't think the costs > *must* take the earlyclobber alternative over the non-earlyclobber one > (when costing during IRA, for instance). In principle, the cheapest > is best. > > The aim of the gather approach is to make each alternative correct in > isolation. In: > > [&w, Z, w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s] > [?w, Z, 0, Ui1, Ui1, Upl] ^ > > the second alternative says that it is possible to have operands 0 > and 2 be the same vector register, but using that version has the > cost of an extra reload. In that sense the alternatives are > (essentially) consistent about the restriction. > > > i.e. as one of the testcases I had: > > > >> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 > >> -ffixed- > p[1-15] > > > > foo: > > mov z31.h, w0 > > ptrue p0.b, all > > cmplo p0.h, p0/z, z0.h, z31.h > > b use > > > > and reload did not force a spill. > > > > My understanding of how this works, and how it seems to be working is that > since reload costs > > Alternative from front to back the cheapest one wins and it stops > > evaluating the > rest. > > > > The early clobber case is first and preferred, however when it's not > > possible, i.e. > requires a non-pseudo > > reload, the reload cost is added to the alternative. > > > > However you're right that in the following testcase: > > > > -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 > > -ffixed- > p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 > -ffixed-p12 - > ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload > > > > i.e. giving it an extra free register inexplicably causes a spill: > > > > foo: > > addvl sp, sp, #-1 > > mov z31.h, w0 > > ptrue p0.b, all > > str p15, [sp] > > cmplo p15.h, p0/z, z0.h, z31.h > > mov p0.b, p15.b > > ldr p15, [sp] > > addvl sp, sp, #1 > > b use > > > > so that's unexpected and is very weird as p15 has no defined value.. > > This is because the function implicitly uses the SVE PCS, and so needs > to preserve p15 for the caller. It looks like the correct behaviour.
Sure, but p15 isn't live after the call. It is somewhat of a regression in that if it had chosen the tie version then p0 wouldn't need preserving. It's a bit of an artificial case I guess but are we ok with this regression? Or is there a way to query df to see if a value is live after the call? I can only see ways to tell if the register is live before the call.. Thanks, Tamar > > > Now adding the ? as suggested to the non-early clobber alternative does not > > fix > it, and my mental model for how this is supposed to work does not quite line > up.. > > Why would making the non-clobber alternative even more expensive help it > during high register pressure?? > > Hopefully the above answers this. The non-clobber alternative has > zero extra cost as things stand. The costs from one alternative > (the earlyclobber one) don't carry forward to other alternatives. > > > But with that suggestion the above case does not get fixed > > and the following case > > > > -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 > > -ffixed- > p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 > -ffixed-p12 - > ffixed-p13 -ffixed-p14 -ffixed-p15 -fdump-rtl-reload > > > > ICEs: > > > > pred-clobber.c: In function 'foo': > > pred-clobber.c:9:1: error: unable to find a register to spill > > 9 | } > > | ^ > > pred-clobber.c:9:1: error: this is the insn: > > (insn 10 22 19 2 (parallel [ > > (set (reg:VNx8BI 110 [104]) > > (unspec:VNx8BI [ > > (reg:VNx8BI 112) > > (const_int 1 [0x1]) > > (ltu:VNx8BI (reg:VNx8HI 32 v0) > > (reg:VNx8HI 63 v31)) > > ] UNSPEC_PRED_Z)) > > (clobber (reg:CC_NZC 66 cc)) > > ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi} > > (expr_list:REG_DEAD (reg:VNx8BI 112) > > (expr_list:REG_DEAD (reg:VNx8HI 63 v31) > > (expr_list:REG_DEAD (reg:VNx8HI 32 v0) > > (expr_list:REG_UNUSED (reg:CC_NZC 66 cc) > > (nil)))))) > > during RTL pass: reload > > dump file: pred-clobber.c.315r.reload > > Which pattern did you use? > > > and this is because the use of ? has the unintended side-effect of blocking > > a > register class entirely during Sched1 as we've recently discovered. > > i.e. see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766 > > (Is sched1 the problem here, or is it purely an RA thing? What happens > when scheduling is disabled?) > > > in this case it marked the alternative as NO_REGS during sched1 and so it's > completely dead. > > the use of the ? alternatives has caused quite the code bloat as we've > > recently > discovered because of this unexpected and undocumented behavior. > > > > To me, > > > > diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64- > sve.md > > index 93ec59e58af..2ee3d8ea35e 100644 > > --- a/gcc/config/aarch64/aarch64-sve.md > > +++ b/gcc/config/aarch64/aarch64-sve.md > > @@ -8120,10 +8120,10 @@ (define_insn > "@aarch64_pred_cmp<cmp_op><mode>" > > (clobber (reg:CC_NZC CC_REGNUM))] > > "TARGET_SVE" > > {@ [ cons: =0 , 1 , 3 , 4 ; attrs: pred_clobber ] > > - [ &Upa , Upl , w , <sve_imm_con>; yes ] > cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4 > > - [ Upa , Upl , w , <sve_imm_con>; * ] ^ > > - [ &Upa , Upl , w , w ; yes ] > > cmp<cmp_op>\t%0.<Vetype>, > %1/z, %3.<Vetype>, %4.<Vetype> > > - [ Upa , Upl , w , w ; * ] ^ > > + [ ^&Upa , Upl , w , <sve_imm_con>; yes ] > cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4 > > + [ Upa , Upl , w , <sve_imm_con>; * ] ^ > > + [ ^&Upa , Upl , w , w ; yes ] > > cmp<cmp_op>\t%0.<Vetype>, > %1/z, %3.<Vetype>, %4.<Vetype> > > + [ Upa , Upl , w , w ; * ] ^ > > } > > ) > > > > Would have been the right approach, i.e. we prefer the alternative unless a > > reload > is needed, which should work no? well. if ^ wasn't broken the same way > > as ?. Perhaps I need to use Wilco's new alternative that doesn't block a > > register > class? > > Hmm, I'm not sure. It seems odd to mark only the output with ^, since > reloading the output isn't fundamentally different (costwise) from > reloading the input. > > But to me, it's the alternative without the earlyclobber that should be > disparaged, since it's the inherently expensive one. > > The gather-like approach would be something like: > > [ &Upa , Upl , w , <sve_imm_con>; yes ] > cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4 > [ ?Upl , 0 , w , <sve_imm_con>; yes ] ^ > [ Upa , Upl , w , <sve_imm_con>; no ] ^ > [ &Upa , Upl , w , w ; yes ] > cmp<cmp_op>\t%0.<Vetype>, %1/z, > %3.<Vetype>, %4.<Vetype> > [ ?Upl , 0 , w , w ; yes ] ^ > [ Upa , Upl , w , w ; no ] ^ > > with: > > (define_attr "pred_clobber" "any,no,yes" (const_string "any")) > > Thanks, > Richard