> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Monday, June 13, 2022 9:38 AM
> To: Richard Sandiford <richard.sandif...@arm.com>
> Cc: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org;
> nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>;
> Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov
> <kyrylo.tkac...@arm.com>; ro...@eyesopen.com
> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
> instructions
> 
> On Mon, 13 Jun 2022, Richard Sandiford wrote:
> 
> > Richard Biener <rguent...@suse.de> writes:
> > > On Wed, 8 Jun 2022, Richard Sandiford wrote:
> > >> Tamar Christina <tamar.christ...@arm.com> writes:
> > >> >> -----Original Message-----
> > >> >> From: Richard Sandiford <richard.sandif...@arm.com>
> > >> >> Sent: Wednesday, June 8, 2022 11:31 AM
> > >> >> To: Tamar Christina <tamar.christ...@arm.com>
> > >> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> > >> >> <richard.earns...@arm.com>; Marcus Shawcroft
> > >> >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov
> > >> >> <kyrylo.tkac...@arm.com>
> > >> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure
> > >> >> load instructions
> > >> >>
> > >> >> Tamar Christina <tamar.christ...@arm.com> writes:
> > >> >> > Hi All,
> > >> >> >
> > >> >> > At some point in time we started lowering the ld1r instructions in gimple.
> > >> >> >
> > >> >> > That is:
> > >> >> >
> > >> >> > uint8x8_t f1(const uint8_t *in) {
> > >> >> >     return vld1_dup_u8(&in[1]); }
> > >> >> >
> > >> >> > generates at gimple:
> > >> >> >
> > >> >> >   _3 = MEM[(const uint8_t *)in_1(D) + 1B];
> > >> >> >   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
> > >> >> >
> > >> >> > Which is good, but we then generate:
> > >> >> >
> > >> >> > f1:
> > >> >> >     ldr     b0, [x0, 1]
> > >> >> >     dup     v0.8b, v0.b[0]
> > >> >> >     ret
> > >> >> >
> > >> >> > instead of ld1r.
> > >> >> >
> > >> >> > The reason is that the load instructions have too restrictive a
> > >> >> > predicate, which stops combine from combining the instructions,
> > >> >> > since the predicate only accepts simple addressing modes.
> > >> >> >
> > >> >> > This patch relaxes the predicate to accept any memory operand
> > >> >> > and relies on LRA to legitimize the address when it needs to
> > >> >> > as the constraint still only allows the simple addressing
> > >> >> > mode.  Reload is always able to legitimize to these.
> > >> >> >
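> > >> >> > As a sketch only (the pattern and predicate names follow what is
> > >> >> > currently in aarch64-simd.md and may differ from the final patch),
> > >> >> > the change is of this shape:
> > >> >> >
> > >> >> > (define_insn "*aarch64_simd_ld1r<mode>"
> > >> >> >   [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > >> >> >         (vec_duplicate:VALL_F16
> > >> >> >           ;; Was "aarch64_simd_struct_operand".  The "Utv"
> > >> >> >           ;; constraint still limits the address at reload time.
> > >> >> >           (match_operand:<VEL> 1 "memory_operand" "Utv")))]
> > >> >> >   "TARGET_SIMD"
> > >> >> >   "ld1r\\t{%0.<Vtype>}, %1")
> > >> >> >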
> > >> >> > Secondly since we are now actually generating more ld1r it
> > >> >> > became clear that the lane instructions suffer from a similar issue.
> > >> >> >
> > >> >> > i.e.
> > >> >> >
> > >> >> > float32x4_t f2(const float32_t *in, float32x4_t a) {
> > >> >> >     float32x4_t dup = vld1q_dup_f32(&in[1]);
> > >> >> >     return vfmaq_laneq_f32 (a, a, dup, 1); }
> > >> >> >
> > >> >> > would generate ld1r + vector fmla instead of ldr + lane fmla.
> > >> >> >
> > >> >> > The reason is similar to the ld1r issue: the predicate is too
> > >> >> > restrictive, accepting only register operands and not memory.
> > >> >> >
> > >> >> > This patch relaxes it to accept registers and/or memory while
> > >> >> > leaving the constraint to accept only registers.  LRA will then
> > >> >> > generate a reload if needed, forcing the memory into registers
> > >> >> > using the standard patterns.
> > >> >> >
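> > >> >> > Again only as a sketch (modelled on the existing
> > >> >> > *aarch64_fma4_elt_from_dup<mode> pattern; constraints are
> > >> >> > simplified for illustration), the scalar operand's predicate is
> > >> >> > widened while its constraint stays register-only:
> > >> >> >
> > >> >> > (define_insn "*aarch64_fma4_elt_from_dup<mode>"
> > >> >> >   [(set (match_operand:VMUL 0 "register_operand" "=w")
> > >> >> >     (fma:VMUL
> > >> >> >       (vec_duplicate:VMUL
> > >> >> >         ;; Was "register_operand"; the "w" constraint makes LRA
> > >> >> >         ;; reload any memory input through a register.
> > >> >> >         (match_operand:<VEL> 1 "nonimmediate_operand" "w"))
> > >> >> >       (match_operand:VMUL 2 "register_operand" "w")
> > >> >> >       (match_operand:VMUL 3 "register_operand" "0")))]
> > >> >> >   "TARGET_SIMD"
> > >> >> >   "fmla\\t%0.<Vtype>, %2.<Vtype>, %1.<Vetype>[0]")
> > >> >> >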
> > >> >> > These two changes allow combine and reload to generate the
> > >> >> > right sequences.
> > >> >> >
> > >> >> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > >> >>
> > >> >> This is going against the general direction of travel, which is
> > >> >> to make the instruction's predicates and conditions enforce the
> > >> >> constraints as much as possible (making optimistic assumptions
> > >> >> about pseudo registers).
> > >> >>
> > >> >> The RA *can* deal with things like:
> > >> >>
> > >> >>   (match_operand:M N "general_operand" "r")
> > >> >>
> > >> >> but it's best avoided, for a few reasons:
> > >> >>
> > >> >> (1) The fix-up will be done in LRA, so IRA will not see the temporary
> > >> >>     registers.  This can make the allocation of those temporaries
> > >> >>     suboptimal but (more importantly) it might require other
> > >> >>     previously-allocated registers to be spilled late due to the
> > >> >>     unexpected increase in register pressure.
> > >> >>
> > >> >> (2) It ends up hiding instructions from the pre-RA optimisers.
> > >> >>
> > >> >> (3) It can also prevent combine opportunities (as well as create them),
> > >> >>     unless the loose predicates in an insn I are propagated to all
> > >> >>     patterns that might result from combining I with something else.
> > >> >>
> > >> >> It sounds like the first problem (not generating ld1r) could be
> > >> >> fixed by (a) combining aarch64_simd_dup<mode> and
> > >> >> *aarch64_simd_ld1r<mode>, so that the register and memory
> > >> >> alternatives are in the same pattern and (b) using the merged
> > >> >> instruction(s) to implement the vec_duplicate optab.
> > >> >> Target-independent code should then make the address satisfy the
> > >> >> predicate, simplifying the address where necessary.
> > >> >>
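> > >> >> (Roughly, and only as a sketch -- the alternatives and iterators
> > >> >> are illustrative, and a real pattern would need more alternatives,
> > >> >> e.g. "r" for GP-register scalars as in aarch64_simd_dup<mode>:
> > >> >>
> > >> >>   (define_insn "vec_duplicate<mode>"
> > >> >>     [(set (match_operand:VALL_F16 0 "register_operand" "=w, w")
> > >> >>           (vec_duplicate:VALL_F16
> > >> >>             (match_operand:<VEL> 1 "nonimmediate_operand" "w, Utv")))]
> > >> >>     "TARGET_SIMD"
> > >> >>     "@
> > >> >>      dup\\t%0.<Vtype>, %1.<Vetype>[0]
> > >> >>      ld1r\\t{%0.<Vtype>}, %1")
> > >> >>
> > >> >> so that expand can pick either alternative and combine sees both
> > >> >> forms in one pattern.)
> > >> >>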
> > >> >
> > >> > I think I am likely missing something here. I would assume that
> > >> > you wanted to use the optab to split the addressing off from the
> > >> > mem expression so the combined insn matches.
> > >> >
> > >> > But in that case, why do you need to combine the two instructions?
> > >> > I've tried and it doesn't work since the vec_duplicate optab
> > >> > doesn't see the mem as op1, because in gimple the mem is not
> > >> > part of the duplicate.
> > >> >
> > >> > So you still just see:
> > >> >
> > >> >>>> dbgrtx (ops[1].value)
> > >> > (subreg/s/v:QI (reg:SI 92 [ _3 ]) 0)
> > >> >
> > >> > as the operand, since the argument to the dup is just an SSA_NAME.
> > >>
> > >> Ah, yeah, I'd forgotten that fixed-length vec_duplicates would come
> > >> from a constructor rather than a vec_duplicate_expr, so we don't
> > >> get the usual benefit of folding single-use mems during expand.
> > >>
> > >> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595362.html
> > >> moves towards using vec_duplicate even for fixed-length vectors.
> > >> If we take that approach, then I suppose a plain constructor should
> > >> be folded to a vec_duplicate where possible.
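> > >>
> > >> (I.e., in dump form, something like
> > >>
> > >>   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
> > >>
> > >> would fold to
> > >>
> > >>   _4 = VEC_DUPLICATE_EXPR <_3>;
> > >>
> > >> whenever all of the elements are the same.)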
> > >>
> > >> (Alternatively, we could use an extended vec_perm_expr with scalar
> > >> inputs, as Richi suggested in that thread.)
> > >>
> > >> If we don't do that, or don't do it yet, then…
> > >
> > > I suppose since we already have vec_duplicate we can just use it ...
> > > what was the reason not to do this originally?
> >
> > There just wasn't any specific benefit for fixed-length vectors at the
> > time, and there were obvious potential problems -- introducing
> > VEC_DUPLICATE_EXPRs too early would lose out on existing
> > CONSTRUCTOR-based folds.
> >
> > Also, isel didn't exist at the time that vec_duplicate was added, but
> > it seems like it might be a good place to do the replacement.
> >
> > Match rules that want to test for a uniform vector operand can already
> > use vec_same_elem_p to handle all representations, but perhaps we also
> > need a way of generating the “right” form of duplicate for the current
> > stage in the pass pipeline?
> 
> I think we can have vec_duplicate without native target support by
> expanding via CONSTRUCTOR, so vec_duplicate would be the correct form at
> all stages and we would fix it up during RTL expansion directly.
> 
> As you noted, most targets don't implement vec_duplicate yet.
> 
> > > I suppose the
> > > vec_duplicate expander has a fallback via store_constructor?
> > >
> > > Originally I wanted to avoid multiple ways to express the same thing,
> > > but vec_duplicate is a common enough special case and it also
> > > usually maps to a special instruction in vector ISAs.
> > > There's VIEW_CONVERT vs. vec_duplicate for V1m modes then; I
> > > suppose VIEW_CONVERT is more canonical here.
> >
> > Is that already true for V1m constructors?  (view_convert being
> > canonical and constructors not, I mean.)
> 
> I think so, yes.
> 
> > What do you think about the suggestion in the other thread of making
> > VEC_PERM_EXPR take an arbitrary number of inputs, with (as you
> > suggested) the inputs allowed to be scalars rather than vectors?
> > VEC_PERM_EXPR could then replace both CONSTRUCTOR and
> > VEC_DUPLICATE_EXPR, and “optimising” a normal constructor to a
> > duplicate would just be a case of removing repeated scalar inputs.
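> >
> > (Concretely, the constructor from the example at the top:
> >
> >   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
> >
> > would become something like the hypothetical
> >
> >   _4 = VEC_PERM_EXPR <_3, { 0, 0, 0, 0, 0, 0, 0, 0 }>;
> >
> > i.e. a single scalar input plus a constant selector that repeats it.)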
> 
> It's indeed somewhat appealing to make VEC_PERM a Swiss army knife.
> I'm not sure about making it variable-length though; currently it's a nice
> GIMPLE ternary, while variable length would make it a SINGLE RHS with a
> GENERIC tree (unless we introduce a gimple_vec_perm special gimple node).
> That said, allowing scalars as VEC_PERM inputs just to get rid of
> VEC_DUPLICATE will still leave us with the VIEW_CONVERT special case.
> 
> At some point we might want to help targets with "interesting"
> ISAs by lowering VEC_PERM to supported .VEC_PERM_CONSTs and relaxing
> what permutes we allow earlier in the pipeline (I'm thinking of x86 with its
> many special permutation ops and the open-coded vec-perm-const
> expander).
> 
> So I'm not sure what to do, but I'm happy to widen VEC_DUPLICATE_EXPR use.

Just to check: this means detecting uniform CONSTRUCTORs during isel and
converting them to VEC_DUPLICATE_EXPR?

Cheers,
Tamar

> 
> Richard.
