On Mon, 13 Jun 2022, Richard Sandiford wrote: > Richard Biener <rguent...@suse.de> writes: > > On Wed, 8 Jun 2022, Richard Sandiford wrote: > >> Tamar Christina <tamar.christ...@arm.com> writes: > >> >> -----Original Message----- > >> >> From: Richard Sandiford <richard.sandif...@arm.com> > >> >> Sent: Wednesday, June 8, 2022 11:31 AM > >> >> To: Tamar Christina <tamar.christ...@arm.com> > >> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw > >> >> <richard.earns...@arm.com>; Marcus Shawcroft > >> >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com> > >> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure load > >> >> instructions > >> >> > >> >> Tamar Christina <tamar.christ...@arm.com> writes: > >> >> > Hi All, > >> >> > > >> >> > At some point in time we started lowering the ld1r instructions in > >> >> > gimple. > >> >> > > >> >> > That is: > >> >> > > >> >> > uint8x8_t f1(const uint8_t *in) { > >> >> > return vld1_dup_u8(&in[1]); > >> >> > } > >> >> > > >> >> > generates at gimple: > >> >> > > >> >> > _3 = MEM[(const uint8_t *)in_1(D) + 1B]; > >> >> > _4 = {_3, _3, _3, _3, _3, _3, _3, _3}; > >> >> > > >> >> > Which is good, but we then generate: > >> >> > > >> >> > f1: > >> >> > ldr b0, [x0, 1] > >> >> > dup v0.8b, v0.b[0] > >> >> > ret > >> >> > > >> >> > instead of ld1r. > >> >> > > >> >> > The reason for this is because the load instructions have a too > >> >> > restrictive predicate on them which causes combine not to be able to > >> >> > combine the instructions due to the predicate only accepting simple > >> >> addressing modes. > >> >> > > >> >> > This patch relaxes the predicate to accept any memory operand and > >> >> > relies on LRA to legitimize the address when it needs to as the > >> >> > constraint still only allows the simple addressing mode. Reload is > >> >> > always able to legitimize to these. > >> >> > > >> >> > Secondly since we are now actually generating more ld1r it became > >> >> > clear that the lane instructions suffer from a similar issue. > >> >> > > >> >> > i.e. > >> >> > > >> >> > float32x4_t f2(const float32_t *in, float32x4_t a) { > >> >> > float32x4_t dup = vld1q_dup_f32(&in[1]); > >> >> > return vfmaq_laneq_f32 (a, a, dup, 1); } > >> >> > > >> >> > would generate ld1r + vector fmla instead of ldr + lane fmla. > >> >> > > >> >> > The reason for this is similar to the ld1r issue. The predicate is > >> >> > too restrictive in only acception register operands but not memory. > >> >> > > >> >> > This relaxes it to accept register and/or memory while leaving the > >> >> > constraint to only accept registers. This will have LRA generate a > >> >> > reload if needed forcing the memory to registers using the standard > >> >> patterns. > >> >> > > >> >> > These two changes allow combine and reload to generate the right > >> >> sequences. > >> >> > > >> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > >> >> > >> >> This is going against the general direction of travel, which is to make > >> >> the > >> >> instruction's predicates and conditions enforce the constraints as much > >> >> as > >> >> possible (making optimistic assumptions about pseudo registers). > >> >> > >> >> The RA *can* deal with things like: > >> >> > >> >> (match_operand:M N "general_operand" "r") > >> >> > >> >> but it's best avoided, for a few reasons: > >> >> > >> >> (1) The fix-up will be done in LRA, so IRA will not see the temporary > >> >> registers. This can make the allocation of those temporaries > >> >> suboptimal but (more importantly) it might require other > >> >> previously-allocated registers to be spilled late due to the > >> >> unexpected increase in register pressure. > >> >> > >> >> (2) It ends up hiding instructions from the pre-RA optimisers. > >> >> > >> >> (3) It can also prevent combine opportunities (as well as create them), > >> >> unless the loose predicates in an insn I are propagated to all > >> >> patterns that might result from combining I with something else. > >> >> > >> >> It sounds like the first problem (not generating ld1r) could be fixed > >> >> by (a) > >> >> combining aarch64_simd_dup<mode> and *aarch64_simd_ld1r<mode>, so > >> >> that the register and memory alternatives are in the same pattern and > >> >> (b) > >> >> using the merged instruction(s) to implement the vec_duplicate optab. > >> >> Target-independent code should then make the address satisfy the > >> >> predicate, simplifying the address where necessary. > >> >> > >> > > >> > I think I am likely missing something here. I would assume that you > >> > wanted > >> > to use the optab to split the addressing off from the mem expression so > >> > the > >> > combined insn matches. > >> > > >> > But in that case, why do you need to combine the two instructions? > >> > I've tried and it doesn't work since the vec_duplicate optab doesn't see > >> > the > >> > mem as op1, because in gimple the mem is not part of the duplicate. > >> > > >> > So you still just see: > >> > > >> >>>> dbgrtx (ops[1].value) > >> > (subreg/s/v:QI (reg:SI 92 [ _3 ]) 0) > >> > > >> > As the operand as the argument to the dup is just an SSA_NAME. > >> > >> Ah, yeah, I'd forgotten that fixed-length vec_duplicates would > >> come from a constructor rather than a vec_duplicate_expr, so we don't > >> get the usual benefit of folding single-use mems during expand. > >> > >> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595362.html > >> moves towards using vec_duplicate even for fixed-length vectors. > >> If we take that approach, then I suppose a plain constructor > >> should be folded to a vec_duplicate where possible. > >> > >> (Alternatively, we could use an extended vec_perm_expr with > >> scalar inputs, as Richi suggested in that thread.) > >> > >> If we don't do that, or don't do it yet, then… > > > > I suppose since we alrady have vec_duplicate we can just use it ... > > what was the reason to not do this originally? > > There just wasn't any specific benefit for fixed-length vectors at the > time, and obvious potential problems -- introducing VEC_DUPLICATE_EXPRs > too early would lose out on existing CONSTRUCTOR-based folds. > > Also, isel didn't exist at the time that vec_duplicate was added, but it > seems like it might be a good place to do the replacement. > > Match rules that want to test for a uniform vector operand can already > use vec_same_elem_p to handle all representations, but perhaps we also > need a way of generating the “right” form of duplicate for the current > stage in the pass pipeline?
I think we can have vec_duplicate without native target support by expanding via CONSTRUCTOR, so vec_duplicate would be the correct one at all stages and we fixup during RTL expansion directly. As you noted most targets don't implement vec_duplicate yet. > > I suppose the > > vec_duplicate expander has a fallback via store_constuctor? > > > > Originally I wanted to avoid multiple ways to express the same thing > > but vec_duplicate is a common enough special-case and it also > > usually maps to a special instruction in vector ISAs. > > There's VIEW_CONVERT vs. vec_duplicate for V1m modes then, I > > suppose VIEW_CONVERT is more canonical here. > > Is that already true for V1m constructors? (view_convert being > canonical and constructors not, I mean.) I think so, yes. > What do you think about the suggestion in the other thread of making > VEC_PERM_EXPR take an arbitrary number of inputs, with (as you suggested) > the inputs allowed to be scalars rather than vectors? VEC_PERM_EXPR > could then replace both CONSTRUCTOR and VEC_DUPLICATE_EXPR and “optimising” > a normal constructor to a duplicate would just be a case of removing > repeated scalar inputs. It's indeed somewhat appealing to make VEC_PERM a swiss army knife. I'm not sure about making it a VL tree though, currently it's a nice GIMPLE ternary while VL would make it a SINGLE RHS with a GENERIC tree (unless we introduce a gimple_vec_perm special gimple node). That said, allowing scalars as VEC_PERM inputs to get rid of VEC_DUPLICATE only will still leave us with the VIEW_CONVERT special case. At some point we might want to help targets with "interesting" ISAs by lowering VEC_PERM to supported .VEC_PERM_CONSTs and relaxing what permutes we allow earlier in the pipeline (I'm thinking of x86 with its many special permutation ops and the open-coded vec-perm-const expander). So not sure what to do, but I'm happy to widen VEC_DUPLICATE_EXPR use. Richard.