On Mon, Mar 3, 2025 at 12:43 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote: > > > > > On 28 Feb 2025, at 19:06, Andrew Pinski <pins...@gmail.com> wrote: > > > > On Fri, Feb 28, 2025 at 5:25 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote: > >> > >> Hi all, > >> > >> In this PR late-combine was failing to merge: > >> dup v31.4s, v31.s[3] > >> fmla v30.4s, v31.4s, v29.4s > >> into the lane-wise fmla form. > >> This is because late-combine checks may_trap_p under the hood on the dup > >> insn. > >> This ended up returning true for the insn: > >> (set (reg:V4SF 152 [ _32 ]) > >> (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ]) > >> (parallel:V4SF [ > >> (const_int 3 [0x3])])))) > >> > >> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of > >> floating-point modes can't trap, it assumed that the V4SF parallel can > >> trap. > >> The correct behaviour is to recurse into vector inside the PARALLEL and > >> check > >> the sub-expression. This patch adjusts may_trap_p_1 to do just that. > >> With this check the above insn is not deemed to be trapping and is > >> propagated > >> into the FMLA giving: > >> > >> fmla vD.4s, vA.4s, vB.s[3] > >> > >> The testcase is reduced from a larger example so it has some C++ cruft > >> around > >> the main part but still demonstrates the desired effect. > >> > >> Bootstrapped and tested on aarch64-none-linux-gnu. > >> Apparently this also fixes a regression in > >> gcc.target/aarch64/vmul_element_cost.c that I observed. > >> > >> Ok for trunk? > > > > This looks ok but here is a better/more reduced testcase without the > > extra C++ism: > > ``` > > /* { dg-do compile } */ > > /* { dg-additional-options "-O2" } */ > > #include <arm_neon.h> > > > > float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d) > > { > > float32x4_t t = a; > > t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); > > t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); > > return t; > > } > > /* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[3\]\n} } > > } */ > > /* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, > > v[0-9]+\.4s, v[0-9]+\.s\[3\]\n} 1 } } */ > > ``` > > > > Thanks, but this doesn’t reproduce the issue on trunk. Normal combine merges > the lane forms as is. > What we want is a case where combine refuses to do the merging but late > combine does (or should be doing).
I just tested it on the trunk and we get dup: https://godbolt.org/z/Yfc7d1r1x If I had used vgetq_lane_f32 with lane 0, then there would have been no dup as that would be represented as a subreg. Thanks, Andrew > > Thanks, > Kyrill > > > Thanks, > > Andrew Pinski > > > >> Thanks, > >> Kyrill > >> > >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> > >> > >> gcc/ > >> > >> PR rtl-optimization/119046 > >> * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as > >> trapping. > >> > >> gcc/testsuite/ > >> > >> PR rtl-optimization/119046 > >> * g++.target/aarch64/pr119046.C: New test. > >> >