> On 28 Feb 2025, at 19:06, Andrew Pinski <pins...@gmail.com> wrote: > > On Fri, Feb 28, 2025 at 5:25 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote: >> >> Hi all, >> >> In this PR late-combine was failing to merge: >> dup v31.4s, v31.s[3] >> fmla v30.4s, v31.4s, v29.4s >> into the lane-wise fmla form. >> This is because late-combine checks may_trap_p under the hood on the dup >> insn. >> This ended up returning true for the insn: >> (set (reg:V4SF 152 [ _32 ]) >> (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ]) >> (parallel:V4SF [ >> (const_int 3 [0x3])])))) >> >> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of >> floating-point modes can't trap, it assumed that the V4SF parallel can trap. >> The correct behaviour is to recurse into vector inside the PARALLEL and check >> the sub-expression. This patch adjusts may_trap_p_1 to do just that. >> With this check the above insn is not deemed to be trapping and is propagated >> into the FMLA giving: >> >> fmla vD.4s, vA.4s, vB.s[3] >> >> The testcase is reduced from a larger example so it has some C++ cruft around >> the main part but still demonstrates the desired effect. >> >> Bootstrapped and tested on aarch64-none-linux-gnu. >> Apparently this also fixes a regression in >> gcc.target/aarch64/vmul_element_cost.c that I observed. >> >> Ok for trunk? > > This looks ok but here is a better/more reduced testcase without the > extra C++ism: > ``` > /* { dg-do compile } */ > /* { dg-additional-options "-O2" } */ > #include <arm_neon.h> > > float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d) > { > float32x4_t t = a; > t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); > t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); > return t; > } > /* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[3\]\n} } } > */ > /* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, > v[0-9]+\.4s, v[0-9]+\.s\[3\]\n} 1 } } */ > ``` >
Thanks, but this doesn’t reproduce the issue on trunk. Normal combine merges the lane forms as is. What we want is a case where combine refuses to do the merging but late combine does (or should be doing). Thanks, Kyrill > Thanks, > Andrew Pinski > >> Thanks, >> Kyrill >> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> >> >> gcc/ >> >> PR rtl-optimization/119046 >> * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping. >> >> gcc/testsuite/ >> >> PR rtl-optimization/119046 >> * g++.target/aarch64/pr119046.C: New test. >>