> On 3 Mar 2025, at 09:49, Andrew Pinski <pins...@gmail.com> wrote: > > On Mon, Mar 3, 2025 at 12:43 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote: >> >> >> >>> On 28 Feb 2025, at 19:06, Andrew Pinski <pins...@gmail.com> wrote: >>> >>> On Fri, Feb 28, 2025 at 5:25 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote: >>>> >>>> Hi all, >>>> >>>> In this PR late-combine was failing to merge: >>>> dup v31.4s, v31.s[3] >>>> fmla v30.4s, v31.4s, v29.4s >>>> into the lane-wise fmla form. >>>> This is because late-combine checks may_trap_p under the hood on the dup >>>> insn. >>>> This ended up returning true for the insn: >>>> (set (reg:V4SF 152 [ _32 ]) >>>> (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ]) >>>> (parallel:V4SF [ >>>> (const_int 3 [0x3])])))) >>>> >>>> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of >>>> floating-point modes can't trap, it assumed that the V4SF parallel can >>>> trap. >>>> The correct behaviour is to recurse into vector inside the PARALLEL and >>>> check >>>> the sub-expression. This patch adjusts may_trap_p_1 to do just that. >>>> With this check the above insn is not deemed to be trapping and is >>>> propagated >>>> into the FMLA giving: >>>> >>>> fmla vD.4s, vA.4s, vB.s[3] >>>> >>>> The testcase is reduced from a larger example so it has some C++ cruft >>>> around >>>> the main part but still demonstrates the desired effect. >>>> >>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>> Apparently this also fixes a regression in >>>> gcc.target/aarch64/vmul_element_cost.c that I observed. >>>> >>>> Ok for trunk? >>> >>> This looks ok but here is a better/more reduced testcase without the >>> extra C++ism: >>> ``` >>> /* { dg-do compile } */ >>> /* { dg-additional-options "-O2" } */ >>> #include <arm_neon.h> >>> >>> float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d) >>> { >>> float32x4_t t = a; >>> t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); >>> t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); >>> return t; >>> } >>> /* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[3\]\n} } >>> } */ >>> /* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, >>> v[0-9]+\.4s, v[0-9]+\.s\[3\]\n} 1 } } */ >>> ``` >>> >> >> Thanks, but this doesn’t reproduce the issue on trunk. Normal combine merges >> the lane forms as is. >> What we want is a case where combine refuses to do the merging but late >> combine does (or should be doing). > > I just tested it on the trunk and we get dup: > https://godbolt.org/z/Yfc7d1r1x > > If I had used vgetq_lane_f32 with lane 0, then there would have been > no dup as that would be represented as a subreg. >
You’re right, I must have been misusing godbolt this morning. I’ve sent an updated patch with the smaller test case at: https://gcc.gnu.org/pipermail/gcc-patches/2025-March/676730.html Thanks again, Kyrill > Thanks, > Andrew > > >> >> Thanks, >> Kyrill >> >>> Thanks, >>> Andrew Pinski >>> >>>> Thanks, >>>> Kyrill >>>> >>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> >>>> >>>> gcc/ >>>> >>>> PR rtl-optimization/119046 >>>> * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping. >>>> >>>> gcc/testsuite/ >>>> >>>> PR rtl-optimization/119046 >>>> * g++.target/aarch64/pr119046.C: New test. >>>> >>