On Mon, Mar 3, 2025 at 12:43 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>
>
>
> > On 28 Feb 2025, at 19:06, Andrew Pinski <pins...@gmail.com> wrote:
> >
> > On Fri, Feb 28, 2025 at 5:25 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> >>
> >> Hi all,
> >>
> >> In this PR late-combine was failing to merge:
> >> dup v31.4s, v31.s[3]
> >> fmla v30.4s, v31.4s, v29.4s
> >> into the lane-wise fmla form.
> >> This is because late-combine checks may_trap_p under the hood on the dup 
> >> insn.
> >> This ended up returning true for the insn:
> >> (set (reg:V4SF 152 [ _32 ])
> >>       (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
> >>               (parallel:V4SF [
> >>                       (const_int 3 [0x3])]))))
> >>
> >> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of
> >> floating-point modes can't trap, it assumed that the V4SF parallel can 
> >> trap.
> >> The correct behaviour is to recurse into vector inside the PARALLEL and 
> >> check
> >> the sub-expression. This patch adjusts may_trap_p_1 to do just that.
> >> With this check the above insn is not deemed to be trapping and is 
> >> propagated
> >> into the FMLA giving:
> >>
> >> fmla vD.4s, vA.4s, vB.s[3]
> >>
> >> The testcase is reduced from a larger example so it has some C++ cruft 
> >> around
> >> the main part but still demonstrates the desired effect.
> >>
> >> Bootstrapped and tested on aarch64-none-linux-gnu.
> >> Apparently this also fixes a regression in
> >> gcc.target/aarch64/vmul_element_cost.c that I observed.
> >>
> >> Ok for trunk?
> >
> > This looks ok but here is a better/more reduced testcase without the
> > extra C++ism:
> > ```
> > /* { dg-do compile } */
> > /* { dg-additional-options "-O2" } */
> > #include <arm_neon.h>
> >
> > float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d)
> > {
> > float32x4_t t = a;
> > t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
> > t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
> > return t;
> > }
> > /* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[3\]\n} } 
> > } */
> > /* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s,
> > v[0-9]+\.4s, v[0-9]+\.s\[3\]\n} 1 } } */
> > ```
> >
>
> Thanks, but this doesn’t reproduce the issue on trunk. Normal combine merges 
> the lane forms as is.
> What we want is a case where combine refuses to do the merging but late 
> combine does (or should be doing).

I just tested it on the trunk and we get dup:
https://godbolt.org/z/Yfc7d1r1x

If I had used vgetq_lane_f32 with lane 0, then there would have been
no dup as that would be represented as a subreg.

Thanks,
Andrew


>
> Thanks,
> Kyrill
>
> > Thanks,
> > Andrew Pinski
> >
> >> Thanks,
> >> Kyrill
> >>
> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> >>
> >> gcc/
> >>
> >>       PR rtl-optimization/119046
> >>       * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as 
> >> trapping.
> >>
> >> gcc/testsuite/
> >>
> >>       PR rtl-optimization/119046
> >>       * g++.target/aarch64/pr119046.C: New test.
> >>
>

Reply via email to