> On 5 Mar 2025, at 11:14, Richard Biener <richard.guent...@gmail.com> wrote: > > On Tue, Mar 4, 2025 at 10:01 PM Richard Sandiford > <richard.sandif...@arm.com> wrote: >> >> Kyrylo Tkachov <ktkac...@nvidia.com> writes: >>> Hi all, >>> >>> In this testcase late-combine was failing to merge: >>> dup v31.4s, v31.s[3] >>> fmla v30.4s, v31.4s, v29.4s >>> into the lane-wise fmla form. >>> This is because late-combine checks may_trap_p under the hood on the dup >>> insn. >>> This ended up returning true for the insn: >>> (set (reg:V4SF 152 [ _32 ]) >>> (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ]) >>> (parallel:V4SF [ >>> (const_int 3 [0x3])])))) >>> >>> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of >>> floating-point modes can't trap, it assumed that the V4SF parallel can trap. >>> The correct behaviour is to recurse into vector inside the PARALLEL and >>> check >>> the sub-expression. This patch adjusts may_trap_p_1 to do just that. >>> With this check the above insn is not deemed to be trapping and is >>> propagated >>> into the FMLA giving: >>> fmla vD.4s, vA.4s, vB.s[3] >>> >>> Bootstrapped and tested on aarch64-none-linux-gnu. >>> Apparently this also fixes a regression in >>> gcc.target/aarch64/vmul_element_cost.c that I observed. >>> >>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> >>> >>> gcc/ >>> >>> PR rtl-optimization/119046 >>> * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping. >>> >>> gcc/testsuite/ >>> >>> PR rtl-optimization/119046 >>> * gcc.target/aarch64/pr119046.c: New test. >>> >>> From 39fa1be73e8fed7fc673329e1854ba41587623c4 Mon Sep 17 00:00:00 2001 >>> From: Kyrylo Tkachov <ktkac...@nvidia.com> >>> Date: Thu, 27 Feb 2025 09:00:25 -0800 >>> Subject: [PATCH] PR rtl-optimization/119046: Don't mark PARALLEL RTXes with >>> floating-point mode as trapping >>> >>> In this testcase late-combine was failing to merge: >>> dup v31.4s, v31.s[3] >>> fmla v30.4s, v31.4s, v29.4s >>> into the lane-wise fmla form. >>> This is because late-combine checks may_trap_p under the hood on the dup >>> insn. >>> This ended up returning true for the insn: >>> (set (reg:V4SF 152 [ _32 ]) >>> (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ]) >>> (parallel:V4SF [ >>> (const_int 3 [0x3])])))) >>> >>> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of >>> floating-point modes can't trap, it assumed that the V4SF parallel can trap. >>> The correct behaviour is to recurse into vector inside the PARALLEL and >>> check >>> the sub-expression. This patch adjusts may_trap_p_1 to do just that. >>> With this check the above insn is not deemed to be trapping and is >>> propagated >>> into the FMLA giving: >>> fmla vD.4s, vA.4s, vB.s[3] >>> >>> Bootstrapped and tested on aarch64-none-linux-gnu. >>> Apparently this also fixes a regression in >>> gcc.target/aarch64/vmul_element_cost.c that I observed. >>> >>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> >>> >>> gcc/ >>> >>> PR rtl-optimization/119046 >>> * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping. >> >> I don't object to this, if someone else agrees that it's the right fix. >> But the mode on the parallel looks iffy to me, in that the data is not >> floating-point data. VOIDmode would probably be more accurate and >> seems to be what x86 uses. > > I agree that looks iffy, the fix is still OK IMO - we shouldn't decide > that a parallel > traps based on its mode (which has no meaning - it's not even documented to > have a mode).
Thanks, I’ve tracked down where the V4SF PARALLEL is being created and I’ll post a separate patch to fix it after some testing. Kyrill > > Richard. > >> >> Thanks, >> Richard >> >>> >>> gcc/testsuite/ >>> >>> PR rtl-optimization/119046 >>> * gcc.target/aarch64/pr119046.c: New test. >>> --- >>> gcc/rtlanal.cc | 1 + >>> gcc/testsuite/gcc.target/aarch64/pr119046.c | 16 ++++++++++++++++ >>> 2 files changed, 17 insertions(+) >>> create mode 100644 gcc/testsuite/gcc.target/aarch64/pr119046.c >>> >>> diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc >>> index 8caffafdaa4..7ad67afb9fe 100644 >>> --- a/gcc/rtlanal.cc >>> +++ b/gcc/rtlanal.cc >>> @@ -3252,6 +3252,7 @@ may_trap_p_1 (const_rtx x, unsigned flags) >>> return true; >>> break; >>> >>> + case PARALLEL: >>> case NEG: >>> case ABS: >>> case SUBREG: >>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr119046.c >>> b/gcc/testsuite/gcc.target/aarch64/pr119046.c >>> new file mode 100644 >>> index 00000000000..aa5fa7c848c >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.target/aarch64/pr119046.c >>> @@ -0,0 +1,16 @@ >>> +/* { dg-do compile } */ >>> +/* { dg-additional-options "-O2" } */ >>> + >>> +#include <arm_neon.h> >>> + >>> +float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d) >>> +{ >>> + float32x4_t t = a; >>> + t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); >>> + t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d); >>> + return t; >>> +} >>> + >>> +/* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[1\]\n} >>> } } */ >>> +/* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, v[0-9]+\.4s, >>> v[0-9]+\.s\[1\]\n} 2 } } */ >>> +