Re: [PATCH] PR rtl-optimization/119046: Don't mark PARALLEL RTXes with floating-point mode as trapping

Kyrylo Tkachov Mon, 03 Mar 2025 01:22:08 -0800


> On 3 Mar 2025, at 09:49, Andrew Pinski <pins...@gmail.com> wrote:
> 
> On Mon, Mar 3, 2025 at 12:43 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>> 
>> 
>> 
>>> On 28 Feb 2025, at 19:06, Andrew Pinski <pins...@gmail.com> wrote:
>>> 
>>> On Fri, Feb 28, 2025 at 5:25 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> In this PR late-combine was failing to merge:
>>>> dup v31.4s, v31.s[3]
>>>> fmla v30.4s, v31.4s, v29.4s
>>>> into the lane-wise fmla form.
>>>> This is because late-combine checks may_trap_p under the hood on the dup 
>>>> insn.
>>>> This ended up returning true for the insn:
>>>> (set (reg:V4SF 152 [ _32 ])
>>>>      (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
>>>>              (parallel:V4SF [
>>>>                      (const_int 3 [0x3])]))))
>>>> 
>>>> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of
>>>> floating-point modes can't trap, it assumed that the V4SF parallel can 
>>>> trap.
>>>> The correct behaviour is to recurse into vector inside the PARALLEL and 
>>>> check
>>>> the sub-expression. This patch adjusts may_trap_p_1 to do just that.
>>>> With this check the above insn is not deemed to be trapping and is 
>>>> propagated
>>>> into the FMLA giving:
>>>> 
>>>> fmla vD.4s, vA.4s, vB.s[3]
>>>> 
>>>> The testcase is reduced from a larger example so it has some C++ cruft 
>>>> around
>>>> the main part but still demonstrates the desired effect.
>>>> 
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>> Apparently this also fixes a regression in
>>>> gcc.target/aarch64/vmul_element_cost.c that I observed.
>>>> 
>>>> Ok for trunk?
>>> 
>>> This looks ok but here is a better/more reduced testcase without the
>>> extra C++ism:
>>> ```
>>> /* { dg-do compile } */
>>> /* { dg-additional-options "-O2" } */
>>> #include <arm_neon.h>
>>> 
>>> float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d)
>>> {
>>> float32x4_t t = a;
>>> t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
>>> t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
>>> return t;
>>> }
>>> /* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[3\]\n} } 
>>> } */
>>> /* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s,
>>> v[0-9]+\.4s, v[0-9]+\.s\[3\]\n} 1 } } */
>>> ```
>>> 
>> 
>> Thanks, but this doesn’t reproduce the issue on trunk. Normal combine merges 
>> the lane forms as is.
>> What we want is a case where combine refuses to do the merging but late 
>> combine does (or should be doing).
> 
> I just tested it on the trunk and we get dup:
> https://godbolt.org/z/Yfc7d1r1x
> 
> If I had used vgetq_lane_f32 with lane 0, then there would have been
> no dup as that would be represented as a subreg.
>


You’re right, I must have been misusing godbolt this morning.
I’ve sent an updated patch with the smaller test case at:
https://gcc.gnu.org/pipermail/gcc-patches/2025-March/676730.html

Thanks again,
Kyrill

> Thanks,
> Andrew
> 
> 
>> 
>> Thanks,
>> Kyrill
>> 
>>> Thanks,
>>> Andrew Pinski
>>> 
>>>> Thanks,
>>>> Kyrill
>>>> 
>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>> 
>>>> gcc/
>>>> 
>>>>      PR rtl-optimization/119046
>>>>      * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping.
>>>> 
>>>> gcc/testsuite/
>>>> 
>>>>      PR rtl-optimization/119046
>>>>      * g++.target/aarch64/pr119046.C: New test.
>>>> 
>>

Re: [PATCH] PR rtl-optimization/119046: Don't mark PARALLEL RTXes with floating-point mode as trapping

Reply via email to