> On 28 Feb 2025, at 19:06, Andrew Pinski <pins...@gmail.com> wrote:
> 
> On Fri, Feb 28, 2025 at 5:25 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>> 
>> Hi all,
>> 
>> In this PR late-combine was failing to merge:
>> dup v31.4s, v31.s[3]
>> fmla v30.4s, v31.4s, v29.4s
>> into the lane-wise fmla form.
>> This is because late-combine checks may_trap_p under the hood on the dup 
>> insn.
>> This ended up returning true for the insn:
>> (set (reg:V4SF 152 [ _32 ])
>>       (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
>>               (parallel:V4SF [
>>                       (const_int 3 [0x3])]))))
>> 
>> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of
>> floating-point modes can't trap, it assumed that the V4SF parallel can trap.
>> The correct behaviour is to recurse into vector inside the PARALLEL and check
>> the sub-expression. This patch adjusts may_trap_p_1 to do just that.
>> With this check the above insn is not deemed to be trapping and is propagated
>> into the FMLA giving:
>> 
>> fmla vD.4s, vA.4s, vB.s[3]
>> 
>> The testcase is reduced from a larger example so it has some C++ cruft around
>> the main part but still demonstrates the desired effect.
>> 
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Apparently this also fixes a regression in
>> gcc.target/aarch64/vmul_element_cost.c that I observed.
>> 
>> Ok for trunk?
> 
> This looks ok but here is a better/more reduced testcase without the
> extra C++ism:
> ```
> /* { dg-do compile } */
> /* { dg-additional-options "-O2" } */
> #include <arm_neon.h>
> 
> float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d)
> {
> float32x4_t t = a;
> t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
> t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
> return t;
> }
> /* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[3\]\n} } } 
> */
> /* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s,
> v[0-9]+\.4s, v[0-9]+\.s\[3\]\n} 1 } } */
> ```
> 

Thanks, but this doesn’t reproduce the issue on trunk. Normal combine merges 
the lane forms as is.
What we want is a case where combine refuses to do the merging but late combine 
does (or should be doing).

Thanks,
Kyrill

> Thanks,
> Andrew Pinski
> 
>> Thanks,
>> Kyrill
>> 
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> 
>> gcc/
>> 
>>       PR rtl-optimization/119046
>>       * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping.
>> 
>> gcc/testsuite/
>> 
>>       PR rtl-optimization/119046
>>       * g++.target/aarch64/pr119046.C: New test.
>> 

Reply via email to