https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117072

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Compared to gcc14, I now get the following for cond_op_fma__Float16-1.c, for example:

foo1_fnms:
.LFB7:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L24:
        vmovdqa b(%rax), %ymm1
        vmovdqa d(%rax), %ymm0
        addq    $32, %rax
        vcmpph  $1, c-32(%rax), %ymm1, %k1
        vmovdqa e-32(%rax), %ymm1
        vfnmsub213ph    a-32(%rax), %ymm0, %ymm1
        vmovdqu16       %ymm1, %ymm0{%k1}
        vmovdqa %ymm0, a-32(%rax)
        cmpq    $1600, %rax
        jne     .L24
        vzeroupper
        ret

instead of the expected

foo1_fnms:
.LFB7:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L24:
        vmovdqa b(%rax), %ymm1
        vmovdqa a(%rax), %ymm2
        addq    $32, %rax
        vmovdqa d-32(%rax), %ymm0
        vcmpph  $1, c-32(%rax), %ymm1, %k1
        vfnmsub132ph    e-32(%rax), %ymm2, %ymm0{%k1}
        vmovdqa %ymm0, a-32(%rax)
        cmpq    $1600, %rax
        jne     .L24
        vzeroupper
        ret
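
For readers without the testcase at hand, here is a scalar sketch of what one
iteration appears to compute, reconstructed only from the dumps in this
comment. It is not the testcase source. Assumptions: float stands in for
_Float16 so the sketch builds on any compiler, and the function names
cond_fnms_lane and foo1_fnms_model are made up for illustration.

```c
#include <stddef.h>

/* One lane of the loop body, as read off the assembly and the .combine dump.
   Assumption: float replaces _Float16; a, b, c, d, e mirror the symbols
   referenced in the assembly.  */
static float
cond_fnms_lane (float a, float b, float c, float d, float e)
{
  /* vcmpph $1 is the "less than" predicate, so the mask is b < c.
     Active lanes compute -(d * e) - a, inactive lanes keep d.  */
  return b < c ? -(d * e) - a : d;
}

/* Apply the lane function over n elements, storing the result into a[].  */
static void
foo1_fnms_model (float *a, const float *b, const float *c,
                 const float *d, const float *e, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = cond_fnms_lane (a[i], b[i], c[i], d[i], e[i]);
}
```

The expected code folds the select into the masked vfnmsub132ph, while the
current code does an unmasked vfnmsub213ph followed by a separate masked
vmovdqu16.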

.combine shows in gcc14:

Trying 15 -> 16:
   15: r113:V16HF={-r102:V16HF*[r98:DI+`e']+-[r98:DI+`a']}
   16: r99:V16HF=vec_merge(r113:V16HF,r102:V16HF,r110:HI)
      REG_DEAD r113:V16HF
      REG_DEAD r110:HI
      REG_DEAD r102:V16HF
Successfully matched this instruction:
(set (reg:V16HF 99 [ _37 ])
    (vec_merge:V16HF (fma:V16HF (neg:V16HF (reg:V16HF 102 [ vect_pretmp_14.315 ]))
            (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.333 ])
                    (symbol_ref:DI ("e") [flags 0x2]  <var_decl 0x7ffff6810ea0 e>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.333_9 * 1]+0 S32 A256])
            (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.333 ])
                        (symbol_ref:DI ("a") [flags 0x2]  <var_decl 0x7ffff6810c60 a>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&a + ivtmp.333_9 * 1]+0 S32 A256])))
        (reg:V16HF 102 [ vect_pretmp_14.315 ])
        (reg:HI 110 [ mask__11.325_55 ])))

but

Trying 15 -> 16:
   15: r113:V16HF={-[r98:DI+`e']*r104:V16HF+-[r98:DI+`a']}
   16: r99:V16HF=vec_merge(r113:V16HF,r104:V16HF,r110:HI)
      REG_DEAD r113:V16HF
      REG_DEAD r110:HI
      REG_DEAD r104:V16HF
Failed to match this instruction:
(set (reg:V16HF 99 [ _37 ])
    (vec_merge:V16HF (fma:V16HF (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.329 ])
                        (symbol_ref:DI ("e") [flags 0x2]  <var_decl 0x7ffff6810ea0 e>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.329_9 * 1]+0 S32 A256]))
            (reg:V16HF 104 [ vect_pretmp_14.315 ])
            (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.329 ])
                        (symbol_ref:DI ("a") [flags 0x2]  <var_decl 0x7ffff6810c60 a>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&a + ivtmp.329_9 * 1]+0 S32 A256])))
        (reg:V16HF 104 [ vect_pretmp_14.315 ])
        (reg:HI 110 [ mask__11.309_43 ])))

See how the operand order of the commutative multiply in insn 15 differs;
this is what causes the match to fail:

good:     15: r113:V16HF={-r102:V16HF*[r98:DI+`e']+-[r98:DI+`a']}
bad:      15: r113:V16HF={-[r98:DI+`e']*r104:V16HF+-[r98:DI+`a']}

This ordering is already present in GIMPLE:

  vect_pretmp_14.315_45 = MEM <vector(16) _Float16> [(_Float16 *)&d + ivtmp.333_9 * 1];
  vect__5.322_52 = MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.333_9 * 1];
  _37 = .COND_FNMS (mask__11.325_55, vect_pretmp_14.315_45, vect__5.322_52, vect__3.318_48, vect_pretmp_14.315_45);

vs.

  vect_pretmp_14.315_49 = MEM <vector(16) _Float16> [(_Float16 *)&d + ivtmp.329_9 * 1];
  vect__5.312_46 = MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.329_9 * 1];
  _37 = .COND_FNMS (mask__11.309_43, vect__5.312_46, vect_pretmp_14.315_49, vect__3.319_53, vect_pretmp_14.315_49);

Both are canonicalized correctly (the multiplicands are ordered by SSA name
version).
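
To illustrate why both .COND_FNMS calls count as canonical even though their
multiplicands are swapped, here is a toy model of a "lower SSA version first"
rule. It is a sketch under that assumption, not GCC code; struct ssa_name,
should_swap and canonicalize_pair are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an SSA name: a base label plus its version.  */
struct ssa_name
{
  const char *base;
  unsigned version;
};

/* Toy rule: of two commutative operands, the lower version goes first.  */
static bool
should_swap (struct ssa_name op0, struct ssa_name op1)
{
  return op0.version > op1.version;
}

/* Put two commutative operands into canonical order in place.  */
static void
canonicalize_pair (struct ssa_name *op0, struct ssa_name *op1)
{
  if (should_swap (*op0, *op1))
    {
      struct ssa_name tmp = *op0;
      *op0 = *op1;
      *op1 = tmp;
    }
}
```

With the load from d as version 45 and the load from e as version 52, d ends
up first; with e as version 46 and d as version 49, e ends up first. The
order therefore depends only on which versions happened to be assigned.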

This is a spurious difference.  If we rely on these combines for the now
missed micro-optimization, we need to beef up the patterns
(avx512vl_fnmsub_v16hf_mask) to allow both orders.

A target issue IMO?

Alternatively, make sure RTL canonicalizes (fma (neg non-reg) (reg) ...)
to (fma (neg reg) (non-reg) ...), or stop matching that as a pattern and
thus force RTL expansion + combine to arrive at the correct variant?
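
The canonicalization idea can be sketched with a toy model. This is not GCC's
rtx representation or API; enum opnd_kind, struct toy_fnms,
toy_pattern_matches and toy_canonicalize are hypothetical names, and the
"pattern" is reduced to the single assumption that it only accepts a register
as the first multiplicand.

```c
#include <stdbool.h>

/* Toy operand kinds; these are not GCC's rtx codes.  */
enum opnd_kind { OPND_REG, OPND_MEM };

/* Toy model of (fma (neg mul0) mul1 (neg add)).  */
struct toy_fnms
{
  enum opnd_kind mul0;
  enum opnd_kind mul1;
  enum opnd_kind add;
};

/* Models a pattern that accepts only the register-first operand order.  */
static bool
toy_pattern_matches (const struct toy_fnms *x)
{
  return x->mul0 == OPND_REG;
}

/* Swap the commutative multiplicands so that a register comes first.
   The product is unchanged, so the negation stays attached to it.  */
static void
toy_canonicalize (struct toy_fnms *x)
{
  if (x->mul0 != OPND_REG && x->mul1 == OPND_REG)
    {
      enum opnd_kind tmp = x->mul0;
      x->mul0 = x->mul1;
      x->mul1 = tmp;
    }
}
```

In this model the "bad" order (memory operand first) fails
toy_pattern_matches until toy_canonicalize has run, while the "good" order is
left untouched. Extending the pattern to accept both orders would be the
other way to get the same effect.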
