https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117072
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Compared to gcc14 I have for example for cond_op_fma__Float16-1.c

foo1_fnms:
.LFB7:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L24:
        vmovdqa b(%rax), %ymm1
        vmovdqa d(%rax), %ymm0
        addq    $32, %rax
        vcmpph  $1, c-32(%rax), %ymm1, %k1
        vmovdqa e-32(%rax), %ymm1
        vfnmsub213ph    a-32(%rax), %ymm0, %ymm1
        vmovdqu16       %ymm1, %ymm0{%k1}
        vmovdqa %ymm0, a-32(%rax)
        cmpq    $1600, %rax
        jne     .L24
        vzeroupper
        ret

instead of the expected

foo1_fnms:
.LFB7:
        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L24:
        vmovdqa b(%rax), %ymm1
        vmovdqa a(%rax), %ymm2
        addq    $32, %rax
        vmovdqa d-32(%rax), %ymm0
        vcmpph  $1, c-32(%rax), %ymm1, %k1
        vfnmsub132ph    e-32(%rax), %ymm2, %ymm0{%k1}
        vmovdqa %ymm0, a-32(%rax)
        cmpq    $1600, %rax
        jne     .L24
        vzeroupper
        ret

.combine shows in gcc14:

Trying 15 -> 16:
   15: r113:V16HF={-r102:V16HF*[r98:DI+`e']+-[r98:DI+`a']}
   16: r99:V16HF=vec_merge(r113:V16HF,r102:V16HF,r110:HI)
      REG_DEAD r113:V16HF
      REG_DEAD r110:HI
      REG_DEAD r102:V16HF
Successfully matched this instruction:
(set (reg:V16HF 99 [ _37 ])
    (vec_merge:V16HF (fma:V16HF (neg:V16HF (reg:V16HF 102 [ vect_pretmp_14.315 ]))
            (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.333 ])
                    (symbol_ref:DI ("e") [flags 0x2] <var_decl 0x7ffff6810ea0 e>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.333_9 * 1]+0 S32 A256])
            (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.333 ])
                        (symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff6810c60 a>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&a + ivtmp.333_9 * 1]+0 S32 A256])))
        (reg:V16HF 102 [ vect_pretmp_14.315 ])
        (reg:HI 110 [ mask__11.325_55 ])))

but

Trying 15 -> 16:
   15: r113:V16HF={-[r98:DI+`e']*r104:V16HF+-[r98:DI+`a']}
   16: r99:V16HF=vec_merge(r113:V16HF,r104:V16HF,r110:HI)
      REG_DEAD r113:V16HF
      REG_DEAD r110:HI
      REG_DEAD r104:V16HF
Failed to match this instruction:
(set (reg:V16HF 99 [ _37 ])
    (vec_merge:V16HF (fma:V16HF (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.329 ])
                        (symbol_ref:DI ("e") [flags 0x2] <var_decl 0x7ffff6810ea0 e>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.329_9 * 1]+0 S32 A256]))
            (reg:V16HF 104 [ vect_pretmp_14.315 ])
            (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.329 ])
                        (symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff6810c60 a>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&a + ivtmp.329_9 * 1]+0 S32 A256])))
        (reg:V16HF 104 [ vect_pretmp_14.315 ])
        (reg:HI 110 [ mask__11.309_43 ])))

See how the commutative multiply part of insn 15 differs and causes the
matching to fail:

good:
   15: r113:V16HF={-r102:V16HF*[r98:DI+`e']+-[r98:DI+`a']}
bad:
   15: r113:V16HF={-[r98:DI+`e']*r104:V16HF+-[r98:DI+`a']}

This ordering is already present on GIMPLE:

  vect_pretmp_14.315_45 = MEM <vector(16) _Float16> [(_Float16 *)&d + ivtmp.333_9 * 1];
  vect__5.322_52 = MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.333_9 * 1];
  _37 = .COND_FNMS (mask__11.325_55, vect_pretmp_14.315_45, vect__5.322_52, vect__3.318_48, vect_pretmp_14.315_45);

vs.

  vect_pretmp_14.315_49 = MEM <vector(16) _Float16> [(_Float16 *)&d + ivtmp.329_9 * 1];
  vect__5.312_46 = MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.329_9 * 1];
  _37 = .COND_FNMS (mask__11.309_43, vect__5.312_46, vect_pretmp_14.315_49, vect__3.319_53, vect_pretmp_14.315_49);

Both are canonicalized correctly (according to SSA name version). This is a
spurious difference; if we rely on these combines for the now-missed
micro-optimization, we need to beef up the patterns
(avx512vl_fnmsub_v16hf_mask) to allow both orders. A target issue IMO?

Alternatively, make sure RTL canonicalizes (fma (neg non-reg) (reg) ...) to
(fma (neg reg) (non-reg) ...), or stop matching that as a pattern and thus
force RTL expansion + combine to arrive at the correct variant?
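
To make the canonicalization idea above concrete, here is a minimal toy model in C. It is only a sketch: the types and functions (toy_operand, toy_fma, toy_canonicalize, toy_pattern_matches) are hypothetical names invented for illustration and are not GCC's rtx API or the actual i386 patterns. It assumes the only thing that matters is which multiplicand of the negated multiply is a register and which is a memory operand.

```c
#include <stdbool.h>

/* Toy operand kinds; hypothetical, not GCC's rtx codes. */
enum toy_kind { TOY_REG, TOY_MEM };

struct toy_operand {
    enum toy_kind kind;
    const char *name;   /* e.g. "r104" or "[r98+e]" */
};

/* Models the multiply part of (fma (neg m1) m2 ...).
   The addend and the vec_merge wrapper are left out. */
struct toy_fma {
    bool neg_first;             /* first multiplicand is wrapped in neg */
    struct toy_operand m1, m2;  /* commutative multiplicands */
};

/* Put the register first when the first multiplicand is not a register
   and the second one is.  Multiplication is commutative, so
   (-mem) * reg can be rewritten as (-reg) * mem: the operands swap and
   the negation stays on the first slot.  Returns true if it swapped. */
static bool toy_canonicalize(struct toy_fma *f)
{
    if (f->m1.kind != TOY_REG && f->m2.kind == TOY_REG) {
        struct toy_operand tmp = f->m1;
        f->m1 = f->m2;
        f->m2 = tmp;
        return true;
    }
    return false;
}

/* Stand-in for a pattern that only accepts the "good" order seen in
   the gcc14 dump: negated register first, memory operand second. */
static bool toy_pattern_matches(const struct toy_fma *f)
{
    return f->neg_first && f->m1.kind == TOY_REG && f->m2.kind == TOY_MEM;
}
```

In this model the "bad" insn 15 order (memory operand first) fails toy_pattern_matches until toy_canonicalize has swapped it, while the "good" order is left untouched. Accepting both orders in the pattern itself would be the other option mentioned above.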