On Fri, Dec 13, 2024 at 9:57 AM Jakub Jelinek <ja...@redhat.com> wrote: > > Hi! > > As mentioned in the PR, the addition of vec_addsubv2sf3 expander caused > the testcase to be vectorized and no longer to use fma. > The following patch adds new expanders so that it can be vectorized > again with the alternating add/sub fma instructions. > > There is some bug on the slp cost computation side which causes it > not to count some scalar multiplication costs, but I think the patch > is desirable anyway before that is fixed and the testcase for now just > uses -fvect-cost-model=unlimited. > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > 2024-12-13 Jakub Jelinek <ja...@redhat.com> > > PR target/116979 > * config/i386/mmx.md (vec_fmaddsubv2sf4, vec_fmsubaddv2sf4): New > define_expand patterns. > > * gcc.target/i386/pr116979.c: New test.
OK with a small test adjustment (scan string target). Thanks, Uros. > > --- gcc/config/i386/mmx.md.jj 2024-12-12 19:46:50.651306295 +0100 > +++ gcc/config/i386/mmx.md 2024-12-12 20:15:39.502007436 +0100 > @@ -1132,6 +1132,54 @@ (define_expand "vec_addsubv2sf3" > DONE; > }) > > +(define_expand "vec_fmaddsubv2sf4" > + [(match_operand:V2SF 0 "register_operand") > + (match_operand:V2SF 1 "nonimmediate_operand") > + (match_operand:V2SF 2 "nonimmediate_operand") > + (match_operand:V2SF 3 "nonimmediate_operand")] > + "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL) > + && TARGET_MMX_WITH_SSE > + && ix86_partial_vec_fp_math" > +{ > + rtx op3 = gen_reg_rtx (V4SFmode); > + rtx op2 = gen_reg_rtx (V4SFmode); > + rtx op1 = gen_reg_rtx (V4SFmode); > + rtx op0 = gen_reg_rtx (V4SFmode); > + > + emit_insn (gen_movq_v2sf_to_sse (op3, operands[3])); > + emit_insn (gen_movq_v2sf_to_sse (op2, operands[2])); > + emit_insn (gen_movq_v2sf_to_sse (op1, operands[1])); > + > + emit_insn (gen_vec_fmaddsubv4sf4 (op0, op1, op2, op3)); > + > + emit_move_insn (operands[0], lowpart_subreg (V2SFmode, op0, V4SFmode)); > + DONE; > +}) > + > +(define_expand "vec_fmsubaddv2sf4" > + [(match_operand:V2SF 0 "register_operand") > + (match_operand:V2SF 1 "nonimmediate_operand") > + (match_operand:V2SF 2 "nonimmediate_operand") > + (match_operand:V2SF 3 "nonimmediate_operand")] > + "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL) > + && TARGET_MMX_WITH_SSE > + && ix86_partial_vec_fp_math" > +{ > + rtx op3 = gen_reg_rtx (V4SFmode); > + rtx op2 = gen_reg_rtx (V4SFmode); > + rtx op1 = gen_reg_rtx (V4SFmode); > + rtx op0 = gen_reg_rtx (V4SFmode); > + > + emit_insn (gen_movq_v2sf_to_sse (op3, operands[3])); > + emit_insn (gen_movq_v2sf_to_sse (op2, operands[2])); > + emit_insn (gen_movq_v2sf_to_sse (op1, operands[1])); > + > + emit_insn (gen_vec_fmsubaddv4sf4 (op0, op1, op2, op3)); > + > + emit_move_insn (operands[0], lowpart_subreg (V2SFmode, op0, V4SFmode)); > + DONE; > +}) > + > ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; > ;; > ;; Parallel single-precision floating point comparisons > --- gcc/testsuite/gcc.target/i386/pr116979.c.jj 2024-12-12 20:19:18.179934902 > +0100 > +++ gcc/testsuite/gcc.target/i386/pr116979.c 2024-12-12 20:21:31.685059095 > +0100 > @@ -0,0 +1,24 @@ > +/* PR target/116979 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mfma -fvect-cost-model=unlimited" } */ > +/* { dg-final { scan-assembler "vfmaddsub(?:132|213|231)pd" } } */ > +/* { dg-final { scan-assembler "vfmaddsub(?:132|213|231)ps" { target lp64 } > } } */ /* dg-final { scan-assembler "..." { target { ! ia32 } } } } */ x32 is TARGET_MMX_WITH_SSE ilp32 target and is able to vectorize the testcase as well. > + > +struct S { __complex__ float f; }; > +struct T { __complex__ double f; }; > + > +struct S > +foo (const struct S *a, const struct S *b) > +{ > + struct S r; > + r.f = a->f * b->f; > + return r; > +} > + > +struct T > +bar (const struct T *a, const struct T *b) > +{ > + struct T r; > + r.f = a->f * b->f; > + return r; > +} > > Jakub >