On Thu, May 11, 2023 at 12:13 PM Cui, Lili via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> From: Lili Cui <lili....@intel.com>
>
> Hi,
>
> These two patches add a param to control the length of the chain with
> FMA in the reassoc pass, and a tuning option in the backend, respectively.
>
> Bootstrapped and regtested. Ok for trunk?
>
> Regards
> Lili.
>
> Add a param for the chain with FMA in the reassoc pass to make it more
> friendly to the later FMA pass. First we detect whether the chain is able
> to generate more than 2 FMAs; if so and
> param_reassoc_max_chain_length_with_fma is enabled, we rearrange the ops
> so that they can be combined into more FMAs. When the chain length exceeds
> param_reassoc_max_chain_length_with_fma, we build parallel chains according
> to the given association width and try to keep the FMA opportunities as
> much as possible.
>
> TEST1:
>
> float
> foo (float a, float b, float c, float d, float *e)
> {
>   return *e + a * b + c * d;
> }
>
> For -Ofast -march=icelake-server GCC generates:
>         vmulss  %xmm3, %xmm2, %xmm2
>         vfmadd132ss     %xmm1, %xmm2, %xmm0
>         vaddss  (%rdi), %xmm0, %xmm0
>         ret
>
> With "--param=reassoc-max-chain-length-with-fma=3" GCC generates:
>         vfmadd213ss     (%rdi), %xmm1, %xmm0
>         vfmadd231ss     %xmm2, %xmm3, %xmm0
>         ret
>
> gcc/ChangeLog:
>
>         PR gcc/98350
>         * params.opt (reassoc-max-chain-length-with-fma): New param.
>         * tree-ssa-reassoc.cc
>         (rewrite_expr_tree_parallel_for_fma): New.
>         (rank_ops_for_fma): Ditto.
>         (reassociate_bb): Handle new function.
>
> gcc/testsuite/ChangeLog:
>
>         PR gcc/98350
>         * gcc.dg/pr98350-1.c: New test.
>         * gcc.dg/pr98350-2.c: Ditto.
> ---
>  gcc/params.opt                   |   4 +
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31 +++++
>  gcc/testsuite/gcc.dg/pr98350-2.c |  17 +++
>  gcc/tree-ssa-reassoc.cc          | 228 ++++++++++++++++++++++++++++---
>  4 files changed, 264 insertions(+), 16 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
>
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 823cdb2ff85..f7c719afe64 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1182,4 +1182,8 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
>  Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
>  Enable loop vectorization of floating point inductions.
>
> +-param=reassoc-max-chain-length-with-fma=
> +Common Joined UInteger Var(param_reassoc_max_chain_length_with_fma) Init(1) IntegerRange(1, 65536) Param Optimization
> +The maximum chain length with fma considered in reassociation pass.
> +
>  ; This comment is to ensure we retain the blank line above.
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
> index 00000000000..32ecce13a2d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=7 -Wno-attributes " } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> +   to generate more FMA instructions.  */
> +#define N 1024
> +double a[N];
> +double b[N];
> +double c[N];
> +double d[N];
> +double e[N];
> +double f[N];
> +double g[N];
> +double h[N];
> +double j[N];
> +double k[N];
> +double l[N];
> +double m[N];
> +double o[N];
> +double p[N];
> +
> +
> +void
> +foo (void)
> +{
> +  for (int i = 0; i < N; i++)
> +  {
> +    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> +  }
> +}
> +/* { dg-final { scan-assembler-times "vfm" 6 } } */
> diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
> new file mode 100644
> index 00000000000..246025d43b8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-2.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma --param=reassoc-max-chain-length-with-fma=6 -Wno-attributes " } */
> +
> +/* Test that the compiler properly builds parallel chains according to the
> +   given association width and tries to keep FMA opportunities as much as
> +   possible.  */
> +#define N 33
> +double a[N];
> +
> +void
> +foo (void)
> +{
> +  a[32] = a[0] *a[1] + a[2] * a[3] + a[4] * a[5] + a[6] * a[7] + a[8] * a[9]
> +    + a[10] * a[11] + a[12] * a[13] + a[14] * a[15] + a[16] * a[17]
> +    + a[18] * a[19] + a[20] * a[21] + a[22] * a[23] + a[24] + a[25]
> +    + a[26] + a[27] + a[28] + a[29] + a[30] + a[31];
> +}
> +/* { dg-final { scan-assembler-times "vfm" 12 } } */
> diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> index 067a3f07f7e..6d2e158c4f5 100644
> --- a/gcc/tree-ssa-reassoc.cc
> +++ b/gcc/tree-ssa-reassoc.cc
> @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-reassoc.h"
>  #include "tree-ssa-math-opts.h"
>  #include "gimple-range.h"
> +#include "internal-fn.h"
>
>  /* This is a simple global reassociation pass.  It is, in part, based
>     on the LLVM pass of the same name (They do some things more/less
> @@ -5468,6 +5469,114 @@ get_reassociation_width (int ops_num, enum tree_code opc,
>    return width;
>  }
>
> +/* Rewrite statements with dependency chain with regard to the chance to
> +   generate FMA.  When the dependency chain length exceeds
> +   param_reassoc_max_chain_length_with_fma, build parallel chains according
> +   to the given association width and try to keep FMA opportunities as much
> +   as possible.
> +   E.g.
> +   e + f + g + a * b + c * d;
> +
> +   ssa1 = e + f;
> +   ssa2 = g + a * b;
> +   ssa3 = ssa1 + c * d;
> +   ssa4 = ssa2 + ssa3;
> +
> +   This reassociation approach preserves the chance of FMA generation as
> +   much as possible.  */
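
Just to check that I am reading the comment above correctly - restated as
plain C, with the adds that the later FMA pass can still contract marked.
This is only my own sketch of the intent, not part of the patch, and it
assumes the e + f + g + a * b + c * d example is split with a width of 2:

  double
  sketch (double a, double b, double c, double d, double e, double f, double g)
  {
    double ssa1 = e + f;         /* plain add, no multiply feeds it */
    double ssa2 = g + a * b;     /* can become .FMA (a, b, g) */
    double ssa3 = ssa1 + c * d;  /* can become .FMA (c, d, ssa1) */
    double ssa4 = ssa2 + ssa3;   /* joins the two parallel chains */
    return ssa4;
  }

i.e. each multiply still feeds an addition directly, which is what keeps
the two FMA opportunities alive.
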
> +static void
> +rewrite_expr_tree_parallel_for_fma (gassign *stmt, int width,
> +				    const vec<operand_entry *> &ops)
> +{
> +  enum tree_code opcode = gimple_assign_rhs_code (stmt);
> +  int op_num = ops.length ();
> +  gcc_assert (op_num > 0);
> +  int stmt_num = op_num - 1;
> +  gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
> +  int op_index = op_num - 1;
> +  int width_count = width;
> +  int i = 0, j = 0;
> +  tree tmp_op[2], op1;
> +  operand_entry *oe;
> +  gimple *stmt1 = NULL;
> +  tree last_rhs1 = gimple_assign_rhs1 (stmt);
> +
> +  /* We start expression rewriting from the top statements.
> +     So, in this loop we create a full list of statements
> +     we will work with.  */
> +  stmts[stmt_num - 1] = stmt;
> +  for (i = stmt_num - 2; i >= 0; i--)
> +    stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
> +
> +  /* Build parallel FMA dependency chain according to width.  */
> +  for (i = 0; i < width; i++)
> +    {
> +      for (j = 0; j < 2; j++)
> +	{
> +	  oe = ops[op_index--];
> +	  tmp_op[j] = oe->op;
> +	  stmt1 = oe->stmt_to_insert;
> +	  if (stmt1)
> +	    insert_stmt_before_use (stmts[i], stmt1);
> +	  stmt1 = NULL;
> +	}
> +      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
> +      gimple_set_visited (stmts[i], true);
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	{
> +	  fprintf (dump_file, " into ");
> +	  print_gimple_stmt (dump_file, stmts[i], 0);
> +	}
> +    }
> +
> +  for (i = width; i < stmt_num; i++)
> +    {
> +      /* We keep original statement only for the last one.  All others are
> +	 recreated.  */
> +      if ( op_index < 0)
> +	{
> +	  if (width_count == 2)
> +	    {
> +
> +	      gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
> +	      gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
> +	    }
> +	  else
> +	    {
> +
> +	      stmts[i] =
> +		build_and_add_sum (TREE_TYPE (last_rhs1),
> +				   gimple_assign_lhs (stmts[i-width_count]),
> +				   gimple_assign_lhs (stmts[i-width_count+1]),
> +				   opcode);
> +	      width_count--;
> +	    }
> +	  update_stmt (stmts[i]);
> +	}
> +      else
> +	{
> +	  oe = ops[op_index--];
> +	  op1 = oe->op;
> +	  stmt1 = oe->stmt_to_insert;
> +	  if (stmt1)
> +	    insert_stmt_before_use (stmts[i], stmt1);
> +	  stmt1 = NULL;
> +	  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> +					gimple_assign_lhs (stmts[i-width]),
> +					op1,
> +					opcode);
> +	  gimple_set_visited (stmts[i], true);
> +	}
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	{
> +	  fprintf (dump_file, " into ");
> +	  print_gimple_stmt (dump_file, stmts[i], 0);
> +	}
> +    }
> +  remove_visited_stmt_chain (last_rhs1);
> +}
> +
>  /* Recursively rewrite our linearized statements so that the operators
>     match those in OPS[OPINDEX], putting the computation in rank
>     order and trying to allow operations to be executed in
> @@ -6649,6 +6758,64 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
>      }
>  }
>
> +/* Rearrange ops to generate more FMAs when the chain may have more than 2
> +   FMAs.  Putting ops that are not defined by a mult in front can generate
> +   more FMAs.
> +   E.g.
> +   a * b + c * d + e generates:
> +
> +   _4 = c_9(D) * d_10(D);
> +   _12 = .FMA (a_7(D), b_8(D), _4);
> +   _11 = e_6(D) + _12;
> +
> +   Rearranging ops to e + a * b + c * d generates:
> +
> +   _4 = .FMA (c_7(D), d_8(D), _3);
> +   _11 = .FMA (a_5(D), b_6(D), _4);
> + */
> +static bool
> +rank_ops_for_fma (vec<operand_entry *> *ops)
> +{
> +  operand_entry *oe;
> +  unsigned int i;
> +  auto_vec<operand_entry *> ops_mult;
> +  auto_vec<operand_entry *> ops_others;
> +
> +  FOR_EACH_VEC_ELT (*ops, i, oe)
> +    {
> +      if (TREE_CODE (oe->op) == SSA_NAME)
> +	{
> +	  gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
> +	  if (is_gimple_assign (def_stmt)
> +	      && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
> +	    ops_mult.safe_push (oe);
> +	  else
> +	    ops_others.safe_push (oe);
> +	}
> +      else
> +	ops_others.safe_push (oe);
> +    }
> +  /* When ops_mult.length >= 2, as in the following case,
> +
> +     a * b + c * d + e.
> +
> +     we need to rearrange the ops.
> +
> +     Putting ops that are not defined by a mult in front can generate more
> +     FMAs.  */
> +  if (ops_mult.length () >= 2)
> +    {
> +      /* If all ops are defined by a mult, we don't need to rearrange
> +	 them.  */
> +      if (ops_mult.length () != ops->length ())
> +	{
> +	  ops->truncate (0);
> +	  FOR_EACH_VEC_ELT (ops_mult, i, oe)
> +	    ops->safe_push (oe);

As you are not changing the number of ops you should be able to use
quick_push here and below.  You should be able to do
ops->splice (ops_mult); ops->splice (ops_others); as well.
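
Concretely, something like this is what I have in mind (an untested sketch
only; since OPS is merely truncated and keeps its allocation, re-adding the
same elements cannot overflow the existing capacity):

	  if (ops_mult.length () != ops->length ())
	    {
	      ops->truncate (0);
	      /* Re-add the same elements, mults first; no reallocation can
		 happen, hence no safe_push needed.  */
	      ops->splice (ops_mult);
	      ops->splice (ops_others);
	    }
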
> +	  FOR_EACH_VEC_ELT (ops_others, i, oe)
> +	    ops->safe_push (oe);
> +	}
> +      return true;
> +    }
> +  return false;
> +}
>  /* Reassociate expressions in basic block BB and its post-dominator as
>     children.
>
> @@ -6813,6 +6980,7 @@ reassociate_bb (basic_block bb)
>  	      machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
>  	      int ops_num = ops.length ();
>  	      int width;
> +	      bool keep_fma_chain = false;
>
>  	      /* For binary bit operations, if there are at least 3
>  		 operands and the last operand in OPS is a constant,
> @@ -6826,36 +6994,64 @@ reassociate_bb (basic_block bb)
>  		  && TREE_CODE (ops.last ()->op) == INTEGER_CST)
>  		std::swap (*ops[0], *ops[ops_num - 1]);
>
> +	      optimization_type opt_type = bb_optimization_type (bb);
> +
> +	      /* When enabling param_reassoc_max_chain_length_with_fma to
> +		 keep the chain with fma, rank_ops_for_fma will detect if
> +		 the chain has fmas and if so it will rearrange the ops.  */
> +	      if (param_reassoc_max_chain_length_with_fma > 1
> +		  && direct_internal_fn_supported_p (IFN_FMA,
> +						     TREE_TYPE (lhs),
> +						     opt_type)
> +		  && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> +		{
> +		  keep_fma_chain = rank_ops_for_fma(&ops);
> +		}
> +
> +	      int len = ops.length ();
>  	      /* Only rewrite the expression tree to parallel in the
>  		 last reassoc pass to avoid useless work back-and-forth
>  		 with initial linearization.  */

We are doing the parallel rewrite only in the last reassoc pass; I think it
makes sense to do the same for the reassoc-for-FMA rewrite.  Why do the
existing expression rewrites not work after re-sorting the ops?

>  	      if (!reassoc_insert_powi_p
> -		  && ops.length () > 3
> +		  && len > 3
> +		  && (!keep_fma_chain
> +		      || (keep_fma_chain
> +			  && len > param_reassoc_max_chain_length_with_fma))

In the case len < param_reassoc_max_chain_length_with_fma we have the chain
re-sorted but fall through to the non-parallel rewrite.  I wonder if we do
not want to instead adjust the reassociation width?  I'd say it depends on
the number of mult cases in the chain (something the re-sorting could have
computed).  Why do we have two completely independent --params here?  Can
you give an example --param value combination that makes "sense" and show
how it is beneficial?

>  		  && (width = get_reassociation_width (ops_num, rhs_code,
> -						       mode)) > 1)
> +						       mode)) > 1)
>  		{
> -		  if (dump_file && (dump_flags & TDF_DETAILS))
> -		    fprintf (dump_file,
> -			     "Width = %d was chosen for reassociation\n",
> -			     width);
> -		  rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> -					      width, ops);
> +		  if (keep_fma_chain)
> +		    {
> +		      if (dump_file && (dump_flags & TDF_DETAILS))
> +			fprintf (dump_file,
> +				 "Break chain len = %d into width for FMA\n",
> +				 len);
> +		      rewrite_expr_tree_parallel_for_fma
> +			(as_a <gassign *> (stmt), width, ops);
> +		    }
> +		  else
> +		    {
> +		      if (dump_file && (dump_flags & TDF_DETAILS))
> +			fprintf (dump_file,
> +				 "Width = %d was chosen for reassociation\n",
> +				 width);
> +		      rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> +						  width, ops);
> +		    }
>  		}
>  	      else
> -		{
> -		  /* When there are three operands left, we want
> -		     to make sure the ones that get the double
> -		     binary op are chosen wisely.  */
> -		  int len = ops.length ();
> -		  if (len >= 3)
> +		{
> +		  /* When there are three operands left, we want
> +		     to make sure the ones that get the double
> +		     binary op are chosen wisely.  */
> +		  if (len >= 3 && !keep_fma_chain)
>  		    swap_ops_for_binary_stmt (ops, len - 3);
>
>  		  new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
>  					       powi_result != NULL
>  					       || negate_result,
>  					       len != orig_len);
> -		}
> -
> +		}
>  	      /* If we combined some repeated factors into a
>  		 __builtin_powi call, multiply that result by the
>  		 reassociated operands.  */
> --
> 2.25.1
>