On 2019/3/8 6:57 PM, Richard Biener wrote:
> On Fri, Mar 8, 2019 at 2:40 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> As described in PR88497 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497),
>> when we meet a code pattern like:
>>
>>    V1[0] + V1[1] + ... + V1[k] + V2[0] + ... + V2[k] + ... Vn[k]
>>    // V1...Vn of VECTOR_TYPE
>>
>> We can teach reassoc to transform it to:
>>
>>    Vs = (V1 + V2 + ... + Vn)
>>    Vs[0] + Vs[1] + ... + Vs[k]
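>>
>> For instance, with the v2df typedef from the testcase below (just a
>> sketch of the shape, with illustrative names a, b and acc):
>>
>>    v2df t0 = a[0] * b[0];
>>    v2df t1 = a[1] * b[1];
>>    acc += t0[0] + t0[1] + t1[0] + t1[1];
>>
>> becomes roughly:
>>
>>    v2df s = t0 + t1;
>>    acc += s[0] + s[1];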
>>
>> It saves addition and bit_field_ref operations and exposes more
>> opportunities for downstream passes.  I noticed that even on a target
>> that doesn't support the vector type, where the vector type gets
>> expanded in veclower, it's still better to have this transform, since
>> the generated sequence is more friendly for widening_mul.  (With one
>> more DCE run after forwprop, the result should be the same.)
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu, ok for trunk?
> 
> Hmm.  You are basically vectorizing the reduction partially here (note
> basic-block vectorization doesn't handle reductions yet).  So I'm not sure
> whether we should eventually just change op sort order to sort
> v1[1] after v2[0] to sort v1[0] and v2[0] together.  That would maybe also be
> an intermediate step that makes implementing the vectorization part
> "easier"?  I also imagine that the "vectorization" would be profitable even
> if there's just two elements of the vectors summed and the real vectors are
> larger.
> 

Hi Richard,

Sorry, I'm not familiar with basic-block vectorization (SLP?).  Are you
suggesting supporting this there rather than in reassoc, by changing the op
sort order (e.g. putting v1[0] and v2[0] next to each other) to match the
pattern it expects?

> Note that the code you transform contains no vector operations (just
> element extracts) and thus you need to check for target support before
> creating vector add.
> 

I had the same concern before.  But I thought that if this VECTOR type is valid
on the target, the addition operation should be fundamental on the target, so
it's OK; if it's invalid, then veclower will expand it into scalar types along
with the addition operation.  So it looks like it's handled fine either way.
Does that make sense?
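
That said, if an explicit check is still preferred, I guess it could look
something like this (just a sketch on top of the existing optab helpers,
untested and not part of this patch):

  /* Before building the vector additions, verify the target supports a
     vector add for this type (assuming optab_for_tree_code/optab_handler
     are the right helpers to query here).  */
  optab op = optab_for_tree_code (PLUS_EXPR, TREE_TYPE (ref_vec), optab_default);
  if (!op || optab_handler (op, TYPE_MODE (TREE_TYPE (ref_vec))) == CODE_FOR_nothing)
    {
      cleanup_vinfo_map (v_info_map);
      return false;
    }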


> This is for GCC 10.  Also it doesn't fix PR88497, does it?
> 

Yes, it's low priority and meant for GCC 10.  And I think it does fix it; did I miss something?

For the comment 1 and comment 7 cases, trunk GCC generates the expected code
with -Ofast; there isn't any issue with the original loop.  But for the
manually expanded code in comment 1, GCC can't generate the better
multiply-add code.

From the comment 9 case (same as the comment 1 expanded version):

Without this patch, the optimized tree and final asm:

  <bb 2> [local count: 1073741824]:
  _1 = *arg2_26(D);
  _2 = *arg3_27(D);
  _3 = _1 * _2;
  _4 = BIT_FIELD_REF <_3, 64, 0>;
  _5 = BIT_FIELD_REF <_3, 64, 64>;
  _18 = _4 + _5;
  _7 = MEM[(__vector double *)arg2_26(D) + 16B];
  _8 = MEM[(__vector double *)arg3_27(D) + 16B];
  _9 = _7 * _8;
  _10 = BIT_FIELD_REF <_9, 64, 0>;
  _11 = BIT_FIELD_REF <_9, 64, 64>;
  _31 = _10 + _11;
  _29 = _18 + _31;
  _13 = MEM[(__vector double *)arg2_26(D) + 32B];
  _14 = MEM[(__vector double *)arg3_27(D) + 32B];
  _15 = _13 * _14;
  _16 = BIT_FIELD_REF <_15, 64, 0>;
  _17 = BIT_FIELD_REF <_15, 64, 64>;
  _6 = _16 + _17;
  _12 = _6 + accumulator_28(D);
  _37 = _12 + _29;
  _19 = MEM[(__vector double *)arg2_26(D) + 48B];
  _20 = MEM[(__vector double *)arg3_27(D) + 48B];
  _21 = _19 * _20;
  _22 = BIT_FIELD_REF <_21, 64, 0>;
  _23 = BIT_FIELD_REF <_21, 64, 64>;
  _24 = _22 + _23;
  accumulator_32 = _24 + _37;
  return accumulator_32;

0000000000000000 <foo>:
   0:   01 00 e5 f4     lxv     vs7,0(r5)
   4:   11 00 05 f4     lxv     vs0,16(r5)
   8:   21 00 24 f5     lxv     vs9,32(r4)
   c:   21 00 c5 f4     lxv     vs6,32(r5)
  10:   01 00 44 f5     lxv     vs10,0(r4)
  14:   11 00 64 f5     lxv     vs11,16(r4)
  18:   31 00 05 f5     lxv     vs8,48(r5)
  1c:   31 00 84 f5     lxv     vs12,48(r4)
  20:   80 33 29 f1     xvmuldp vs9,vs9,vs6
  24:   80 3b 4a f1     xvmuldp vs10,vs10,vs7
  28:   80 03 6b f1     xvmuldp vs11,vs11,vs0
  2c:   80 43 8c f1     xvmuldp vs12,vs12,vs8
  30:   50 4b 09 f1     xxspltd vs8,vs9,1
  34:   50 5b eb f0     xxspltd vs7,vs11,1
  38:   50 53 0a f0     xxspltd vs0,vs10,1
  3c:   2a 48 28 fd     fadd    f9,f8,f9
  40:   2a 58 67 fd     fadd    f11,f7,f11
  44:   2a 50 00 fc     fadd    f0,f0,f10
  48:   50 63 4c f1     xxspltd vs10,vs12,1
  4c:   2a 60 8a fd     fadd    f12,f10,f12
  50:   2a 08 29 fd     fadd    f9,f9,f1
  54:   2a 58 00 fc     fadd    f0,f0,f11
  58:   2a 48 20 fc     fadd    f1,f0,f9
  5c:   2a 08 2c fc     fadd    f1,f12,f1
  60:   20 00 80 4e     blr


With this patch, the optimized tree and final asm:

  _1 = *arg2_26(D);
  _2 = *arg3_27(D);
  _7 = MEM[(__vector double *)arg2_26(D) + 16B];
  _8 = MEM[(__vector double *)arg3_27(D) + 16B];
  _9 = _7 * _8;
  _5 = .FMA (_1, _2, _9);
  _13 = MEM[(__vector double *)arg2_26(D) + 32B];
  _14 = MEM[(__vector double *)arg3_27(D) + 32B];
  _19 = MEM[(__vector double *)arg2_26(D) + 48B];
  _20 = MEM[(__vector double *)arg3_27(D) + 48B];
  _21 = _19 * _20;
  _10 = .FMA (_13, _14, _21);
  _41 = _5 + _10;
  _43 = BIT_FIELD_REF <_41, 64, 64>;
  _42 = BIT_FIELD_REF <_41, 64, 0>;
  _44 = _42 + _43;
  accumulator_32 = accumulator_28(D) + _44;
  return accumulator_32;

0000000000000000 <foo>:
   0:   11 00 04 f4     lxv     vs0,16(r4)
   4:   11 00 c5 f4     lxv     vs6,16(r5)
   8:   31 00 44 f5     lxv     vs10,48(r4)
   c:   31 00 e5 f4     lxv     vs7,48(r5)
  10:   01 00 84 f5     lxv     vs12,0(r4)
  14:   01 00 05 f5     lxv     vs8,0(r5)
  18:   21 00 64 f5     lxv     vs11,32(r4)
  1c:   21 00 25 f5     lxv     vs9,32(r5)
  20:   80 33 00 f0     xvmuldp vs0,vs0,vs6
  24:   80 3b 4a f1     xvmuldp vs10,vs10,vs7
  28:   08 43 0c f0     xvmaddadp vs0,vs12,vs8
  2c:   90 54 8a f1     xxlor   vs12,vs10,vs10
  30:   08 4b 8b f1     xvmaddadp vs12,vs11,vs9
  34:   00 63 00 f0     xvadddp vs0,vs0,vs12
  38:   50 03 80 f1     xxspltd vs12,vs0,1
  3c:   2a 00 0c fc     fadd    f0,f12,f0
  40:   2a 08 20 fc     fadd    f1,f0,f1
  44:   20 00 80 4e     blr

> Richard.
> 
>> Thanks in advance!
>>
>>
>> gcc/ChangeLog
>>
>> 2019-03-08  Kewen Lin  <li...@gcc.gnu.org>
>>
>>         PR target/88497
>>         * tree-ssa-reassoc.c (reassociate_bb): Swap the positions of
>>         GIMPLE_BINARY_RHS check and gimple_visited_p check, call new
>>         function undistribute_bitref_for_vector.
>>         (undistribute_bitref_for_vector): New function.
>>         (cleanup_vinfo_map): Likewise.
>>         (unsigned_cmp): Likewise.
>>
>> gcc/testsuite/ChangeLog
>>
>> 2019-03-08  Kewen Lin  <li...@gcc.gnu.org>
>>
>>         * gcc.dg/tree-ssa/pr88497.c: New test.
>>
>> ---
>>  gcc/testsuite/gcc.dg/tree-ssa/pr88497.c |  18 +++
>>  gcc/tree-ssa-reassoc.c                  | 274 +++++++++++++++++++++++++++++++-
>>  2 files changed, 287 insertions(+), 5 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
>>
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c b/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
>> new file mode 100644
>> index 0000000..4d9ac67
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
>> @@ -0,0 +1,18 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -ffast-math -fdump-tree-reassoc1" } */
>> +typedef double v2df __attribute__ ((vector_size (16)));
>> +double
>> +test (double accumulator, v2df arg1[], v2df arg2[])
>> +{
>> +  v2df temp;
>> +  temp = arg1[0] * arg2[0];
>> +  accumulator += temp[0] + temp[1];
>> +  temp = arg1[1] * arg2[1];
>> +  accumulator += temp[0] + temp[1];
>> +  temp = arg1[2] * arg2[2];
>> +  accumulator += temp[0] + temp[1];
>> +  temp = arg1[3] * arg2[3];
>> +  accumulator += temp[0] + temp[1];
>> +  return accumulator;
>> +}
>> +/* { dg-final { scan-tree-dump-times "BIT_FIELD_REF" 2 "reassoc1" } } */
>> diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
>> index e1c4dfe..fc0e297 100644
>> --- a/gcc/tree-ssa-reassoc.c
>> +++ b/gcc/tree-ssa-reassoc.c
>> @@ -1772,6 +1772,263 @@ undistribute_ops_list (enum tree_code opcode,
>>    return changed;
>>  }
>>
>> +/* Hold the information for one specific VECTOR_TYPE SSA_NAME.
>> +    - offsets: the different BIT_FIELD_REF offsets accessing the same VECTOR.
>> +    - ops_indexes: the index into vec ops* for each relevant BIT_FIELD_REF.  */
>> +struct v_info
>> +{
>> +  auto_vec<unsigned HOST_WIDE_INT, 32> offsets;
>> +  auto_vec<unsigned, 32> ops_indexes;
>> +};
>> +
>> +typedef struct v_info *v_info_ptr;
>> +
>> +/* Comparison function for qsort on unsigned BIT_FIELD_REF offsets.  */
>> +static int
>> +unsigned_cmp (const void *p_i, const void *p_j)
>> +{
>> +  if (*(const unsigned *) p_i >= *(const unsigned *) p_j)
>> +    return 1;
>> +  else
>> +    return -1;
>> +}
>> +
>> +/* Cleanup hash map for VECTOR information.  */
>> +static void
>> +cleanup_vinfo_map (hash_map<tree, v_info_ptr> &info_map)
>> +{
>> +  for (hash_map<tree, v_info_ptr>::iterator it = info_map.begin ();
>> +       it != info_map.end (); ++it)
>> +    {
>> +      v_info_ptr info = (*it).second;
>> +      delete info;
>> +      (*it).second = NULL;
>> +    }
>> +}
>> +
>> +/* Perform un-distribution of BIT_FIELD_REF on VECTOR_TYPE.
>> +     V1[0] + V1[1] + ... + V1[k] + V2[0] + V2[1] + ... + V2[k] + ... Vn[k]
>> +   is transformed to
>> +     Vs = (V1 + V2 + ... + Vn)
>> +     Vs[0] + Vs[1] + ... + Vs[k]
>> +
>> +   The basic steps are listed below:
>> +
>> +    1) Check the addition chain *OPS, looking for summands that come from
>> +       a BIT_FIELD_REF on a VECTOR type.  Put the information into
>> +       v_info_map for each qualifying summand, using the VECTOR SSA_NAME as key.
>> +
>> +    2) For each key (VECTOR SSA_NAME), validate that all its BIT_FIELD_REFs are
>> +       contiguous and cover the whole VECTOR perfectly without any holes.
>> +       Obtain one VECTOR list which contains the candidates to be transformed.
>> +
>> +    3) Build the addition statements for all VECTOR candidates, generate
>> +       BIT_FIELD_REFs accordingly.
>> +
>> +   TODO: Currently the implementation restricts all candidate VECTORs to the
>> +   same VECTOR type; it can be extended to group them by VECTOR type in the
>> +   future if any profitable cases are found.  */
>> +static bool
>> +undistribute_bitref_for_vector (enum tree_code opcode, vec<operand_entry *> *ops,
>> +                                struct loop *loop)
>> +{
>> +  if (ops->length () <= 1 || opcode != PLUS_EXPR)
>> +    return false;
>> +
>> +  hash_map<tree, v_info_ptr> v_info_map;
>> +  operand_entry *oe1;
>> +  unsigned i;
>> +
>> +  /* Find those summands from VECTOR BIT_FIELD_REF in addition chain, put the
>> +     information into map.  */
>> +  FOR_EACH_VEC_ELT (*ops, i, oe1)
>> +    {
>> +      enum tree_code dcode;
>> +      gimple *oe1def;
>> +
>> +      if (TREE_CODE (oe1->op) != SSA_NAME)
>> +       continue;
>> +      oe1def = SSA_NAME_DEF_STMT (oe1->op);
>> +      if (!is_gimple_assign (oe1def))
>> +       continue;
>> +      dcode = gimple_assign_rhs_code (oe1def);
>> +      if (dcode != BIT_FIELD_REF || !is_reassociable_op (oe1def, dcode, loop))
>> +       continue;
>> +
>> +      tree rhs = gimple_op (oe1def, 1);
>> +      tree op0 = TREE_OPERAND (rhs, 0);
>> +
>> +      if (TREE_CODE (op0) != SSA_NAME
>> +         || TREE_CODE (TREE_TYPE (op0)) != VECTOR_TYPE)
>> +       continue;
>> +
>> +      tree op1 = TREE_OPERAND (rhs, 1);
>> +      tree op2 = TREE_OPERAND (rhs, 2);
>> +
>> +      tree elem_type = TREE_TYPE (TREE_TYPE (op0));
>> +      unsigned HOST_WIDE_INT size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type));
>> +      if (size != TREE_INT_CST_LOW (op1))
>> +       continue;
>> +
>> +      v_info_ptr *info_ptr = v_info_map.get (op0);
>> +      if (info_ptr)
>> +       {
>> +         v_info_ptr info = *info_ptr;
>> +         info->offsets.safe_push (TREE_INT_CST_LOW (op2));
>> +         info->ops_indexes.safe_push (i);
>> +       }
>> +      else
>> +       {
>> +         v_info_ptr info = new v_info;
>> +         info->offsets.safe_push (TREE_INT_CST_LOW (op2));
>> +         info->ops_indexes.safe_push (i);
>> +         v_info_map.put (op0, info);
>> +       }
>> +    }
>> +
>> +  /* Need at least two VECTORs to combine.  */
>> +  if (v_info_map.elements () <= 1)
>> +    {
>> +      cleanup_vinfo_map (v_info_map);
>> +      return false;
>> +    }
>> +
>> +  /* Use the first VECTOR and its information as the reference.
>> +     Firstly, we should validate it, that is:
>> +       1) sorted offsets are adjacent, no holes.
>> +       2) can fill the whole VECTOR perfectly.  */
>> +  hash_map<tree, v_info_ptr>::iterator it = v_info_map.begin ();
>> +  tree ref_vec = (*it).first;
>> +  v_info_ptr ref_info = (*it).second;
>> +  ref_info->offsets.qsort (unsigned_cmp);
>> +  tree elem_type = TREE_TYPE (TREE_TYPE (ref_vec));
>> +  unsigned HOST_WIDE_INT elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type));
>> +  unsigned HOST_WIDE_INT curr;
>> +  unsigned HOST_WIDE_INT prev = ref_info->offsets[0];
>> +
>> +  FOR_EACH_VEC_ELT_FROM (ref_info->offsets, i, curr, 1)
>> +    {
>> +      /* Check that the sorted offsets are contiguous.  */
>> +      if (curr != (prev + elem_size))
>> +       {
>> +         cleanup_vinfo_map (v_info_map);
>> +         return false;
>> +       }
>> +      prev = curr;
>> +    }
>> +
>> +  /* Check whether the offsets fill the whole VECTOR.  */
>> +  if ((prev + elem_size) != TREE_INT_CST_LOW (TYPE_SIZE (TREE_TYPE (ref_vec))))
>> +    {
>> +      cleanup_vinfo_map (v_info_map);
>> +      return false;
>> +    }
>> +
>> +  auto_vec<tree> vectors (v_info_map.elements ());
>> +  vectors.quick_push (ref_vec);
>> +
>> +  /* Use the ref_vec to filter others.  */
>> +  for (++it; it != v_info_map.end (); ++it)
>> +    {
>> +      tree vec = (*it).first;
>> +      v_info_ptr info = (*it).second;
>> +      if (TREE_TYPE (ref_vec) != TREE_TYPE (vec))
>> +       continue;
>> +      if (ref_info->offsets.length () != info->offsets.length ())
>> +       continue;
>> +      bool same_offset = true;
>> +      info->offsets.qsort (unsigned_cmp);
>> +      for (unsigned i = 0; i < ref_info->offsets.length (); i++)
>> +       {
>> +         if (ref_info->offsets[i] != info->offsets[i])
>> +           {
>> +             same_offset = false;
>> +             break;
>> +           }
>> +       }
>> +      if (!same_offset)
>> +       continue;
>> +      vectors.quick_push (vec);
>> +    }
>> +
>> +  if (vectors.length () < 2)
>> +    {
>> +      cleanup_vinfo_map (v_info_map);
>> +      return false;
>> +    }
>> +
>> +  tree tr;
>> +  if (dump_file && (dump_flags & TDF_DETAILS))
>> +    {
>> +      fprintf (dump_file, "The bit_field_ref vector list for summation: ");
>> +      FOR_EACH_VEC_ELT (vectors, i, tr)
>> +       {
>> +         print_generic_expr (dump_file, tr);
>> +         fprintf (dump_file, "  ");
>> +       }
>> +      fprintf (dump_file, "\n");
>> +    }
>> +
>> +  /* Build the sum for all candidate VECTORs.  */
>> +  unsigned idx;
>> +  gimple *sum = NULL;
>> +  v_info_ptr info;
>> +  tree sum_vec = ref_vec;
>> +  FOR_EACH_VEC_ELT_FROM (vectors, i, tr, 1)
>> +    {
>> +      sum = build_and_add_sum (TREE_TYPE (ref_vec), sum_vec, tr, opcode);
>> +      info = *(v_info_map.get (tr));
>> +      unsigned j;
>> +      FOR_EACH_VEC_ELT (info->ops_indexes, j, idx)
>> +       {
>> +         gimple *def = SSA_NAME_DEF_STMT ((*ops)[idx]->op);
>> +         gimple_set_visited (def, true);
>> +         (*ops)[idx]->op = build_zero_cst (TREE_TYPE ((*ops)[idx]->op));
>> +         (*ops)[idx]->rank = 0;
>> +       }
>> +      sum_vec = gimple_get_lhs (sum);
>> +      if (dump_file && (dump_flags & TDF_DETAILS))
>> +       {
>> +         fprintf (dump_file, "Generating addition -> ");
>> +         print_gimple_stmt (dump_file, sum, 0);
>> +       }
>> +    }
>> +
>> +  /* Referring to any vector with the validated shape (here using ref_vec),
>> +     generate the BIT_FIELD_REF statements accordingly.  */
>> +  info = *(v_info_map.get (ref_vec));
>> +  gcc_assert (sum);
>> +  FOR_EACH_VEC_ELT (info->ops_indexes, i, idx)
>> +    {
>> +      tree dst = make_ssa_name (elem_type);
>> +      gimple *gs
>> +       = gimple_build_assign (dst, BIT_FIELD_REF,
>> +                              build3 (BIT_FIELD_REF, elem_type, sum_vec,
>> +                                      TYPE_SIZE (elem_type),
>> +                                      bitsize_int (info->offsets[i])));
>> +      insert_stmt_after (gs, sum);
>> +      update_stmt (gs);
>> +      gimple *def = SSA_NAME_DEF_STMT ((*ops)[idx]->op);
>> +      gimple_set_visited (def, true);
>> +      (*ops)[idx]->op = gimple_assign_lhs (gs);
>> +      (*ops)[idx]->rank = get_rank ((*ops)[idx]->op);
>> +      if (dump_file && (dump_flags & TDF_DETAILS))
>> +       {
>> +         fprintf (dump_file, "Generating bit_field_ref -> ");
>> +         print_gimple_stmt (dump_file, gs, 0);
>> +       }
>> +    }
>> +
>> +  if (dump_file && (dump_flags & TDF_DETAILS))
>> +    {
>> +      fprintf (dump_file, "undistributiong bit_field_ref for vector 
>> done.\n");
>> +    }
>> +
>> +  cleanup_vinfo_map (v_info_map);
>> +
>> +  return true;
>> +}
>> +
>>  /* If OPCODE is BIT_IOR_EXPR or BIT_AND_EXPR and CURR is a comparison
>>     expression, examine the other OPS to see if any of them are comparisons
>>     of the same values, which we may be able to combine or eliminate.
>> @@ -5880,11 +6137,6 @@ reassociate_bb (basic_block bb)
>>           tree lhs, rhs1, rhs2;
>>           enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
>>
>> -         /* If this is not a gimple binary expression, there is
>> -            nothing for us to do with it.  */
>> -         if (get_gimple_rhs_class (rhs_code) != GIMPLE_BINARY_RHS)
>> -           continue;
>> -
>>           /* If this was part of an already processed statement,
>>              we don't need to touch it again. */
>>           if (gimple_visited_p (stmt))
>> @@ -5911,6 +6163,11 @@ reassociate_bb (basic_block bb)
>>               continue;
>>             }
>>
>> +         /* If this is not a gimple binary expression, there is
>> +            nothing for us to do with it.  */
>> +         if (get_gimple_rhs_class (rhs_code) != GIMPLE_BINARY_RHS)
>> +           continue;
>> +
>>           lhs = gimple_assign_lhs (stmt);
>>           rhs1 = gimple_assign_rhs1 (stmt);
>>           rhs2 = gimple_assign_rhs2 (stmt);
>> @@ -5950,6 +6207,13 @@ reassociate_bb (basic_block bb)
>>                   optimize_ops_list (rhs_code, &ops);
>>                 }
>>
>> +             if (undistribute_bitref_for_vector (rhs_code, &ops,
>> +                                                 loop_containing_stmt (stmt)))
>> +               {
>> +                 ops.qsort (sort_by_operand_rank);
>> +                 optimize_ops_list (rhs_code, &ops);
>> +               }
>> +
>>               if (rhs_code == PLUS_EXPR
>>                   && transform_add_to_multiply (&ops))
>>                 ops.qsort (sort_by_operand_rank);
>> --
>> 2.7.4
>>
> 
