On Tue, Jan 9, 2018 at 4:36 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> Ping

Ok.

Richard.

> Richard Sandiford <richard.sandif...@linaro.org> writes:
>> Richard Biener <richard.guent...@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 1:54 PM, Richard Sandiford
>>> <richard.sandif...@linaro.org> wrote:
>>>> Richard Biener <richard.guent...@gmail.com> writes:
>>>>> On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
>>>>> <richard.sandif...@linaro.org> wrote:
>>>>>> This patch adds support for in-order floating-point addition reductions,
>>>>>> which are suitable even in strict IEEE mode.
>>>>>>
>>>>>> Previously vect_is_simple_reduction would reject any cases that forbid
>>>>>> reassociation.  The idea is instead to tentatively accept them as
>>>>>> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
>>>>>> support for them.  Although this patch only handles the particular
>>>>>> case of plus and minus on floating-point types, there's no reason in
>>>>>> principle why targets couldn't handle other cases.
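>>>>>>
>>>>>> As a concrete illustration (an example of mine, not taken from the
>>>>>> patch), the canonical loop this enables is a strict-IEEE accumulation:
>>>>>>
>>>>>>   double
>>>>>>   sum (const double *a, int n)
>>>>>>   {
>>>>>>     double res = 0.0;
>>>>>>     for (int i = 0; i < n; ++i)
>>>>>>       /* Without -ffast-math these additions must stay in source
>>>>>>          order, since FP addition does not reassociate; previously
>>>>>>          vect_is_simple_reduction rejected such loops outright.  */
>>>>>>       res += a[i];
>>>>>>     return res;
>>>>>>   }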
>>>>>>
>>>>>> The vect_force_simple_reduction change makes it simpler for parloops
>>>>>> to read the type of reduction.
>>>>>>
>>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>>
>>>>> I don't like that you add a new tree code for this.  A new IFN looks more
>>>>> suitable to me.
>>>>
>>>> OK.
>>>
>>> Thanks.  I'd like to eventually get rid of other vectorizer tree codes as 
>>> well,
>>> like the REDUC_*_EXPR, DOT_PROD_EXPR and SAD_EXPR.  IFNs
>>> are now really the way to go for "target instructions on GIMPLE".
>>>
>>>>> Also I think if there's a way to handle this correctly with target support
>>>>> you can also implement a fallback if there is no such support, increasing
>>>>> test coverage.  It would basically boil down to extracting all scalars 
>>>>> from
>>>>> the non-reduction operand vector and performing a series of reduction
>>>>> ops, keeping the reduction PHI scalar.  This would also support any
>>>>> reduction operator.
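>>>>>
>>>>> As a scalar sketch of that fallback (illustrative only, assuming a
>>>>> V4DF operand; this is hand-written, not generated code):
>>>>>
>>>>>   /* RES is the scalar carried by the reduction PHI; V holds the
>>>>>      elements extracted from one vector of the non-reduction
>>>>>      operand.  Apply the reduction op to each element in order.  */
>>>>>   static double
>>>>>   fold_v4df (double res, const double v[4])
>>>>>   {
>>>>>     for (int i = 0; i < 4; ++i)
>>>>>       res = res + v[i];
>>>>>     return res;
>>>>>   }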
>>>>
>>>> Yeah, but without target support, that's probably going to be expensive.
>>>> It's a bit like how we can implement element-by-element loads and stores
>>>> for cases that don't have target support, but we had to explicitly disable
>>>> that in many cases, since the cost model was too optimistic.
>>>
>>> I expect that for V2DF or even V4DF it might be profitable in quite a number
>>> of cases.  V2DF definitely.
>>>
>>>> I can give it a go anyway if you think it's worth it.
>>>
>>> I think it is.
>>
>> OK, done in the patch below.  Tested as before.
>>
>> Thanks,
>> Richard
>
> 2017-11-21  Richard Sandiford  <richard.sandif...@linaro.org>
>             Alan Hayward  <alan.hayw...@arm.com>
>             David Sherwood  <david.sherw...@arm.com>
>
> gcc/
>         * optabs.def (fold_left_plus_optab): New optab.
>         * doc/md.texi (fold_left_plus_@var{m}): Document.
>         * internal-fn.def (IFN_FOLD_LEFT_PLUS): New internal function.
>         * internal-fn.c (fold_left_direct): Define.
>         (expand_fold_left_optab_fn): Likewise.
>         (direct_fold_left_optab_supported_p): Likewise.
>         * fold-const-call.c (fold_const_fold_left): New function.
>         (fold_const_call): Use it to fold CFN_FOLD_LEFT_PLUS.
>         * tree-parloops.c (valid_reduction_p): New function.
>         (gather_scalar_reductions): Use it.
>         * tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
>         (vect_finish_replace_stmt): Declare.
>         * tree-vect-loop.c (fold_left_reduction_code): New function.
>         (needs_fold_left_reduction_p): New function, split out from...
>         (vect_is_simple_reduction): ...here.  Accept reductions that
>         forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
>         (vect_force_simple_reduction): Also store the reduction type in
>         the assignment's STMT_VINFO_REDUC_TYPE.
>         (vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
>         (merge_with_identity): New function.
>         (vectorize_fold_left_reduction): Likewise.
>         (vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
>         scalar phi in place for it.  Check for target support and reject
>         cases that would reassociate the operation.  Defer the transform
>         phase to vectorize_fold_left_reduction.
>         * config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
>         * config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
>         (*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.
>
> gcc/testsuite/
>         * gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass and
>         check for a message about using in-order reductions.
>         * gcc.dg/vect/pr79920.c: Expect both loops to be vectorized and
>         check for a message about using in-order reductions.
>         * gcc.dg/vect/trapv-vect-reduc-4.c: Expect all three loops to be
>         vectorized and check for a message about using in-order reductions.
>         * gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized and
>         check for a message about using in-order reductions.
>         * gcc.dg/vect/vect-reduc-in-order-1.c: New test.
>         * gcc.dg/vect/vect-reduc-in-order-2.c: Likewise.
>         * gcc.dg/vect/vect-reduc-in-order-3.c: Likewise.
>         * gcc.dg/vect/vect-reduc-in-order-4.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_1.c: New test.
>         * gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
>         * gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
>         * gfortran.dg/vect/vect-8.f90: Expect 22 loops to be vectorized if
>         vect_fold_left_plus.
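>
> For reference, the semantics that the new fold_left_plus_<m> optab is
> documented to have can be modelled in scalar code roughly as follows
> (an illustrative sketch, not part of the patch; the function name and
> the NUNITS parameter are made up for the example):
>
>   /* Operand 0 = in-order sum of scalar operand 1 with each element of
>      vector operand 2, with no reassociation.  */
>   double
>   fold_left_plus (double acc, const double *vec, int nunits)
>   {
>     for (int i = 0; i < nunits; ++i)
>       acc += vec[i];
>     return acc;
>   }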
>
> Index: gcc/optabs.def
> ===================================================================
> --- gcc/optabs.def      2017-11-21 17:06:24.670434749 +0000
> +++ gcc/optabs.def      2017-11-21 17:06:25.015421374 +0000
> @@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u
>  OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
>  OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
>  OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
> +OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
>
>  OPTAB_D (extract_last_optab, "extract_last_$a")
>  OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
> Index: gcc/doc/md.texi
> ===================================================================
> --- gcc/doc/md.texi     2017-11-21 17:06:24.670434749 +0000
> +++ gcc/doc/md.texi     2017-11-21 17:06:25.014421412 +0000
> @@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha
>  one element of @var{m}.  Operand 2 has the usual mask mode for vectors
>  of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
>
> +@cindex @code{fold_left_plus_@var{m}} instruction pattern
> +@item @code{fold_left_plus_@var{m}}
> +Take scalar operand 1 and successively add each element from vector
> +operand 2.  Store the result in scalar operand 0.  The vector has
> +mode @var{m} and the scalars have the mode appropriate for one
> +element of @var{m}.  The operation is strictly in-order: there is
> +no reassociation.
> +
>  @cindex @code{sdot_prod@var{m}} instruction pattern
>  @item @samp{sdot_prod@var{m}}
>  @cindex @code{udot_prod@var{m}} instruction pattern
> Index: gcc/internal-fn.def
> ===================================================================
> --- gcc/internal-fn.def 2017-11-21 17:06:24.670434749 +0000
> +++ gcc/internal-fn.def 2017-11-21 17:06:25.015421374 +0000
> @@ -59,6 +59,8 @@ along with GCC; see the file COPYING3.
>
>     - cond_binary: a conditional binary optab, such as add<mode>cc
>
> +   - fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
> +
>     DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
>     maps to one of two optabs, depending on the signedness of an input.
>     SIGNED_OPTAB and UNSIGNED_OPTAB are the optabs for signed and
> @@ -177,6 +179,8 @@ DEF_INTERNAL_OPTAB_FN (EXTRACT_LAST, ECF
>  DEF_INTERNAL_OPTAB_FN (FOLD_EXTRACT_LAST, ECF_CONST | ECF_NOTHROW,
>                        fold_extract_last, fold_extract)
>
> +DEF_INTERNAL_OPTAB_FN (FOLD_LEFT_PLUS, ECF_CONST | ECF_NOTHROW,
> +                      fold_left_plus, fold_left)
>
>  /* Unary math functions.  */
>  DEF_INTERNAL_FLT_FN (ACOS, ECF_CONST, acos, unary)
> Index: gcc/internal-fn.c
> ===================================================================
> --- gcc/internal-fn.c   2017-11-21 17:06:24.670434749 +0000
> +++ gcc/internal-fn.c   2017-11-21 17:06:25.015421374 +0000
> @@ -92,6 +92,7 @@ #define cond_unary_direct { 1, 1, true }
>  #define cond_binary_direct { 1, 1, true }
>  #define while_direct { 0, 2, false }
>  #define fold_extract_direct { 2, 2, false }
> +#define fold_left_direct { 1, 1, false }
>
>  const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
>  #define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
> @@ -2839,6 +2840,9 @@ #define expand_cond_binary_optab_fn(FN,
>  #define expand_fold_extract_optab_fn(FN, STMT, OPTAB) \
>    expand_direct_optab_fn (FN, STMT, OPTAB, 3)
>
> +#define expand_fold_left_optab_fn(FN, STMT, OPTAB) \
> +  expand_direct_optab_fn (FN, STMT, OPTAB, 2)
> +
>  /* RETURN_TYPE and ARGS are a return type and argument list that are
>     in principle compatible with FN (which satisfies direct_internal_fn_p).
>     Return the types that should be used to determine whether the
> @@ -2922,6 +2926,7 @@ #define direct_store_lanes_optab_support
>  #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_while_optab_supported_p convert_optab_supported_p
>  #define direct_fold_extract_optab_supported_p direct_optab_supported_p
> +#define direct_fold_left_optab_supported_p direct_optab_supported_p
>
>  /* Return the optab used by internal function FN.  */
>
> Index: gcc/fold-const-call.c
> ===================================================================
> --- gcc/fold-const-call.c       2017-11-21 17:06:24.670434749 +0000
> +++ gcc/fold-const-call.c       2017-11-21 17:06:25.014421412 +0000
> @@ -1190,6 +1190,25 @@ fold_const_call (combined_fn fn, tree ty
>      }
>  }
>
> +/* Fold a call to IFN_FOLD_LEFT_<CODE> (ARG0, ARG1), returning a value
> +   of type TYPE.  */
> +
> +static tree
> +fold_const_fold_left (tree type, tree arg0, tree arg1, tree_code code)
> +{
> +  if (TREE_CODE (arg1) != VECTOR_CST)
> +    return NULL_TREE;
> +
> +  unsigned int nelts = VECTOR_CST_NELTS (arg1);
> +  for (unsigned int i = 0; i < nelts; i++)
> +    {
> +      arg0 = const_binop (code, type, arg0, VECTOR_CST_ELT (arg1, i));
> +      if (arg0 == NULL_TREE || !CONSTANT_CLASS_P (arg0))
> +       return NULL_TREE;
> +    }
> +  return arg0;
> +}
> +
>  /* Try to evaluate:
>
>        *RESULT = FN (*ARG0, *ARG1)
> @@ -1495,6 +1514,9 @@ fold_const_call (combined_fn fn, tree ty
>         }
>        return NULL_TREE;
>
> +    case CFN_FOLD_LEFT_PLUS:
> +      return fold_const_fold_left (type, arg0, arg1, PLUS_EXPR);
> +
>      default:
>        return fold_const_call_1 (fn, type, arg0, arg1);
>      }
> Index: gcc/tree-parloops.c
> ===================================================================
> --- gcc/tree-parloops.c 2017-11-21 17:06:24.670434749 +0000
> +++ gcc/tree-parloops.c 2017-11-21 17:06:25.017421296 +0000
> @@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo
>    return 1;
>  }
>
> +/* Return true if the type of reduction performed by STMT is suitable
> +   for this pass.  */
> +
> +static bool
> +valid_reduction_p (gimple *stmt)
> +{
> +  /* Parallelization would reassociate the operation, which isn't
> +     allowed for in-order reductions.  */
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
> +  return reduc_type != FOLD_LEFT_REDUCTION;
> +}
> +
>  /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST.  */
>
>  static void
> @@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r
>        gimple *reduc_stmt
>         = vect_force_simple_reduction (simple_loop_info, phi,
>                                        &double_reduc, true);
> -      if (!reduc_stmt)
> +      if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
>         continue;
>
>        if (double_reduc)
> @@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r
>                 = vect_force_simple_reduction (simple_loop_info, inner_phi,
>                                                &double_reduc, true);
>               gcc_assert (!double_reduc);
> -             if (inner_reduc_stmt == NULL)
> +             if (inner_reduc_stmt == NULL
> +                 || !valid_reduction_p (inner_reduc_stmt))
>                 continue;
>
>               build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
> Index: gcc/tree-vectorizer.h
> ===================================================================
> --- gcc/tree-vectorizer.h       2017-11-21 17:06:24.670434749 +0000
> +++ gcc/tree-vectorizer.h       2017-11-21 17:06:25.018421257 +0000
> @@ -74,7 +74,15 @@ enum vect_reduction_type {
>
>         for (int i = 0; i < VF; ++i)
>           res = cond[i] ? val[i] : res;  */
> -  EXTRACT_LAST_REDUCTION
> +  EXTRACT_LAST_REDUCTION,
> +
> +  /* Use a folding reduction within the loop to implement:
> +
> +       for (int i = 0; i < VF; ++i)
> +         res = res OP val[i];
> +
> +     (with no reassociation).  */
> +  FOLD_LEFT_REDUCTION
>  };
>
>  #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def)           \
> @@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v
>  extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
>                                   enum vect_cost_for_stmt, stmt_vec_info,
>                                   int, enum vect_cost_model_location);
> +extern void vect_finish_replace_stmt (gimple *, gimple *);
>  extern void vect_finish_stmt_generation (gimple *, gimple *,
>                                           gimple_stmt_iterator *);
>  extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2017-11-21 17:06:24.670434749 +0000
> +++ gcc/tree-vect-loop.c        2017-11-21 17:06:25.018421257 +0000
> @@ -2573,6 +2573,22 @@ vect_analyze_loop (struct loop *loop, lo
>      }
>  }
>
> +/* Return true if there is an in-order reduction function for CODE, storing
> +   it in *REDUC_FN if so.  */
> +
> +static bool
> +fold_left_reduction_fn (tree_code code, internal_fn *reduc_fn)
> +{
> +  switch (code)
> +    {
> +    case PLUS_EXPR:
> +      *reduc_fn = IFN_FOLD_LEFT_PLUS;
> +      return true;
> +
> +    default:
> +      return false;
> +    }
> +}
>
>  /* Function reduction_fn_for_scalar_code
>
> @@ -2879,6 +2895,42 @@ vect_is_slp_reduction (loop_vec_info loo
>    return true;
>  }
>
> +/* Return true if we need an in-order reduction for operation CODE
> +   on type TYPE.  NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer
> +   overflow must wrap.  */
> +
> +static bool
> +needs_fold_left_reduction_p (tree type, tree_code code,
> +                            bool need_wrapping_integral_overflow)
> +{
> +  /* CHECKME: check for !flag_finite_math_only too?  */
> +  if (SCALAR_FLOAT_TYPE_P (type))
> +    switch (code)
> +      {
> +      case MIN_EXPR:
> +      case MAX_EXPR:
> +       return false;
> +
> +      default:
> +       return !flag_associative_math;
> +      }
> +
> +  if (INTEGRAL_TYPE_P (type))
> +    {
> +      if (!operation_no_trapping_overflow (type, code))
> +       return true;
> +      if (need_wrapping_integral_overflow
> +         && !TYPE_OVERFLOW_WRAPS (type)
> +         && operation_can_overflow (code))
> +       return true;
> +      return false;
> +    }
> +
> +  if (SAT_FIXED_POINT_TYPE_P (type))
> +    return true;
> +
> +  return false;
> +}
>
>  /* Function vect_is_simple_reduction
>
> @@ -3197,58 +3249,18 @@ vect_is_simple_reduction (loop_vec_info
>        return NULL;
>      }
>
> -  /* Check that it's ok to change the order of the computation.
> +  /* Check whether it's ok to change the order of the computation.
>       Generally, when vectorizing a reduction we change the order of the
>       computation.  This may change the behavior of the program in some
>       cases, so we need to check that this is ok.  One exception is when
>       vectorizing an outer-loop: the inner-loop is executed sequentially,
>       and therefore vectorizing reductions in the inner-loop during
>       outer-loop vectorization is safe.  */
> -
> -  if (*v_reduc_type != COND_REDUCTION
> -      && check_reduction)
> -    {
> -      /* CHECKME: check for !flag_finite_math_only too?  */
> -      if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math)
> -       {
> -         /* Changing the order of operations changes the semantics.  */
> -         if (dump_enabled_p ())
> -           report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                       "reduction: unsafe fp math optimization: ");
> -         return NULL;
> -       }
> -      else if (INTEGRAL_TYPE_P (type))
> -       {
> -         if (!operation_no_trapping_overflow (type, code))
> -           {
> -             /* Changing the order of operations changes the semantics.  */
> -             if (dump_enabled_p ())
> -               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                               "reduction: unsafe int math optimization"
> -                               " (overflow traps): ");
> -             return NULL;
> -           }
> -         if (need_wrapping_integral_overflow
> -             && !TYPE_OVERFLOW_WRAPS (type)
> -             && operation_can_overflow (code))
> -           {
> -             /* Changing the order of operations changes the semantics.  */
> -             if (dump_enabled_p ())
> -               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                               "reduction: unsafe int math optimization"
> -                               " (overflow doesn't wrap): ");
> -             return NULL;
> -           }
> -       }
> -      else if (SAT_FIXED_POINT_TYPE_P (type))
> -       {
> -         /* Changing the order of operations changes the semantics.  */
> -         if (dump_enabled_p ())
> -         report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                         "reduction: unsafe fixed-point math optimization: ");
> -         return NULL;
> -       }
> -    }
> +  if (check_reduction
> +      && *v_reduc_type == TREE_CODE_REDUCTION
> +      && needs_fold_left_reduction_p (type, code,
> +                                     need_wrapping_integral_overflow))
> +    *v_reduc_type = FOLD_LEFT_REDUCTION;
>
>    /* Reduction is safe. We're dealing with one of the following:
>       1) integer arithmetic and no trapv
> @@ -3512,6 +3524,7 @@ vect_force_simple_reduction (loop_vec_in
>        STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
>        STMT_VINFO_REDUC_DEF (reduc_def_info) = def;
>        reduc_def_info = vinfo_for_stmt (def);
> +      STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
>        STMT_VINFO_REDUC_DEF (reduc_def_info) = phi;
>      }
>    return def;
> @@ -4064,14 +4077,27 @@ vect_model_reduction_cost (stmt_vec_info
>
>    code = gimple_assign_rhs_code (orig_stmt);
>
> -  if (reduction_type == EXTRACT_LAST_REDUCTION)
> +  if (reduction_type == EXTRACT_LAST_REDUCTION
> +      || reduction_type == FOLD_LEFT_REDUCTION)
>      {
>        /* No extra instructions needed in the prologue.  */
>        prologue_cost = 0;
>
> -      /* Count NCOPIES FOLD_EXTRACT_LAST operations.  */
> -      inside_cost = add_stmt_cost (target_cost_data, ncopies, vec_to_scalar,
> -                                  stmt_info, 0, vect_body);
> +      if (reduction_type == EXTRACT_LAST_REDUCTION || reduc_fn != IFN_LAST)
> +       /* Count one reduction-like operation per vector.  */
> +       inside_cost = add_stmt_cost (target_cost_data, ncopies, vec_to_scalar,
> +                                    stmt_info, 0, vect_body);
> +      else
> +       {
> +         /* Use NELEMENTS extracts and NELEMENTS scalar ops.  */
> +         unsigned int nelements = ncopies * vect_nunits_for_cost (vectype);
> +         inside_cost = add_stmt_cost (target_cost_data,  nelements,
> +                                      vec_to_scalar, stmt_info, 0,
> +                                      vect_body);
> +         inside_cost += add_stmt_cost (target_cost_data,  nelements,
> +                                       scalar_stmt, stmt_info, 0,
> +                                       vect_body);
> +       }
>      }
>    else
>      {
> @@ -4137,7 +4163,8 @@ vect_model_reduction_cost (stmt_vec_info
>                                           scalar_stmt, stmt_info, 0,
>                                           vect_epilogue);
>         }
> -      else if (reduction_type == EXTRACT_LAST_REDUCTION)
> +      else if (reduction_type == EXTRACT_LAST_REDUCTION
> +              || reduction_type == FOLD_LEFT_REDUCTION)
>         /* No extra instructions need in the epilogue.  */
>         ;
>        else
> @@ -5910,6 +5937,160 @@ vect_create_epilog_for_reduction (vec<tr
>      }
>  }
>
> +/* Return a vector of type VECTYPE that is equal to the vector select
> +   operation "MASK ? VEC : IDENTITY".  Insert the select statements
> +   before GSI.  */
> +
> +static tree
> +merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype,
> +                    tree vec, tree identity)
> +{
> +  tree cond = make_temp_ssa_name (vectype, NULL, "cond");
> +  gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR,
> +                                         mask, vec, identity);
> +  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +  return cond;
> +}
> +
> +/* Perform an in-order reduction (FOLD_LEFT_REDUCTION).  STMT is the
> +   statement that sets the live-out value.  REDUC_DEF_STMT is the phi
> +   statement.  CODE is the operation performed by STMT and OPS are
> +   its scalar operands.  REDUC_INDEX is the index of the operand in
> +   OPS that is set by REDUC_DEF_STMT.  REDUC_FN is the function that
> +   implements in-order reduction, or IFN_LAST if we should open-code it.
> +   VECTYPE_IN is the type of the vector input.  MASKS specifies the masks
> +   that should be used to control the operation in a fully-masked loop.  */
> +
> +static bool
> +vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
> +                              gimple **vec_stmt, slp_tree slp_node,
> +                              gimple *reduc_def_stmt,
> +                              tree_code code, internal_fn reduc_fn,
> +                              tree ops[3], tree vectype_in,
> +                              int reduc_index, vec_loop_masks *masks)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
> +  gimple *new_stmt = NULL;
> +
> +  int ncopies;
> +  if (slp_node)
> +    ncopies = 1;
> +  else
> +    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> +
> +  gcc_assert (!nested_in_vect_loop_p (loop, stmt));
> +  gcc_assert (ncopies == 1);
> +  gcc_assert (TREE_CODE_LENGTH (code) == binary_op);
> +  gcc_assert (reduc_index == (code == MINUS_EXPR ? 0 : 1));
> +  gcc_assert (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
> +             == FOLD_LEFT_REDUCTION);
> +
> +  if (slp_node)
> +    gcc_assert (must_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
> +                        TYPE_VECTOR_SUBPARTS (vectype_in)));
> +
> +  tree op0 = ops[1 - reduc_index];
> +
> +  int group_size = 1;
> +  gimple *scalar_dest_def;
> +  auto_vec<tree> vec_oprnds0;
> +  if (slp_node)
> +    {
> +      vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node);
> +      group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
> +      scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1];
> +    }
> +  else
> +    {
> +      tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt);
> +      vec_oprnds0.create (1);
> +      vec_oprnds0.quick_push (loop_vec_def0);
> +      scalar_dest_def = stmt;
> +    }
> +
> +  tree scalar_dest = gimple_assign_lhs (scalar_dest_def);
> +  tree scalar_type = TREE_TYPE (scalar_dest);
> +  tree reduc_var = gimple_phi_result (reduc_def_stmt);
> +
> +  int vec_num = vec_oprnds0.length ();
> +  gcc_assert (vec_num == 1 || slp_node);
> +  tree vec_elem_type = TREE_TYPE (vectype_out);
> +  gcc_checking_assert (useless_type_conversion_p (scalar_type, vec_elem_type));
> +
> +  tree vector_identity = NULL_TREE;
> +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +    vector_identity = build_zero_cst (vectype_out);
> +
> +  tree scalar_dest_var = vect_create_destination_var (scalar_dest, NULL);
> +  int i;
> +  tree def0;
> +  FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
> +    {
> +      tree mask = NULL_TREE;
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +       mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
> +
> +      /* Handle MINUS by adding the negative.  */
> +      if (reduc_fn != IFN_LAST && code == MINUS_EXPR)
> +       {
> +         tree negated = make_ssa_name (vectype_out);
> +         new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0);
> +         gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +         def0 = negated;
> +       }
> +
> +      if (mask)
> +       def0 = merge_with_identity (gsi, mask, vectype_out, def0,
> +                                   vector_identity);
> +
> +      /* On the first iteration the input is simply the scalar phi
> +        result, and for subsequent iterations it is the output of
> +        the preceding operation.  */
> +      if (reduc_fn != IFN_LAST)
> +       {
> +         new_stmt = gimple_build_call_internal (reduc_fn, 2, reduc_var, def0);
> +         /* For chained SLP reductions the output of the previous reduction
> +            operation serves as the input of the next. For the final statement
> +            the output cannot be a temporary - we reuse the original
> +            scalar destination of the last statement.  */
> +         if (i != vec_num - 1)
> +           {
> +             gimple_set_lhs (new_stmt, scalar_dest_var);
> +             reduc_var = make_ssa_name (scalar_dest_var, new_stmt);
> +             gimple_set_lhs (new_stmt, reduc_var);
> +           }
> +       }
> +      else
> +       {
> +         reduc_var = vect_expand_fold_left (gsi, scalar_dest_var, code,
> +                                            reduc_var, def0);
> +         new_stmt = SSA_NAME_DEF_STMT (reduc_var);
> +         /* Remove the statement, so that we can use the same code paths
> +            as for statements that we've just created.  */
> +         gimple_stmt_iterator tmp_gsi = gsi_for_stmt (new_stmt);
> +         gsi_remove (&tmp_gsi, false);
> +       }
> +
> +      if (i == vec_num - 1)
> +       {
> +         gimple_set_lhs (new_stmt, scalar_dest);
> +         vect_finish_replace_stmt (scalar_dest_def, new_stmt);
> +       }
> +      else
> +       vect_finish_stmt_generation (scalar_dest_def, new_stmt, gsi);
> +
> +      if (slp_node)
> +       SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
> +    }
> +
> +  if (!slp_node)
> +    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +
> +  return true;
> +}
>
>  /* Function is_nonwrapping_integer_induction.
>
> @@ -6090,6 +6271,12 @@ vectorizable_reduction (gimple *stmt, gi
>           return true;
>         }
>
> +      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
> +       /* Leave the scalar phi in place.  Note that checking
> +          STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works
> +          for reductions involving a single statement.  */
> +       return true;
> +
>        gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info);
>        if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt)))
>         reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt));
> @@ -6316,6 +6503,14 @@ vectorizable_reduction (gimple *stmt, gi
>       directy used in stmt.  */
>    if (reduc_index == -1)
>      {
> +      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "in-order reduction chain without SLP.\n");
> +         return false;
> +       }
> +
>        if (orig_stmt)
>         reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info);
>        else
> @@ -6535,7 +6730,9 @@ vectorizable_reduction (gimple *stmt, gi
>
>    vect_reduction_type reduction_type
>      = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info);
> -  if (orig_stmt && reduction_type == TREE_CODE_REDUCTION)
> +  if (orig_stmt
> +      && (reduction_type == TREE_CODE_REDUCTION
> +         || reduction_type == FOLD_LEFT_REDUCTION))
>      {
>        /* This is a reduction pattern: get the vectype from the type of the
>           reduction variable, and get the tree-code from orig_stmt.  */
> @@ -6582,10 +6779,13 @@ vectorizable_reduction (gimple *stmt, gi
>    reduc_fn = IFN_LAST;
>
>    if (reduction_type == TREE_CODE_REDUCTION
> +      || reduction_type == FOLD_LEFT_REDUCTION
>        || reduction_type == INTEGER_INDUC_COND_REDUCTION
>        || reduction_type == CONST_COND_REDUCTION)
>      {
> -      if (reduction_fn_for_scalar_code (orig_code, &reduc_fn))
> +      if (reduction_type == FOLD_LEFT_REDUCTION
> +         ? fold_left_reduction_fn (orig_code, &reduc_fn)
> +         : reduction_fn_for_scalar_code (orig_code, &reduc_fn))
>         {
>           if (reduc_fn != IFN_LAST
>               && !direct_internal_fn_supported_p (reduc_fn, vectype_out,
> @@ -6704,6 +6904,41 @@ vectorizable_reduction (gimple *stmt, gi
>         }
>      }
>
> +  if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION)
> +    {
> +      /* We can't support in-order reductions of code such as this:
> +
> +          for (int i = 0; i < n1; ++i)
> +            for (int j = 0; j < n2; ++j)
> +              l += a[j];
> +
> +        since GCC effectively transforms the loop when vectorizing:
> +
> +          for (int i = 0; i < n1 / VF; ++i)
> +            for (int j = 0; j < n2; ++j)
> +              for (int k = 0; k < VF; ++k)
> +                l += a[j];
> +
> +        which is a reassociation of the original operation.  */
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "in-order double reduction not supported.\n");
> +
> +      return false;
> +    }
> +
> +  if (reduction_type == FOLD_LEFT_REDUCTION
> +      && slp_node
> +      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      /* We cannot use in-order reductions in this case because there is
> +         an implicit reassociation of the operations involved.  */
> +      if (dump_enabled_p ())
> +        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "in-order unchained SLP reductions not supported.\n");
> +      return false;
> +    }
> +
>    /* In case of widenning multiplication by a constant, we update the type
>       of the constant to be the type of the other operand.  We check that the
>       constant fits the type in the pattern recognition pass.  */
> @@ -6824,9 +7059,10 @@ vectorizable_reduction (gimple *stmt, gi
>         vect_model_reduction_cost (stmt_info, reduc_fn, ncopies);
>        if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
>         {
> -         if (cond_fn == IFN_LAST
> -             || !direct_internal_fn_supported_p (cond_fn, vectype_in,
> -                                                 OPTIMIZE_FOR_SPEED))
> +         if (reduction_type != FOLD_LEFT_REDUCTION
> +             && (cond_fn == IFN_LAST
> +                 || !direct_internal_fn_supported_p (cond_fn, vectype_in,
> +                                                     OPTIMIZE_FOR_SPEED)))
>             {
>               if (dump_enabled_p ())
>                 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -6846,6 +7082,10 @@ vectorizable_reduction (gimple *stmt, gi
>             vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
>                                    vectype_in);
>         }
> +      if (dump_enabled_p ()
> +         && reduction_type == FOLD_LEFT_REDUCTION)
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "using an in-order (fold-left) reduction.\n");
>        STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
>        return true;
>      }
> @@ -6861,6 +7101,11 @@ vectorizable_reduction (gimple *stmt, gi
>
>    bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>
> +  if (reduction_type == FOLD_LEFT_REDUCTION)
> +    return vectorize_fold_left_reduction
> +      (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code,
> +       reduc_fn, ops, vectype_in, reduc_index, masks);
> +
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      {
>        gcc_assert (!slp_node);
> Index: gcc/config/aarch64/aarch64.md
> ===================================================================
> --- gcc/config/aarch64/aarch64.md       2017-11-21 17:06:24.670434749 +0000
> +++ gcc/config/aarch64/aarch64.md       2017-11-21 17:06:25.013421451 +0000
> @@ -164,6 +164,7 @@ (define_c_enum "unspec" [
>      UNSPEC_STN
>      UNSPEC_INSR
>      UNSPEC_CLASTB
> +    UNSPEC_FADDA
>  ])
>
>  (define_c_enum "unspecv" [
> Index: gcc/config/aarch64/aarch64-sve.md
> ===================================================================
> --- gcc/config/aarch64/aarch64-sve.md   2017-11-21 17:06:24.670434749 +0000
> +++ gcc/config/aarch64/aarch64-sve.md   2017-11-21 17:06:25.012421490 +0000
> @@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode>
>    "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
>  )
>
> +;; Unpredicated in-order FP reductions.
> +(define_expand "fold_left_plus_<mode>"
> +  [(set (match_operand:<VEL> 0 "register_operand")
> +       (unspec:<VEL> [(match_dup 3)
> +                      (match_operand:<VEL> 1 "register_operand")
> +                      (match_operand:SVE_F 2 "register_operand")]
> +                     UNSPEC_FADDA))]
> +  "TARGET_SVE"
> +  {
> +    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
> +  }
> +)
> +
> +;; In-order FP reductions predicated with PTRUE.
> +(define_insn "*fold_left_plus_<mode>"
> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
> +       (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
> +                      (match_operand:<VEL> 2 "register_operand" "0")
> +                      (match_operand:SVE_F 3 "register_operand" "w")]
> +                     UNSPEC_FADDA))]
> +  "TARGET_SVE"
> +  "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
> +)
> +
> +;; Predicated form of the above in-order reduction.
> +(define_insn "*pred_fold_left_plus_<mode>"
> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
> +       (unspec:<VEL>
> +         [(match_operand:<VEL> 1 "register_operand" "0")
> +          (unspec:SVE_F
> +            [(match_operand:<VPRED> 2 "register_operand" "Upl")
> +             (match_operand:SVE_F 3 "register_operand" "w")
> +             (match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
> +            UNSPEC_SEL)]
> +         UNSPEC_FADDA))]
> +  "TARGET_SVE"
> +  "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
> +)
> +
>  ;; Unpredicated floating-point addition.
>  (define_expand "add<mode>3"
>    [(set (match_operand:SVE_F 0 "register_operand")
> Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-21 17:06:24.670434749 +0000
> +++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-21 17:06:25.015421374 +0000
> @@ -33,5 +33,5 @@ int main (void)
>    return main1 ();
>  }
>
> -/* Requires fast-math.  */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail 
> *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) 
> reduction} 1 "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/pr79920.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-21 17:06:24.670434749 +0000
> +++ gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-21 17:06:25.015421374 +0000
> @@ -1,5 +1,5 @@
>  /* { dg-do run } */
> -/* { dg-additional-options "-O3" } */
> +/* { dg-additional-options "-O3 -fno-fast-math" } */
>
>  #include "tree-vect.h"
>
> @@ -41,4 +41,5 @@ int main()
>    return 0;
>  }
>
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
> Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-21 17:06:24.670434749 +0000
> +++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-21 17:06:25.015421374 +0000
> @@ -46,5 +46,6 @@ int main (void)
>    return 0;
>  }
>
> -/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect"  } } */
> -/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-21 17:06:24.670434749 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-21 17:06:25.015421374 +0000
> @@ -1,4 +1,5 @@
>  /* { dg-require-effective-target vect_float } */
> +/* { dg-additional-options "-fno-fast-math" } */
>
>  #include <stdarg.h>
>  #include "tree-vect.h"
> @@ -48,6 +49,5 @@ int main (void)
>    return 0;
>  }
>
> -/* need -ffast-math to vectorizer these loops.  */
> -/* ARM NEON passes -ffast-math to these tests, so expect this to fail.  */
> -/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */
> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-1.c   2017-11-21 17:06:25.015421374 +0000
> @@ -0,0 +1,42 @@
> +/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
> +/* { dg-require-effective-target vect_double } */
> +/* { dg-add-options ieee } */
> +/* { dg-additional-options "-fno-fast-math" } */
> +
> +#include "tree-vect.h"
> +
> +#define N (VECTOR_BITS * 17)
> +
> +double __attribute__ ((noinline, noclone))
> +reduc_plus_double (double *a, double *b)
> +{
> +  double r = 0, q = 3;
> +  for (int i = 0; i < N; i++)
> +    {
> +      r += a[i];
> +      q -= b[i];
> +    }
> +  return r * q;
> +}
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  double a[N];
> +  double b[N];
> +  double r = 0, q = 3;
> +  for (int i = 0; i < N; i++)
> +    {
> +      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
> +      b[i] = (i * 0.3) * (i & 1 ? 1 : -1);
> +      r += a[i];
> +      q -= b[i];
> +      asm volatile ("" ::: "memory");
> +    }
> +  double res = reduc_plus_double (a, b);
> +  if (res != r * q)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 2 "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-2.c   2017-11-21 17:06:25.015421374 +0000
> @@ -0,0 +1,44 @@
> +/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
> +/* { dg-require-effective-target vect_double } */
> +/* { dg-add-options ieee } */
> +/* { dg-additional-options "-fno-fast-math" } */
> +
> +#include "tree-vect.h"
> +
> +#define N (VECTOR_BITS * 17)
> +
> +double __attribute__ ((noinline, noclone))
> +reduc_plus_double (double *restrict a, int n)
> +{
> +  double res = 0.0;
> +  for (int i = 0; i < n; i++)
> +    for (int j = 0; j < N; j++)
> +      res += a[i];
> +  return res;
> +}
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  int n = 19;
> +  double a[N];
> +  double r = 0;
> +  for (int i = 0; i < N; i++)
> +    {
> +      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
> +      asm volatile ("" ::: "memory");
> +    }
> +  for (int i = 0; i < n; i++)
> +    for (int j = 0; j < N; j++)
> +      {
> +       r += a[i];
> +       asm volatile ("" ::: "memory");
> +      }
> +  double res = reduc_plus_double (a, n);
> +  if (res != r)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times {in-order double reduction not supported} 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-3.c   2017-11-21 17:06:25.016421335 +0000
> @@ -0,0 +1,42 @@
> +/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
> +/* { dg-require-effective-target vect_double } */
> +/* { dg-add-options ieee } */
> +/* { dg-additional-options "-fno-fast-math" } */
> +
> +#include "tree-vect.h"
> +
> +#define N (VECTOR_BITS * 17)
> +
> +double __attribute__ ((noinline, noclone))
> +reduc_plus_double (double *a)
> +{
> +  double r = 0;
> +  for (int i = 0; i < N; i += 4)
> +    {
> +      r += a[i] * 2.0;
> +      r += a[i + 1] * 3.0;
> +      r += a[i + 2] * 4.0;
> +      r += a[i + 3] * 5.0;
> +    }
> +  return r;
> +}
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  double a[N];
> +  double r = 0;
> +  for (int i = 0; i < N; i++)
> +    {
> +      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
> +      r += a[i] * (i % 4 + 2);
> +      asm volatile ("" ::: "memory");
> +    }
> +  double res = reduc_plus_double (a);
> +  if (res != r)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times {using an in-order \(fold-left\) reduction} 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times {vectorizing stmts using SLP} 1 "vect" } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-in-order-4.c   2017-11-21 17:06:25.016421335 +0000
> @@ -0,0 +1,45 @@
> +/* { dg-do run { xfail { { i?86-*-* x86_64-*-* } && ia32 } } } */
> +/* { dg-require-effective-target vect_double } */
> +/* { dg-add-options ieee } */
> +/* { dg-additional-options "-fno-fast-math" } */
> +
> +#include "tree-vect.h"
> +
> +#define N (VECTOR_BITS * 17)
> +
> +double __attribute__ ((noinline, noclone))
> +reduc_plus_double (double *a)
> +{
> +  double r1 = 0;
> +  double r2 = 0;
> +  double r3 = 0;
> +  double r4 = 0;
> +  for (int i = 0; i < N; i += 4)
> +    {
> +      r1 += a[i];
> +      r2 += a[i + 1];
> +      r3 += a[i + 2];
> +      r4 += a[i + 3];
> +    }
> +  return r1 * r2 * r3 * r4;
> +}
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  double a[N];
> +  double r[4] = {};
> +  for (int i = 0; i < N; i++)
> +    {
> +      a[i] = (i * 0.1) * (i & 1 ? 1 : -1);
> +      r[i % 4] += a[i];
> +      asm volatile ("" ::: "memory");
> +    }
> +  double res = reduc_plus_double (a);
> +  if (res != r[0] * r[1] * r[2] * r[3])
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times {in-order unchained SLP reductions not supported} 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vectorizing stmts using SLP} "vect" } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c       2017-11-21 17:06:25.016421335 +0000
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
> +
> +#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))
> +
> +#define DEF_REDUC_PLUS(TYPE)                   \
> +  TYPE __attribute__ ((noinline, noclone))     \
> +  reduc_plus_##TYPE (TYPE *a, TYPE *b)         \
> +  {                                            \
> +    TYPE r = 0, q = 3;                         \
> +    for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
> +      {                                                \
> +       r += a[i];                              \
> +       q -= b[i];                              \
> +      }                                                \
> +    return r * q;                              \
> +  }
> +
> +#define TEST_ALL(T) \
> +  T (_Float16) \
> +  T (float) \
> +  T (double)
> +
> +TEST_ALL (DEF_REDUC_PLUS)
> +
> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c   2017-11-21 17:06:25.016421335 +0000
> @@ -0,0 +1,29 @@
> +/* { dg-do run { target { aarch64_sve_hw } } } */
> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
> +
> +#include "sve_reduc_strict_1.c"
> +
> +#define TEST_REDUC_PLUS(TYPE)                  \
> +  {                                            \
> +    TYPE a[NUM_ELEMS (TYPE)];                  \
> +    TYPE b[NUM_ELEMS (TYPE)];                  \
> +    TYPE r = 0, q = 3;                         \
> +    for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
> +      {                                                \
> +       a[i] = (i * 0.1) * (i & 1 ? 1 : -1);    \
> +       b[i] = (i * 0.3) * (i & 1 ? 1 : -1);    \
> +       r += a[i];                              \
> +       q -= b[i];                              \
> +       asm volatile ("" ::: "memory");         \
> +      }                                                \
> +    TYPE res = reduc_plus_##TYPE (a, b);       \
> +    if (res != r * q)                          \
> +      __builtin_abort ();                      \
> +  }
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  TEST_ALL (TEST_REDUC_PLUS);
> +  return 0;
> +}
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c       2017-11-21 17:06:25.016421335 +0000
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
> +
> +#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
> +
> +#define DEF_REDUC_PLUS(TYPE)                                   \
> +void __attribute__ ((noinline, noclone))                       \
> +reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS (TYPE)],       \
> +                  TYPE *restrict r, int n)                     \
> +{                                                              \
> +  for (int i = 0; i < n; i++)                                  \
> +    {                                                          \
> +      r[i] = 0;                                                        \
> +      for (int j = 0; j < NUM_ELEMS (TYPE); j++)               \
> +        r[i] += a[i][j];                                       \
> +    }                                                          \
> +}
> +
> +#define TEST_ALL(T) \
> +  T (_Float16) \
> +  T (float) \
> +  T (double)
> +
> +TEST_ALL (DEF_REDUC_PLUS)
> +
> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c   2017-11-21 17:06:25.016421335 +0000
> @@ -0,0 +1,31 @@
> +/* { dg-do run { target { aarch64_sve_hw } } } */
> +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */
> +
> +#include "sve_reduc_strict_2.c"
> +
> +#define NROWS 5
> +
> +#define TEST_REDUC_PLUS(TYPE)                                  \
> +  {                                                            \
> +    TYPE a[NROWS][NUM_ELEMS (TYPE)];                           \
> +    TYPE r[NROWS];                                             \
> +    TYPE expected[NROWS] = {};                                 \
> +    for (int i = 0; i < NROWS; ++i)                            \
> +      for (int j = 0; j < NUM_ELEMS (TYPE); ++j)               \
> +       {                                                       \
> +         a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1);     \
> +         expected[i] += a[i][j];                               \
> +         asm volatile ("" ::: "memory");                       \
> +       }                                                       \
> +    reduc_plus_##TYPE (a, r, NROWS);                           \
> +    for (int i = 0; i < NROWS; ++i)                            \
> +      if (r[i] != expected[i])                                 \
> +       __builtin_abort ();                                     \
> +  }
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  TEST_ALL (TEST_REDUC_PLUS);
> +  return 0;
> +}
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c
> ===================================================================
> --- /dev/null   2017-11-20 18:51:34.589640877 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c       2017-11-21 17:06:25.016421335 +0000
> @@ -0,0 +1,131 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve -msve-vector-bits=256 -fdump-tree-vect-details" } */
> +
> +double mat[100][4];
> +double mat2[100][8];
> +double mat3[100][12];
> +double mat4[100][3];
> +
> +double
> +slp_reduc_plus (int n)
> +{
> +  double tmp = 0.0;
> +  for (int i = 0; i < n; i++)
> +    {
> +      tmp = tmp + mat[i][0];
> +      tmp = tmp + mat[i][1];
> +      tmp = tmp + mat[i][2];
> +      tmp = tmp + mat[i][3];
> +    }
> +  return tmp;
> +}
> +
> +double
> +slp_reduc_plus2 (int n)
> +{
> +  double tmp = 0.0;
> +  for (int i = 0; i < n; i++)
> +    {
> +      tmp = tmp + mat2[i][0];
> +      tmp = tmp + mat2[i][1];
> +      tmp = tmp + mat2[i][2];
> +      tmp = tmp + mat2[i][3];
> +      tmp = tmp + mat2[i][4];
> +      tmp = tmp + mat2[i][5];
> +      tmp = tmp + mat2[i][6];
> +      tmp = tmp + mat2[i][7];
> +    }
> +  return tmp;
> +}
> +
> +double
> +slp_reduc_plus3 (int n)
> +{
> +  double tmp = 0.0;
> +  for (int i = 0; i < n; i++)
> +    {
> +      tmp = tmp + mat3[i][0];
> +      tmp = tmp + mat3[i][1];
> +      tmp = tmp + mat3[i][2];
> +      tmp = tmp + mat3[i][3];
> +      tmp = tmp + mat3[i][4];
> +      tmp = tmp + mat3[i][5];
> +      tmp = tmp + mat3[i][6];
> +      tmp = tmp + mat3[i][7];
> +      tmp = tmp + mat3[i][8];
> +      tmp = tmp + mat3[i][9];
> +      tmp = tmp + mat3[i][10];
> +      tmp = tmp + mat3[i][11];
> +    }
> +  return tmp;
> +}
> +
> +void
> +slp_non_chained_reduc (int n, double * restrict out)
> +{
> +  for (int i = 0; i < 3; i++)
> +    out[i] = 0;
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      out[0] = out[0] + mat4[i][0];
> +      out[1] = out[1] + mat4[i][1];
> +      out[2] = out[2] + mat4[i][2];
> +    }
> +}
> +
> +/* Strict FP reductions shouldn't be used for the outer loops, only the
> +   inner loops.  */
> +
> +float
> +double_reduc1 (float (*restrict i)[16])
> +{
> +  float l = 0;
> +
> +  for (int a = 0; a < 8; a++)
> +    for (int b = 0; b < 8; b++)
> +      l += i[b][a];
> +  return l;
> +}
> +
> +float
> +double_reduc2 (float *restrict i)
> +{
> +  float l = 0;
> +
> +  for (int a = 0; a < 8; a++)
> +    for (int b = 0; b < 16; b++)
> +      {
> +        l += i[b * 4];
> +        l += i[b * 4 + 1];
> +        l += i[b * 4 + 2];
> +        l += i[b * 4 + 3];
> +      }
> +  return l;
> +}
> +
> +float
> +double_reduc3 (float *restrict i, float *restrict j)
> +{
> +  float k = 0, l = 0;
> +
> +  for (int a = 0; a < 8; a++)
> +    for (int b = 0; b < 8; b++)
> +      {
> +        k += i[b];
> +        l += j[b];
> +      }
> +  return l * k;
> +}
> +
> +/* We can't yet handle double_reduc1.  */
> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
> +/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3.  Each one
> +   is reported three times, once for SVE, once for 128-bit AdvSIMD and once
> +   for 64-bit AdvSIMD.  */
> +/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */
> +/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
> +   double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
> +   before failing.  */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-21 17:06:24.670434749 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-21 17:06:25.016421335 +0000
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
> +/* The cost model thinks that the double loop isn't a win for SVE-128.  */
> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -fno-vect-cost-model" } */
>
>  #include <stdint.h>
>
> @@ -24,7 +25,10 @@ #define TEST_ALL(T)                          \
>    T (int32_t)                                  \
>    T (uint32_t)                                 \
>    T (int64_t)                                  \
> -  T (uint64_t)
> +  T (uint64_t)                                 \
> +  T (_Float16)                                 \
> +  T (float)                                    \
> +  T (double)
>
>  TEST_ALL (VEC_PERM)
>
> @@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
>  /* ??? We don't treat the uint loops as SLP.  */
>  /* The loop should be fully-masked.  */
>  /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
> -/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
> +/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
>  /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
>
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */
> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */
> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
>
>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-not {\tfadd\n} } } */
>
>  /* { dg-final { scan-assembler-not {\tuqdec} } } */
> Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90
> ===================================================================
> --- gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-21 17:06:24.670434749 +0000
> +++ gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-21 17:06:25.016421335 +0000
> @@ -704,5 +704,5 @@ CALL track('KERNEL  ')
>  RETURN
>  END SUBROUTINE kernel
>
> -! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }
> +! { dg-final { scan-tree-dump-times "vectorized 22 loops" 1 "vect" { target vect_intdouble_cvt } } }
>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
