"Andre Vieira (lists)" <andre.simoesdiasvie...@arm.com> writes:
> Hi,
>
> This patch implements the costing function
> determine_suggested_unroll_factor for aarch64.
> It determines the unroll factor by dividing the number of operations of
> a given kind that we can issue per cycle by the number of operations of
> that kind in the loop body, taking this information from the vec_ops
> analysis during vector costing and the available issue_info. Where a
> reduction is present, we multiply the dividend by the reduction
> latency, to improve pipeline utilization when we would otherwise stall
> waiting on a particular reduction operation.
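>
> For example (numbers purely illustrative): with reduction_latency = 2,
> a loop body containing 2 loads, 1 store and 4 general ops, and issue
> rates of 3 loads/stores, 2 stores and 4 general ops per cycle, the
> heuristic computes min (4, CEIL (2*2, 1), CEIL (2*3, 2), CEIL (2*4, 4))
> = min (4, 4, 3, 2) = 2, i.e. a suggested unroll factor of 2 (the 4 is
> the cap we currently apply).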
>
> Right now we also have a workaround for vectorization choices where the
> main loop uses a NEON mode and predication is available: if the main
> loop makes use of a NEON pattern that is not directly supported by SVE,
> we do not unroll, as unrolling might cause performance regressions in
> cases where we would otherwise enter the loop at the original main
> loop's VF. As an example, if you have a loop where you could use
> AVG_CEIL with a V8HI mode, you would originally get an 8x NEON main
> loop using AVG_CEIL followed by an 8x SVE predicated epilogue using
> other instructions. With the unrolling you would instead end up with
> 16x AVG_CEIL NEON + an 8x SVE predicated loop, thus skipping the
> original 8x NEON loop. In the future we could handle this differently,
> either by using a different costing model for epilogues or by
> vectorizing more than one epilogue.
>
> gcc/ChangeLog:
>
>          * config/aarch64/aarch64.cc (aarch64_vector_costs): Declare
>          determine_suggested_unroll_factor.
>          (determine_suggested_unroll_factor): New function.
>          (aarch64_vector_costs::finish_cost): Use
>          determine_suggested_unroll_factor.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15680,6 +15680,7 @@ private:
>    unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>                                unsigned int);
>    bool prefer_unrolled_loop () const;
> +  unsigned int determine_suggested_unroll_factor ();
>  
>    /* True if we have performed one-time initialization based on the
>       vec_info.  */
> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>    return sve_cycles_per_iter;
>  }
>  
> +unsigned int
> +aarch64_vector_costs::determine_suggested_unroll_factor ()
> +{
> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
> +  if (!issue_info)
> +    return 1;
> +  bool sve = false;
> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))

Other code uses m_vec_flags & VEC_ANY_SVE for this.
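
I.e. something like:

  bool sve = (m_vec_flags & VEC_ANY_SVE) != 0;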

> +    {
> +      if (!issue_info->sve)
> +     return 1;
> +      sve = true;
> +    }
> +  else
> +    {
> +      if (!issue_info->advsimd)
> +     return 1;

The issue info should instead be taken from vec_ops.simd_issue_info ()
in the loop below.  It can vary for each entry.
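
I.e. something like (untested sketch):

  for (auto vec_ops : m_ops)
    {
      /* Pick the issue info that matches this entry's mode.  */
      const aarch64_simd_vec_issue_info *vec_issue
	= vec_ops.simd_issue_info ();
      if (!vec_issue)
	return 1;
      ...
    }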

> +      /* If we are trying to unroll a NEON main loop that contains patterns

s/a NEON/an Advanced SIMD/

> +      that we do not support with SVE and we might use a predicated
> +      epilogue, we need to be conservative and block unrolling as this might
> +      lead to a less optimal loop for the first and only epilogue using the
> +      original loop's vectorization factor.
> +      TODO: Remove this constraint when we add support for multiple epilogue
> +      vectorization.  */
> +      if (partial_vectors_supported_p ()
> +       && param_vect_partial_vector_usage != 0
> +       && !TARGET_SVE2)
> +     {
> +       unsigned int i;
> +       stmt_vec_info stmt_vinfo;
> +       FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
> +         {
> +           if (is_pattern_stmt_p (stmt_vinfo))
> +             {
> +               gimple *stmt = stmt_vinfo->stmt;
> +               if (is_gimple_call (stmt)
> +                   && gimple_call_internal_p (stmt))
> +                 {
> +                   enum internal_fn ifn
> +                     = gimple_call_internal_fn (stmt);
> +                   switch (ifn)
> +                     {
> +                     case IFN_AVG_FLOOR:
> +                     case IFN_AVG_CEIL:
> +                       return 1;
> +                     default:
> +                       break;
> +                     }
> +                 }
> +             }
> +         }
> +     }

I think we should instead record this during add_stmt_cost, like we do
for the other data that this function uses.

We need to drop the partial_vectors_supported_p () condition though:
enabling SVE must not make Advanced SIMD code worse.  So for now,
Advanced SIMD-only code will have to suffer the missed unrolling
opportunity too.
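
Untested sketch, with the member name purely for illustration:

  /* In aarch64_vector_costs: whether the loop body uses a pattern
     statement with no direct SVE equivalent.  */
  bool m_has_avg = false;

  /* In add_stmt_cost, when processing STMT_INFO:  */
  if (stmt_info && is_pattern_stmt_p (stmt_info))
    if (gcall *call = dyn_cast<gcall *> (stmt_info->stmt))
      if (gimple_call_internal_p (call))
	switch (gimple_call_internal_fn (call))
	  {
	  case IFN_AVG_FLOOR:
	  case IFN_AVG_CEIL:
	    m_has_avg = true;
	    break;
	  default:
	    break;
	  }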

> +    }
> +
> +  unsigned int max_unroll_factor = 1;
> +  aarch64_simd_vec_issue_info const *vec_issue
> +    = sve ? issue_info->sve : issue_info->advsimd;
> +  for (auto vec_ops : m_ops)
> +    {
> +      /* Limit unroll factor to 4 for now.  */
> +      unsigned int unroll_factor = 4;

Would be good to make this an aarch64-specific --param in case people
want to play with different values.
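
E.g. an entry in config/aarch64/aarch64.opt along these lines (name
and default just placeholders):

  -param=aarch64-vect-unroll-limit=
  Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
  Limit how much the autovectorizer may unroll a loop.

with the initialiser of unroll_factor below then becoming
aarch64_vect_unroll_limit instead of the hard-coded 4.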

> +      unsigned int factor
> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
> +      unsigned int temp;
> +
> +      /* Sanity check, this should never happen.  */
> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
> +     return 1;
> +
> +      /* Check stores.  */
> +      if (vec_ops.stores > 0)
> +     {
> +       temp = CEIL (factor * vec_issue->stores_per_cycle,
> +                    vec_ops.stores);
> +       unroll_factor = MIN (unroll_factor, temp);
> +     }
> +
> +      /* Check loads.  */
> +      if (vec_ops.loads > 0)
> +     {
> +       temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
> +                    vec_ops.loads);

This should use loads + stores rather than just loads.
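
I.e.:

	temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
		     vec_ops.loads + vec_ops.stores);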

> +       unroll_factor = MIN (unroll_factor, temp);
> +     }
> +
> +      /* Check general ops.  */
> +      if (vec_ops.general_ops > 0)
> +     {
> +       temp = CEIL (factor * vec_issue->general_ops_per_cycle,
> +                    vec_ops.general_ops);
> +       unroll_factor = MIN (unroll_factor, temp);
> +      }

In principle we should also take pred_ops into account when
sve_issue_info () is nonnull, while ignoring the pred_ops associated
with loop predication (added by analyze_loop_vinfo).  I guess we
should maintain two pred_op counts.

That's just a suggestion for a possible future improvement though.
It doesn't need to be done as part of this patch.
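
(If we did, an untested sketch might look like:

      if (const aarch64_sve_vec_issue_info *sve_issue
	    = vec_ops.sve_issue_info ())
	if (vec_ops.pred_ops > 0)
	  {
	    temp = CEIL (factor * sve_issue->pred_ops_per_cycle,
			 vec_ops.pred_ops);
	    unroll_factor = MIN (unroll_factor, temp);
	  }

with pred_ops adjusted to exclude the ops that analyze_loop_vinfo
adds for loop predication.)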

Thanks,
Richard

> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
> +    }
> +
> +  /* Make sure unroll factor is power of 2.  */
> +  return 1 << ceil_log2 (max_unroll_factor);
> +}
> +
>  /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>     and return the new cost.  */
>  unsigned int
> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>    if (loop_vinfo
>        && m_vec_flags
>        && aarch64_use_new_vector_costs_p ())
> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> -                                        m_costs[vect_body]);
> +    {
> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> +                                          m_costs[vect_body]);
> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> +    }
>  
>    /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>       the scalar code in the event of a tie, since there is more chance
