"Andre Vieira (lists)" <andre.simoesdiasvie...@arm.com> writes:
> Hi,
>
> Addressed all of your comments bar the pred ops one.
>
> Is this OK?
>
>
> gcc/ChangeLog:
>
>          * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>          determine_suggested_unroll_factor and m_nosve_pattern.
>          (determine_suggested_unroll_factor): New function.
>          (aarch64_vector_costs::add_stmt_cost): Check for a qualifying
>          pattern to set m_nosve_pattern.
>          (aarch64_vector_costs::finish_cost): Use
>          determine_suggested_unroll_factor.
>          * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New.
>
> On 16/03/2022 18:01, Richard Sandiford wrote:
>> "Andre Vieira (lists)" <andre.simoesdiasvie...@arm.com> writes:
>>> Hi,
>>>
>>> This patch implements the costing function
>>> determine_suggested_unroll_factor for aarch64.
>>> It determines the unrolling factor by dividing the number of X
>>> operations we can do per cycle by the number of X operations in the loop
>>> body, taking this information from the vec_ops analysis during vector
>>> costing and from the available issue_info.
>>> We multiply the dividend by a potential reduction_latency, to improve
>>> pipeline utilization when we would otherwise be stalled waiting on a
>>> particular reduction operation.
>>>
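(To make the arithmetic above concrete, with made-up issue rates: if the
tuning model can issue 2 general ops per cycle, the loop body contains 6
general ops and the reduction latency is 4, the general-ops term gives
CEIL (4 * 2, 6) = 2.  The suggestion is then the minimum of the
per-resource terms and the overall cap, taken as a maximum over the
costed alternatives and rounded up to the next power of two.)
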
>>> Right now we also have a workaround for vectorization choices where the
>>> main loop uses a NEON mode and predication is available: if the main
>>> loop makes use of a NEON pattern that is not directly supported by SVE,
>>> we do not unroll, as that might cause performance regressions in cases
>>> where we would have entered the loop at the original main loop's VF.
>>> As an example, if you have a loop where you could use AVG_CEIL with a
>>> V8HI mode, you would originally get an 8x NEON loop using AVG_CEIL
>>> followed by an 8x SVE predicated epilogue using other instructions.
>>> Whereas with the unrolling you would end up with a 16x AVG_CEIL NEON
>>> loop + an 8x SVE predicated epilogue, thus skipping the original 8x
>>> NEON option. In the future, we could handle this differently, by either
>>> using a different costing model for epilogues, or potentially
>>> vectorizing more than one epilogue.
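For reference, the kind of source loop that gets recognised as
IFN_AVG_CEIL is a rounding halving add, as in x264's pixel averaging.
An untested sketch:

    void
    avg_ceil (unsigned char *restrict dst, unsigned char *restrict a,
              unsigned char *restrict b, int n)
    {
      /* The vectoriser's average pattern turns this into IFN_AVG_CEIL;
         Advanced SIMD has URHADD for it, while SVE only gains an
         equivalent form with SVE2.  */
      for (int i = 0; i < n; i++)
        dst[i] = (a[i] + b[i] + 1) >> 1;
    }

On an SVE2 target the predicated epilogue can use the same form, which
is why the check below keys off TARGET_SVE2.
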
>>>
>>> gcc/ChangeLog:
>>>
>>>           * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>>> determine_suggested_unroll_factor.
>>>           (determine_suggested_unroll_factor): New function.
>>>           (aarch64_vector_costs::finish_cost): Use
>>> determine_suggested_unroll_factor.
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -15680,6 +15680,7 @@ private:
>>>     unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>>                              unsigned int);
>>>     bool prefer_unrolled_loop () const;
>>> +  unsigned int determine_suggested_unroll_factor ();
>>>   
>>>     /* True if we have performed one-time initialization based on the
>>>        vec_info.  */
>>> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>>     return sve_cycles_per_iter;
>>>   }
>>>   
>>> +unsigned int
>>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>>> +{
>>> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
>>> +  if (!issue_info)
>>> +    return 1;
>>> +  bool sve = false;
>>> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))
>> Other code uses m_vec_flags & VEC_ANY_SVE for this.
>>
>>> +    {
>>> +      if (!issue_info->sve)
>>> +   return 1;
>>> +      sve = true;
>>> +    }
>>> +  else
>>> +    {
>>> +      if (!issue_info->advsimd)
>>> +   return 1;
>> The issue info should instead be taken from vec_ops.simd_issue_info ()
>> in the loop below.  It can vary for each entry.
>>
>>> +      /* If we are trying to unroll a NEON main loop that contains patterns
>> s/a NEON/an Advanced SIMD/
>>
>>> +    that we do not support with SVE and we might use a predicated
>>> +    epilogue, we need to be conservative and block unrolling as this might
>>> +    lead to a less optimal loop for the first and only epilogue using the
>>> +    original loop's vectorization factor.
>>> +    TODO: Remove this constraint when we add support for multiple epilogue
>>> +    vectorization.  */
>>> +      if (partial_vectors_supported_p ()
>>> +     && param_vect_partial_vector_usage != 0
>>> +     && !TARGET_SVE2)
>>> +   {
>>> +     unsigned int i;
>>> +     stmt_vec_info stmt_vinfo;
>>> +     FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
>>> +       {
>>> +         if (is_pattern_stmt_p (stmt_vinfo))
>>> +           {
>>> +             gimple *stmt = stmt_vinfo->stmt;
>>> +             if (is_gimple_call (stmt)
>>> +                 && gimple_call_internal_p (stmt))
>>> +               {
>>> +                 enum internal_fn ifn
>>> +                   = gimple_call_internal_fn (stmt);
>>> +                 switch (ifn)
>>> +                   {
>>> +                   case IFN_AVG_FLOOR:
>>> +                   case IFN_AVG_CEIL:
>>> +                     return 1;
>>> +                   default:
>>> +                     break;
>>> +                   }
>>> +               }
>>> +           }
>>> +       }
>>> +   }
>> I think we should instead record this during add_stmt_cost, like we do
>> for the other data that this function uses.
>>
>> We need to drop the partial_vectors_supported_p () condition though:
>> enabling SVE must not make Advanced SIMD code worse.  So for now,
>> Advanced SIMD-only code will have to suffer the missed unrolling
>> opportunity too.
>>
>>> +    }
>>> +
>>> +  unsigned int max_unroll_factor = 1;
>>> +  aarch64_simd_vec_issue_info const *vec_issue
>>> +    = sve ? issue_info->sve : issue_info->advsimd;
>>> +  for (auto vec_ops : m_ops)
>>> +    {
>>> +      /* Limit unroll factor to 4 for now.  */
>>> +      unsigned int unroll_factor = 4;
>> Would be good to make this an aarch64-specific --param in case people
>> want to play with different values.
>>
>>> +      unsigned int factor
>>> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>>> +      unsigned int temp;
>>> +
>>> +      /* Sanity check, this should never happen.  */
>>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>>> +   return 1;
>>> +
>>> +      /* Check stores.  */
>>> +      if (vec_ops.stores > 0)
>>> +   {
>>> +     temp = CEIL (factor * vec_issue->stores_per_cycle,
>>> +                  vec_ops.stores);
>>> +     unroll_factor = MIN (unroll_factor, temp);
>>> +   }
>>> +
>>> +      /* Check loads.  */
>>> +      if (vec_ops.loads > 0)
>>> +   {
>>> +     temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>>> +                  vec_ops.loads);
>> This should use loads + stores rather than just loads.
>>
>>> +     unroll_factor = MIN (unroll_factor, temp);
>>> +   }
>>> +
>>> +      /* Check general ops.  */
>>> +      if (vec_ops.general_ops > 0)
>>> +   {
>>> +     temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>>> +                  vec_ops.general_ops);
>>> +     unroll_factor = MIN (unroll_factor, temp);
>>> +    }
>> In principle we should also take pred_ops into account when
>> sve_issue_info () is nonnull, while ignoring the pred_ops associated
>> with loop predication (added by analyze_loop_vinfo).  I guess we
>> should maintain two pred_op counts.
>>
>> That's just a suggestion for a possible future improvement though.
>> It doesn't need to be done as part of this patch.
>>
>> Thanks,
>> Richard
>>
>>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>>> +    }
>>> +
>>> +  /* Make sure unroll factor is power of 2.  */
>>> +  return 1 << ceil_log2 (max_unroll_factor);
>>> +}
>>> +
>>>   /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>>      and return the new cost.  */
>>>   unsigned int
>>> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>>     if (loop_vinfo
>>>         && m_vec_flags
>>>         && aarch64_use_new_vector_costs_p ())
>>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>> -                                      m_costs[vect_body]);
>>> +    {
>>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>> +                                        m_costs[vect_body]);
>>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>>> +    }
>>>   
>>>     /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>>        the scalar code in the event of a tie, since there is more chance
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15637,6 +15637,7 @@ private:
>    unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>                                unsigned int);
>    bool prefer_unrolled_loop () const;
> +  unsigned int determine_suggested_unroll_factor ();
>  
>    /* True if we have performed one-time initialization based on the
>       vec_info.  */
> @@ -15699,6 +15700,10 @@ private:
>       or vector loop.  There is one entry for each tuning option of
>       interest.  */
>    auto_vec<aarch64_vec_op_count, 2> m_ops;
> +
> +  /* This loop uses a pattern that is not supported by SVE, but is supported by
> +     Advanced SIMD and SVE2.  */
> +  bool m_nosve_pattern = false;

This is true for other things besides AVG_FLOOR/CEIL.  I think we should
just accept that this is an x264 hack ;-) and name it “m_has_avg” instead.

Could you put it with the other bool field?  That would lead to a nicer
layout, although it hardly matters much here.

>  };
>  
>  aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
> @@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>        as one iteration of the SVE loop.  */
>        if (where == vect_body && m_unrolled_advsimd_niters)
>       m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters;
> +
> +      if (is_pattern_stmt_p (stmt_info))
> +     {

Not sure this is necessary.  We'd want the same thing if (for whatever
reason) we had a pre-existing IFN_AVG_FLOOR/CEIL in future.

> +       gimple *stmt = stmt_info->stmt;
> +       if (is_gimple_call (stmt)
> +           && gimple_call_internal_p (stmt))
> +         {
> +           enum internal_fn ifn
> +             = gimple_call_internal_fn (stmt);
> +           switch (ifn)

Very minor nit, but might as well switch on gimple_call_internal_fn (stmt)
directly, to avoid the awkward line break.
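i.e. just:

	    switch (gimple_call_internal_fn (stmt))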

> +             {
> +             case IFN_AVG_FLOOR:
> +             case IFN_AVG_CEIL:
> +               m_nosve_pattern = true;
> +             default:
> +               break;
> +             }
> +         }
> +     }
>      }
>    return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>  }
> @@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>    return sve_cycles_per_iter;
>  }
>  
> +unsigned int
> +aarch64_vector_costs::determine_suggested_unroll_factor ()
> +{
> +  bool sve = m_vec_flags & VEC_ANY_SVE;
> +  /* If we are trying to unroll an Advanced SIMD main loop that contains
> +     patterns that we do not support with SVE and we might use a predicated
> +     epilogue, we need to be conservative and block unrolling as this might
> +     lead to a less optimal loop for the first and only epilogue using the
> +     original loop's vectorization factor.
> +     TODO: Remove this constraint when we add support for multiple epilogue
> +     vectorization.  */
> +  if (!sve && TARGET_SVE2 && m_nosve_pattern)
> +    return 1;

Shouldn't this be !TARGET_SVE2?
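(Presumably the intention is to block unrolling only when the SVE
epilogue cannot use the AVG pattern, i.e. when SVE2 is not available;
with SVE2 the predicated epilogue can use the same instructions, so
unrolling should be fine there.)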

> +
> +  unsigned int max_unroll_factor = 1;
> +  for (auto vec_ops : m_ops)
> +    {
> +      aarch64_simd_vec_issue_info const *vec_issue
> +     = vec_ops.simd_issue_info ();
> +      if (!vec_issue)
> +     return 1;
> +      /* Limit unroll factor to a value adjustable by the user, the default
> +      value is 4. */
> +      unsigned int unroll_factor = aarch64_vect_unroll_limit;
> +      unsigned int factor
> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
> +      unsigned int temp;
> +
> +      /* Sanity check, this should never happen.  */
> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
> +     return 1;
> +
> +      /* Check stores.  */
> +      if (vec_ops.stores > 0)
> +     {
> +       temp = CEIL (factor * vec_issue->stores_per_cycle,
> +                    vec_ops.stores);
> +       unroll_factor = MIN (unroll_factor, temp);
> +     }
> +
> +      /* Check loads + stores.  */
> +      if (vec_ops.loads > 0)
> +     {
> +       temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
> +                    vec_ops.loads + vec_ops.stores);
> +       unroll_factor = MIN (unroll_factor, temp);
> +     }
> +
> +      /* Check general ops.  */
> +      if (vec_ops.general_ops > 0)
> +     {
> +       temp = CEIL (factor * vec_issue->general_ops_per_cycle,
> +                    vec_ops.general_ops);
> +       unroll_factor = MIN (unroll_factor, temp);
> +      }
> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
> +    }
> +
> +  /* Make sure unroll factor is power of 2.  */
> +  return 1 << ceil_log2 (max_unroll_factor);
> +}
> +
>  /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>     and return the new cost.  */
>  unsigned int
> @@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>    if (loop_vinfo
>        && m_vec_flags
>        && aarch64_use_new_vector_costs_p ())
> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> -                                        m_costs[vect_body]);
> +    {
> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> +                                          m_costs[vect_body]);
> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> +    }
>  
>    /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>       the scalar code in the event of a tie, since there is more chance
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence.
>  -param=aarch64-mops-memset-size-threshold=
>  Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param
>  Constant memset size in bytes from which to start using MOPS sequence.
> +
> +-param=aarch64-vect-unroll-limit=
> +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> +Limit how much the autovectorizer may unroll a loop.

This needs to be documented in invoke.texi, alongside the existing
AArch64-specific params.
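For reference, users would then be able to tune this with something
like (illustrative invocation, -mcpu choice arbitrary):

    gcc -O3 -mcpu=neoverse-v1 --param=aarch64-vect-unroll-limit=2 foo.c

with a limit of 1 effectively disabling the unrolling.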

OK with those changes, thanks.

Richard