> On 18 Oct 2024, at 09:11, Richard Biener <rguent...@suse.de> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> On Thu, 17 Oct 2024, Richard Sandiford wrote:
> 
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> [...]
>>> Looking at the diff of the vect dumps (below is a section of the diff
>>> for strided_store_2.c), it seemed odd that vec_to_scalar operations
>>> cost 0 now, instead of the previous cost of 2:
>>> 
>>> +strided_store_1.c:38:151: note: === vectorizable_operation ===
>>> +strided_store_1.c:38:151: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7;
>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned access.
>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>> +strided_store_1.c:38:151: note: vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>> -_7 1 times vec_to_scalar costs 2 in body
>>> +<unknown> 1 times vector_load costs 1 in prologue
>>> +_7 1 times vec_to_scalar costs 0 in body
>>> _7 1 times scalar_store costs 1 in body
>>> -_7 1 times vec_to_scalar costs 2 in body
>>> +_7 1 times vec_to_scalar costs 0 in body
>>> _7 1 times scalar_store costs 1 in body
>>> -_7 1 times vec_to_scalar costs 2 in body
>>> +_7 1 times vec_to_scalar costs 0 in body
>>> _7 1 times scalar_store costs 1 in body
>>> -_7 1 times vec_to_scalar costs 2 in body
>>> +_7 1 times vec_to_scalar costs 0 in body
>>> _7 1 times scalar_store costs 1 in body
>>> 
>>> Although the aarch64_use_new_vector_costs_p flag was used in multiple
>>> places in aarch64.cc, the location that causes this behavior is this one:
>>> 
>>> unsigned
>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>                                      stmt_vec_info stmt_info, slp_tree,
>>>                                      tree vectype, int misalign,
>>>                                      vect_cost_model_location where)
>>> {
>>>   [...]
>>>   /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>      of just looking at KIND.  */
>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>> +  if (stmt_info)
>>>     {
>>>       /* If we scalarize a strided store, the vectorizer costs one
>>>          vec_to_scalar for each element.  However, we can store the first
>>>          element using an FP store without a separate extract step.  */
>>>       if (vect_is_store_elt_extraction (kind, stmt_info))
>>>         count -= 1;
>>> 
>>>       stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>                                                       stmt_info, stmt_cost);
>>> 
>>>       if (vectype && m_vec_flags)
>>>         stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>                                                         stmt_info, vectype,
>>>                                                         where, stmt_cost);
>>>     }
>>>   [...]
>>>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>>> }
>>> 
>>> Previously, for mtune=generic, this function returned a cost of 2 for a
>>> vec_to_scalar operation in the vect body. Now "if (stmt_info)" is entered
>>> and "if (vect_is_store_elt_extraction (kind, stmt_info))" evaluates to
>>> true, which sets the count to 0 and leads to a return value of 0.
>> 
>> At the time the code was written, a scalarised store would be costed
>> using one vec_to_scalar call into the backend, with the count parameter
>> set to the number of elements being stored.
The "count -= 1" was >> supposed to lop off the leading element extraction, since we can store >> lane 0 as a normal FP store. >> >> The target-independent costing was later reworked so that it costs >> each operation individually: >> >> for (i = 0; i < nstores; i++) >> { >> if (costing_p) >> { >> /* Only need vector extracting when there are more >> than one stores. */ >> if (nstores > 1) >> inside_cost >> += record_stmt_cost (cost_vec, 1, vec_to_scalar, >> stmt_info, 0, vect_body); >> /* Take a single lane vector type store as scalar >> store to avoid ICE like 110776. */ >> if (VECTOR_TYPE_P (ltype) >> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >> n_adjacent_stores++; >> else >> inside_cost >> += record_stmt_cost (cost_vec, 1, scalar_store, >> stmt_info, 0, vect_body); >> continue; >> } >> >> Unfortunately, there's no easy way of telling whether a particular call >> is part of a group, and if so, which member of the group it is. >> >> I suppose we could give up on the attempt to be (somewhat) accurate >> and just disable the optimisation. Or we could restrict it to count > 1, >> since it might still be useful for gathers and scatters. >> >> cc:ing Richard in case he has any thoughts. > > There's the n_adjacent_stores thing though which I think could be > extended to also cover the vec_to_scalar part (duplicate it to > the else case shown above and the > > if (costing_p) > { > if (n_adjacent_stores > 0) > vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores, > alignment_support_scheme, misalignment, > &inside_cost, cost_vec); > > case after the processing loop. I looked into using n_adjacent_stores for this. Currently, the use of n_adjacent_stores seems to be connected to the value of inside_cost. But is that really what is off here? I was not able to use it to make changes to how the calls to aarch64_vector_costs:add_stmt_cost proceed. They still return 0 for vec_to_scalar in the vect_body. Can you elaborate on how you envisioned the use of n_adjacent_stores in this case? Thanks, Jennifer > > Richard.