On Thu, 17 Oct 2024, Richard Sandiford wrote:

> Jennifer Schmitz <jschm...@nvidia.com> writes:
> > [...]
> > Looking at the diff of the vect dumps (below is a section of the diff for 
> > strided_store_2.c), it seemed odd that vec_to_scalar operations cost 0 now, 
> > instead of the previous cost of 2:
> >
> > +strided_store_1.c:38:151: note:    === vectorizable_operation ===
> > +strided_store_1.c:38:151: note:    vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
> > +strided_store_1.c:38:151: note:   ==> examining statement: *_6 = _7;
> > +strided_store_1.c:38:151: note:   vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
> > +strided_store_1.c:38:151: note:   Vectorizing an unaligned access.
> > +Applying pattern match.pd:236, generic-match-9.cc:4128
> > +Applying pattern match.pd:5285, generic-match-10.cc:4234
> > +strided_store_1.c:38:151: note:   vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
> >  *_2 1 times unaligned_load (misalign -1) costs 1 in body
> > -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
> >  _3 + 1.0e+0 1 times vector_stmt costs 1 in body
> > -_7 1 times vec_to_scalar costs 2 in body
> > +<unknown> 1 times vector_load costs 1 in prologue
> > +_7 1 times vec_to_scalar costs 0 in body
> >  _7 1 times scalar_store costs 1 in body
> > -_7 1 times vec_to_scalar costs 2 in body
> > +_7 1 times vec_to_scalar costs 0 in body
> >  _7 1 times scalar_store costs 1 in body
> > -_7 1 times vec_to_scalar costs 2 in body
> > +_7 1 times vec_to_scalar costs 0 in body
> >  _7 1 times scalar_store costs 1 in body
> > -_7 1 times vec_to_scalar costs 2 in body
> > +_7 1 times vec_to_scalar costs 0 in body
> >  _7 1 times scalar_store costs 1 in body
> >
> > Although the aarch64_use_new_vector_costs_p flag was used in multiple 
> > places in aarch64.cc, the location that causes this behavior is this one:
> > unsigned
> > aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >                                      stmt_vec_info stmt_info, slp_tree,
> >                                      tree vectype, int misalign,
> >                                      vect_cost_model_location where)
> > {
> >   [...]
> >   /* Try to get a more accurate cost by looking at STMT_INFO instead
> >      of just looking at KIND.  */
> > -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> > +  if (stmt_info)
> >     {
> >       /* If we scalarize a strided store, the vectorizer costs one
> >          vec_to_scalar for each element.  However, we can store the first
> >          element using an FP store without a separate extract step.  */
> >       if (vect_is_store_elt_extraction (kind, stmt_info))
> >         count -= 1;
> >
> >       stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >                                                       stmt_info, stmt_cost);
> >
> >       if (vectype && m_vec_flags)
> >         stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
> >                                                         stmt_info, vectype,
> >                                                         where, stmt_cost);
> >     }
> >   [...]
> >   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
> > }
> >
> > Previously, for mtune=generic, this function returned a cost of 2 for a 
> > vec_to_scalar operation in the vect body. Now "if (stmt_info)" is entered 
> > and "if (vect_is_store_elt_extraction (kind, stmt_info))" evaluates to 
> > true, which sets the count to 0 and leads to a return value of 0.
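> >
> > For illustration, a minimal walkthrough of that path, assuming count = 1
> > on entry and stmt_cost = 2 (the previous generic vec_to_scalar cost):
> >
> >   count -= 1;                      /* count: 1 -> 0  */
> >   [...]
> >   return record_stmt_cost (stmt_info, where,
> >                            (count * stmt_cost).ceil ());
> >                                    /* (0 * 2).ceil () == 0  */
> >
> > which matches the "vec_to_scalar costs 0" lines in the dump above.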
> 
> At the time the code was written, a scalarised store would be costed
> using one vec_to_scalar call into the backend, with the count parameter
> set to the number of elements being stored.  The "count -= 1" was
> supposed to lop off the leading element extraction, since we can store
> lane 0 as a normal FP store.
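> 
> As a rough sketch of the shape (hypothetical call, not the exact
> vectorizer code of the time), the backend used to see something like:
> 
>   /* One vec_to_scalar call covering the whole scalarised store group.  */
>   record_stmt_cost (cost_vec, nstores, vec_to_scalar, stmt_info, 0,
>                     vect_body);
> 
> so a single "count -= 1" there removed exactly the lane-0 extraction.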
> 
> The target-independent costing was later reworked so that it costs
> each operation individually:
> 
>             for (i = 0; i < nstores; i++)
>               {
>                 if (costing_p)
>                   {
>                     /* Only need vector extracting when there are more
>                        than one stores.  */
>                     if (nstores > 1)
>                       inside_cost
>                         += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>                                              stmt_info, 0, vect_body);
>                     /* Take a single lane vector type store as scalar
>                        store to avoid ICE like 110776.  */
>                     if (VECTOR_TYPE_P (ltype)
>                         && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>                       n_adjacent_stores++;
>                     else
>                       inside_cost
>                         += record_stmt_cost (cost_vec, 1, scalar_store,
>                                              stmt_info, 0, vect_body);
>                     continue;
>                   }
> 
> Unfortunately, there's no easy way of telling whether a particular call
> is part of a group, and if so, which member of the group it is.
> 
> I suppose we could give up on the attempt to be (somewhat) accurate
> and just disable the optimisation.  Or we could restrict it to count > 1,
> since it might still be useful for gathers and scatters.
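> 
> For concreteness, the count > 1 variant would just be an extra guard in
> aarch64_vector_costs::add_stmt_cost (untested sketch):
> 
>   /* Only assume a free lane-0 extraction when this call covers the
>      whole group, as it still does for gathers and scatters.  */
>   if (count > 1 && vect_is_store_elt_extraction (kind, stmt_info))
>     count -= 1;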
> 
> cc:ing Richard in case he has any thoughts.

There's the n_adjacent_stores thing though, which I think could be
extended to also cover the vec_to_scalar part (duplicate it to the
else case shown above and to the

      if (costing_p)
        {
          if (n_adjacent_stores > 0)
            vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
                                 alignment_support_scheme, misalignment,
                                 &inside_cost, cost_vec);

case after the processing loop).
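
Completely untested sketch of that shape (n_adjacent_extracts is a
made-up counter, not existing code): in the loop, instead of costing
each extract individually,

                    /* Batch the extracts like the stores.  */
                    if (nstores > 1)
                      n_adjacent_extracts++;

and then hand the whole group to the target in one call in the block
after the loop:

      if (costing_p)
        {
          if (n_adjacent_extracts > 0)
            inside_cost
              += record_stmt_cost (cost_vec, n_adjacent_extracts,
                                   vec_to_scalar, stmt_info, 0, vect_body);
          if (n_adjacent_stores > 0)
            vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
                                 alignment_support_scheme, misalignment,
                                 &inside_cost, cost_vec);

That way the aarch64 hook would again see count equal to the group size
and its "count -= 1" heuristic could keep working.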

Richard.
