On Thu, 17 Oct 2024, Richard Sandiford wrote:

> Jennifer Schmitz <jschm...@nvidia.com> writes:
> > [...]
> > Looking at the diff of the vect dumps (below is a section of the diff for
> > strided_store_2.c), it seemed odd that vec_to_scalar operations cost 0 now,
> > instead of the previous cost of 2:
> >
> > +strided_store_1.c:38:151: note:  === vectorizable_operation ===
> > +strided_store_1.c:38:151: note:  vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
> > +strided_store_1.c:38:151: note:  ==> examining statement: *_6 = _7;
> > +strided_store_1.c:38:151: note:  vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
> > +strided_store_1.c:38:151: note:  Vectorizing an unaligned access.
> > +Applying pattern match.pd:236, generic-match-9.cc:4128
> > +Applying pattern match.pd:5285, generic-match-10.cc:4234
> > +strided_store_1.c:38:151: note:  vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
> > *_2 1 times unaligned_load (misalign -1) costs 1 in body
> > -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
> > _3 + 1.0e+0 1 times vector_stmt costs 1 in body
> > -_7 1 times vec_to_scalar costs 2 in body
> > +<unknown> 1 times vector_load costs 1 in prologue
> > +_7 1 times vec_to_scalar costs 0 in body
> > _7 1 times scalar_store costs 1 in body
> > -_7 1 times vec_to_scalar costs 2 in body
> > +_7 1 times vec_to_scalar costs 0 in body
> > _7 1 times scalar_store costs 1 in body
> > -_7 1 times vec_to_scalar costs 2 in body
> > +_7 1 times vec_to_scalar costs 0 in body
> > _7 1 times scalar_store costs 1 in body
> > -_7 1 times vec_to_scalar costs 2 in body
> > +_7 1 times vec_to_scalar costs 0 in body
> > _7 1 times scalar_store costs 1 in body
> >
> > Although the aarch64_use_new_vector_costs_p flag was used in multiple
> > places in aarch64.cc, the location that causes this behavior is this one:
> >
> > unsigned
> > aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >                                      stmt_vec_info stmt_info, slp_tree,
> >                                      tree vectype, int misalign,
> >                                      vect_cost_model_location where)
> > {
> >   [...]
> >   /* Try to get a more accurate cost by looking at STMT_INFO instead
> >      of just looking at KIND.  */
> > -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> > +  if (stmt_info)
> >     {
> >       /* If we scalarize a strided store, the vectorizer costs one
> >          vec_to_scalar for each element.  However, we can store the first
> >          element using an FP store without a separate extract step.  */
> >       if (vect_is_store_elt_extraction (kind, stmt_info))
> >         count -= 1;
> >
> >       stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >                                                       stmt_info, stmt_cost);
> >
> >       if (vectype && m_vec_flags)
> >         stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
> >                                                         stmt_info, vectype,
> >                                                         where, stmt_cost);
> >     }
> >   [...]
> >   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
> > }
> >
> > Previously, for mtune=generic, this function returned a cost of 2 for a
> > vec_to_scalar operation in the vect body.  Now "if (stmt_info)" is entered
> > and "if (vect_is_store_elt_extraction (kind, stmt_info))" evaluates to
> > true, which sets the count to 0 and leads to a return value of 0.
>
> At the time the code was written, a scalarised store would be costed
> using one vec_to_scalar call into the backend, with the count parameter
> set to the number of elements being stored.  The "count -= 1" was
> supposed to lop off the leading element extraction, since we can store
> lane 0 as a normal FP store.
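
(For reference, IIRC vect_is_store_elt_extraction in aarch64.cc is just the
following check, so with the per-element costing each extraction now arrives
as its own call with count == 1 and gets zeroed out -- quoting from memory,
the exact spelling might differ:

static bool
vect_is_store_elt_extraction (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
{
  return (kind == vec_to_scalar
          && STMT_VINFO_DATA_REF (stmt_info)
          && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
}
)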

> The target-independent costing was later reworked so that it costs
> each operation individually:
>
>           for (i = 0; i < nstores; i++)
>             {
>               if (costing_p)
>                 {
>                   /* Only need vector extracting when there are more
>                      than one stores.  */
>                   if (nstores > 1)
>                     inside_cost
>                       += record_stmt_cost (cost_vec, 1, vec_to_scalar,
>                                            stmt_info, 0, vect_body);
>                   /* Take a single lane vector type store as scalar
>                      store to avoid ICE like 110776.  */
>                   if (VECTOR_TYPE_P (ltype)
>                       && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
>                     n_adjacent_stores++;
>                   else
>                     inside_cost
>                       += record_stmt_cost (cost_vec, 1, scalar_store,
>                                            stmt_info, 0, vect_body);
>                   continue;
>                 }
>
> Unfortunately, there's no easy way of telling whether a particular call
> is part of a group, and if so, which member of the group it is.
>
> I suppose we could give up on the attempt to be (somewhat) accurate
> and just disable the optimisation.  Or we could restrict it to count > 1,
> since it might still be useful for gathers and scatters.
>
> cc:ing Richard in case he has any thoughts.
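
FWIW, the count > 1 restriction would presumably be a one-liner on top of
the hunk quoted above (untested):

      /* Only drop the leading extraction when the vectorizer costs the
         whole group of extractions in a single call.  */
      if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1)
        count -= 1;

which would keep the adjustment where the extractions are still costed as a
group (e.g. for gathers/scatters, if they still do that) and drop it for the
per-element calls above.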

There's the n_adjacent_stores mechanism though, which I think could be
extended to also cover the vec_to_scalar part: duplicate that counting for
the else case shown above and for the

      if (costing_p)
        {
          if (n_adjacent_stores > 0)
            vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
                                 alignment_support_scheme, misalignment,
                                 &inside_cost, cost_vec);

case after the processing loop.
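
Something along these lines (a completely untested sketch; n_adjacent_extracts
is just a placeholder name):

              if (costing_p)
                {
                  /* Instead of costing the extraction immediately, count it,
                     so that the backend sees the whole group in one call.  */
                  if (nstores > 1)
                    n_adjacent_extracts++;
                  [...]
                  continue;
                }
        [...]
      if (costing_p)
        {
          if (n_adjacent_stores > 0)
            vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
                                 alignment_support_scheme, misalignment,
                                 &inside_cost, cost_vec);
          if (n_adjacent_extracts > 0)
            inside_cost
              += record_stmt_cost (cost_vec, n_adjacent_extracts,
                                   vec_to_scalar, stmt_info, 0, vect_body);
        }

That way add_stmt_cost would again see a single vec_to_scalar call with
count equal to the number of extractions and could apply its count -= 1
adjustment as originally intended.

Richard.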