On Tue, 22 Oct 2024, Jennifer Schmitz wrote:

> > On 18 Oct 2024, at 09:11, Richard Biener <rguent...@suse.de> wrote:
> >
> > External email: Use caution opening links or attachments
> >
> > On Thu, 17 Oct 2024, Richard Sandiford wrote:
> >
> >> Jennifer Schmitz <jschm...@nvidia.com> writes:
> >>> [...]
> >>> Looking at the diff of the vect dumps (below is a section of the diff
> >>> for strided_store_2.c), it seemed odd that vec_to_scalar operations
> >>> cost 0 now, instead of the previous cost of 2:
> >>>
> >>> +strided_store_1.c:38:151: note: === vectorizable_operation ===
> >>> +strided_store_1.c:38:151: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
> >>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7;
> >>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
> >>> +strided_store_1.c:38:151: note: Vectorizing an unaligned access.
> >>> +Applying pattern match.pd:236, generic-match-9.cc:4128
> >>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
> >>> +strided_store_1.c:38:151: note: vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
> >>>  *_2 1 times unaligned_load (misalign -1) costs 1 in body
> >>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
> >>>  _3 + 1.0e+0 1 times vector_stmt costs 1 in body
> >>> -_7 1 times vec_to_scalar costs 2 in body
> >>> +<unknown> 1 times vector_load costs 1 in prologue
> >>> +_7 1 times vec_to_scalar costs 0 in body
> >>>  _7 1 times scalar_store costs 1 in body
> >>> -_7 1 times vec_to_scalar costs 2 in body
> >>> +_7 1 times vec_to_scalar costs 0 in body
> >>>  _7 1 times scalar_store costs 1 in body
> >>> -_7 1 times vec_to_scalar costs 2 in body
> >>> +_7 1 times vec_to_scalar costs 0 in body
> >>>  _7 1 times scalar_store costs 1 in body
> >>> -_7 1 times vec_to_scalar costs 2 in body
> >>> +_7 1 times vec_to_scalar costs 0 in body
> >>>  _7 1 times scalar_store costs 1 in body
> >>>
> >>> Although the aarch64_use_new_vector_costs_p flag was used in multiple
> >>> places in aarch64.cc, the location that causes this behavior is this
> >>> one:
> >>>
> >>> unsigned
> >>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >>>                                      stmt_vec_info stmt_info, slp_tree,
> >>>                                      tree vectype, int misalign,
> >>>                                      vect_cost_model_location where)
> >>> {
> >>> [...]
> >>>   /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>      of just looking at KIND.  */
> >>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>> +  if (stmt_info)
> >>>     {
> >>>       /* If we scalarize a strided store, the vectorizer costs one
> >>>          vec_to_scalar for each element.  However, we can store the
> >>>          first element using an FP store without a separate extract
> >>>          step.  */
> >>>       if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>         count -= 1;
> >>>
> >>>       stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >>>                                                       stmt_info,
> >>>                                                       stmt_cost);
> >>>
> >>>       if (vectype && m_vec_flags)
> >>>         stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
> >>>                                                         stmt_info,
> >>>                                                         vectype, where,
> >>>                                                         stmt_cost);
> >>>     }
> >>> [...]
> >>>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
> >>> }
> >>>
> >>> Previously, for mtune=generic, this function returned a cost of 2 for
> >>> a vec_to_scalar operation in the vect body.  Now "if (stmt_info)" is
> >>> entered and "if (vect_is_store_elt_extraction (kind, stmt_info))"
> >>> evaluates to true, which sets the count to 0 and leads to a return
> >>> value of 0.
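To make the failure mode concrete: each extraction now reaches add_stmt_cost
in its own call with count == 1, so "count -= 1" cancels the whole cost
instead of just the lane-0 extract.  A minimal standalone model of that
arithmetic, in plain C++ rather than the GCC types (the fractional-cost
ceil () handling is elided):

  /* Simplified stand-in for aarch64_vector_costs::add_stmt_cost, keeping
     only the count adjustment that matters here.  */
  unsigned
  model_stmt_cost (int count, bool is_store_elt_extraction,
                   unsigned stmt_cost)
  {
    /* Intended to drop only the lane-0 extract, which is free because
       lane 0 can be stored with a normal FP store.  */
    if (is_store_elt_extraction)
      count -= 1;
    return count * stmt_cost;
  }

  /* Old scheme, one call for a group of four elements:
       model_stmt_cost (4, true, 2) == 6   -- three extracts costed.
     Reworked scheme, one call per element:
       model_stmt_cost (1, true, 2) == 0   -- every extract becomes free.  */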
> >>
> >> At the time the code was written, a scalarised store would be costed
> >> using one vec_to_scalar call into the backend, with the count parameter
> >> set to the number of elements being stored.  The "count -= 1" was
> >> supposed to lop off the leading element extraction, since we can store
> >> lane 0 as a normal FP store.
> >>
> >> The target-independent costing was later reworked so that it costs
> >> each operation individually:
> >>
> >>   for (i = 0; i < nstores; i++)
> >>     {
> >>       if (costing_p)
> >>         {
> >>           /* Only need vector extracting when there are more
> >>              than one stores.  */
> >>           if (nstores > 1)
> >>             inside_cost
> >>               += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> >>                                    stmt_info, 0, vect_body);
> >>           /* Take a single lane vector type store as scalar
> >>              store to avoid ICE like 110776.  */
> >>           if (VECTOR_TYPE_P (ltype)
> >>               && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> >>             n_adjacent_stores++;
> >>           else
> >>             inside_cost
> >>               += record_stmt_cost (cost_vec, 1, scalar_store,
> >>                                    stmt_info, 0, vect_body);
> >>           continue;
> >>         }
> >>
> >> Unfortunately, there's no easy way of telling whether a particular call
> >> is part of a group, and if so, which member of the group it is.
> >>
> >> I suppose we could give up on the attempt to be (somewhat) accurate
> >> and just disable the optimisation.  Or we could restrict it to
> >> count > 1, since it might still be useful for gathers and scatters.
> >>
> >> cc:ing Richard in case he has any thoughts.
> >
> > There's the n_adjacent_stores thing though which I think could be
> > extended to also cover the vec_to_scalar part (duplicate it to
> > the else case shown above and the
> >
> >   if (costing_p)
> >     {
> >       if (n_adjacent_stores > 0)
> >         vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
> >                              alignment_support_scheme, misalignment,
> >                              &inside_cost, cost_vec);
> >
> > case after the processing loop).
>
> I looked into using n_adjacent_stores for this.  Currently, the use of
> n_adjacent_stores seems to be connected to the value of inside_cost.
> But is that really what is off here?  I was not able to use it to change
> how the calls to aarch64_vector_costs::add_stmt_cost proceed; they still
> return 0 for vec_to_scalar in the vect_body.
> Can you elaborate on how you envisioned the use of n_adjacent_stores in
> this case?
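For comparison, the "restrict it to count > 1" fallback Richard Sandiford
mentions above would amount to a one-line guard on the aarch64 side; a
sketch only, not a tested patch:

      /* Only drop the lane-0 extract when the whole group arrives in a
         single call; leave per-element calls (count == 1) alone.  Grouped
         calls with count > 1, e.g. for gathers and scatters, still
         benefit from the adjustment.  */
      if (vect_is_store_elt_extraction (kind, stmt_info) && count > 1)
        count -= 1;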
I was suggesting to change vectorizable_store so that 'count' is again > 1,
like it is for the actual store.

Richard.
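A sketch of that direction, assuming a new counter (called n_extracts here,
a hypothetical name) is accumulated and costed the same way n_adjacent_stores
batches the stores:

  for (i = 0; i < nstores; i++)
    {
      if (costing_p)
        {
          /* Count the extractions instead of recording each one
             individually, mirroring n_adjacent_stores.  */
          if (nstores > 1)
            n_extracts++;
          [...]
          continue;
        }
      [...]
    }

  if (costing_p)
    {
      /* One grouped call after the processing loop: the backend then
         sees count == n_extracts > 1, so its "count -= 1" adjustment
         again removes only the lane-0 extract.  */
      if (n_extracts > 0)
        inside_cost += record_stmt_cost (cost_vec, n_extracts, vec_to_scalar,
                                         stmt_info, 0, vect_body);
      [...]
    }

This would mirror how the stores themselves are already costed through the
vect_get_store_cost call after the loop, quoted earlier in the thread.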