> On 18 Oct 2024, at 09:11, Richard Biener <rguent...@suse.de> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> On Thu, 17 Oct 2024, Richard Sandiford wrote:
> 
>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> [...]
>>> Looking at the diff of the vect dumps (below is a section of the diff
>>> for strided_store_2.c), it seemed odd that vec_to_scalar operations
>>> cost 0 now, instead of the previous cost of 2:
>>> 
>>> +strided_store_1.c:38:151: note: === vectorizable_operation ===
>>> +strided_store_1.c:38:151: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7;
>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned access.
>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
>>> +strided_store_1.c:38:151: note: vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
>>> -_7 1 times vec_to_scalar costs 2 in body
>>> +<unknown> 1 times vector_load costs 1 in prologue
>>> +_7 1 times vec_to_scalar costs 0 in body
>>> _7 1 times scalar_store costs 1 in body
>>> -_7 1 times vec_to_scalar costs 2 in body
>>> +_7 1 times vec_to_scalar costs 0 in body
>>> _7 1 times scalar_store costs 1 in body
>>> -_7 1 times vec_to_scalar costs 2 in body
>>> +_7 1 times vec_to_scalar costs 0 in body
>>> _7 1 times scalar_store costs 1 in body
>>> -_7 1 times vec_to_scalar costs 2 in body
>>> +_7 1 times vec_to_scalar costs 0 in body
>>> _7 1 times scalar_store costs 1 in body
>>> 
>>> Although the aarch64_use_new_vector_costs_p flag was used in multiple
>>> places in aarch64.cc, the location that causes this behavior is this one:
>>> 
>>> unsigned
>>> aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>                                      stmt_vec_info stmt_info, slp_tree,
>>>                                      tree vectype, int misalign,
>>>                                      vect_cost_model_location where)
>>> {
>>>   [...]
>>>   /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>      of just looking at KIND.  */
>>> -  if (stmt_info && aarch64_use_new_vector_costs_p ())
>>> +  if (stmt_info)
>>>     {
>>>       /* If we scalarize a strided store, the vectorizer costs one
>>>          vec_to_scalar for each element.  However, we can store the first
>>>          element using an FP store without a separate extract step.  */
>>>       if (vect_is_store_elt_extraction (kind, stmt_info))
>>>         count -= 1;
>>> 
>>>       stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>                                                       stmt_info, stmt_cost);
>>> 
>>>       if (vectype && m_vec_flags)
>>>         stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>                                                         stmt_info, vectype,
>>>                                                         where, stmt_cost);
>>>     }
>>>   [...]
>>>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>>> }
>>> 
>>> Previously, for mtune=generic, this function returned a cost of 2 for a
>>> vec_to_scalar operation in the vect body. Now "if (stmt_info)" is entered
>>> and "if (vect_is_store_elt_extraction (kind, stmt_info))" evaluates to
>>> true, which sets the count to 0 and leads to a return value of 0.
>> 
>> At the time the code was written, a scalarised store would be costed
>> using one vec_to_scalar call into the backend, with the count parameter
>> set to the number of elements being stored.
The "count -= 1" was >> supposed to lop off the leading element extraction, since we can store >> lane 0 as a normal FP store. >> >> The target-independent costing was later reworked so that it costs >> each operation individually: >> >> for (i = 0; i < nstores; i++) >> { >> if (costing_p) >> { >> /* Only need vector extracting when there are more >> than one stores. */ >> if (nstores > 1) >> inside_cost >> += record_stmt_cost (cost_vec, 1, vec_to_scalar, >> stmt_info, 0, vect_body); >> /* Take a single lane vector type store as scalar >> store to avoid ICE like 110776. */ >> if (VECTOR_TYPE_P (ltype) >> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >> n_adjacent_stores++; >> else >> inside_cost >> += record_stmt_cost (cost_vec, 1, scalar_store, >> stmt_info, 0, vect_body); >> continue; >> } >> >> Unfortunately, there's no easy way of telling whether a particular call >> is part of a group, and if so, which member of the group it is. >> >> I suppose we could give up on the attempt to be (somewhat) accurate >> and just disable the optimisation. Or we could restrict it to count > 1, >> since it might still be useful for gathers and scatters. >> >> cc:ing Richard in case he has any thoughts. > > There's the n_adjacent_stores thing though which I think could be > extended to also cover the vec_to_scalar part (duplicate it to > the else case shown above and the > > if (costing_p) > { > if (n_adjacent_stores > 0) > vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores, > alignment_support_scheme, misalignment, > &inside_cost, cost_vec); > > case after the processing loop. I looked into using n_adjacent_stores for this. Currently, the use of n_adjacent_stores seems to be connected to the value of inside_cost. But is that really what is off here? I was not able to use it to make changes to how the calls to aarch64_vector_costs:add_stmt_cost proceed. They still return 0 for vec_to_scalar in the vect_body. Can you elaborate on how you envisioned the use of n_adjacent_stores in this case? Thanks, Jennifer > > Richard.