On Mon, Sep 18, 2023 at 10:41 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Kewen Lin <li...@linux.ibm.com> writes:
> > This costing adjustment patch series exposes one issue in the
> > aarch64-specific costing adjustment for STP sequences.  It
> > causes the below test cases to fail:
> >
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
> >
> > Take the below function extracted from ldp_stp_15.c as an
> > example:
> >
> > void
> > dup_8_int32_t (int32_t *x, int32_t val)
> > {
> >   for (int i = 0; i < 8; ++i)
> >     x[i] = val;
> > }
> >
> > Without my patch series, during slp1 it gets:
> >
> >   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
> >   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
> >
> > and the final vector cost is 3.
> >
> > With my patch series, during slp1 it gets:
> >
> >   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
> >   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
> >   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
> >
> > but the final vector cost is 17.  The unaligned_store count is
> > actually unchanged, but the final vector costs differ because the
> > below aarch64-specific handling computes them differently:
> >
> >   /* Apply the heuristic described above m_stp_sequence_cost.  */
> >   if (m_stp_sequence_cost != ~0U)
> >     {
> >       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
> >                                                  stmt_info, vectype);
> >       m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
> >     }
> >
> > For the former, since the count is 2, function
> > aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> > For the latter, it's split into two calls with count 1, and
> > aarch64_stp_sequence_cost returns 2 each time, so it returns
> > 4 in total.
> >
> > For this case, the stmt with scalar_to_vec also contributes
> > 4 to m_stp_sequence_cost, so the final m_stp_sequence_cost
> > is 6 (2+4) vs. 8 (4+4).
> >
> > Considering that scalar_costs->m_stp_sequence_cost is 8 and the
> > below checking and re-assigning:
> >
> >   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
> >     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
> >
> > the vector body cost is unchanged for the former, but for the
> > latter it becomes double the scalar cost, which is 8 here, so the
> > body cost becomes 16, bigger than what we expect.
> >
> > I'm not sure why CEIL is adopted for the return value of the
> > unaligned_store case in function aarch64_stp_sequence_cost, but I
> > tried changing it to "return count;" (which gets back the previous
> > cost), and no failures were exposed in regression testing.  I
> > expected that if the previous unaligned_store count is even, this
> > adjustment changes nothing; if it's odd, the adjustment may reduce
> > it by one, but I'd guess such cases are few.  Besides, as the
> > comments above m_stp_sequence_cost note, the current handling seems
> > temporary, so maybe a tweak like this can be accepted.  I'm posting
> > this RFC/PATCH to request comments on whether this one-line change
> > can be considered.
>
> It's unfortunate that doing this didn't show up as a regression.
> I guess it's not a change we explicitly added tests to guard against.
>
> But the point of the condition is to estimate how many single stores
> (STRs) and how many paired stores (STPs) would be generated.
> As far as this heuristic goes, STP (storing two values) is as cheap
> as STR (storing only one value).  So the point of the CEIL is to
> count 1 store as having equal cost to 2, 3 as having equal cost to 4,
> etc.
>
> For a heuristic like that, costing a vector stmt once with count 2
> is different from costing 2 vector stmts with count 1.  The former
> makes it obvious that the 2 vector stmts are associated with the
> same scalar stmt, and are highly likely to be consecutive.  The
> latter (costing 2 stmts with count 1) could also happen for
> unrelated stmts.
>
> ISTM that costing once with count N provides strictly more
> information to targets than costing N times with count 1.  Is there
> no way we can keep the current behaviour?  E.g. rather than costing
> a stmt immediately within a loop, could we just increment a counter
> and cost once at the end?
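
[For concreteness, a minimal standalone sketch of the arithmetic behind
the quoted heuristic; the helper below only mirrors the unaligned_store
arm of aarch64_stp_sequence_cost for illustration and is not the GCC
code itself:]

  #include <cstdio>

  /* Stand-in for the unaligned_store case of aarch64_stp_sequence_cost:
     one STP covers two stores, so 1 store is costed like 2, 3 like 4.  */
  static unsigned int
  stp_store_cost (unsigned int count)
  {
    return (count + 1) / 2 * 2;  /* CEIL (count, 2) * 2 */
  }

  int
  main ()
  {
    /* One call with count == 2: the two stores are known to belong
       together, so they are costed as a single STP-sized unit.  */
    unsigned int once = stp_store_cost (2);                        /* 2 */

    /* Two calls with count == 1: each lone store rounds up to a full
       pair, so the total doubles.  */
    unsigned int split = stp_store_cost (1) + stp_store_cost (1);  /* 4 */

    std::printf ("once = %u, split = %u\n", once, split);
    return 0;
  }

[In the ldp_stp_15.c example above, this 2 vs. 4 difference, plus the 4
contributed by scalar_to_vec, is exactly what moves m_stp_sequence_cost
from 6 to 8 and trips the comparison against the scalar cost of 8.]
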
I suppose we can.  But isn't the logic currently (or before the series)
cheated for variable-strided stores with ncopies > 1?  That is, while it
sounds like a reasonable heuristic, you can't really rely on it, as the
vectorizer doesn't currently provide the information whether two vector
loads/stores are adjacent.  Making sure we only pass count > 1 for
adjacent loads/stores would be possible though.  We should document that
with comments then.

Richard.

> Thanks,
> Richard
>
> > gcc/ChangeLog:
> >
> >     * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
> >     count directly instead of the adjusted value computed with CEIL.
> > ---
> >  gcc/config/aarch64/aarch64.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 37d414021ca..9fb4fbd883d 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
> >        if (!aarch64_aligned_constant_offset_p (stmt_info, size))
> >          return count * 2;
> >      }
> > -      return CEIL (count, 2) * 2;
> > +      return count;
> >
> >      case scalar_store:
> >        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))
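
[To make the "increment a counter and cost once at the end" idea from
Richard S.'s mail concrete, a rough standalone sketch of the bookkeeping
might look like the following; the map keyed on a statement id and the
names deferred_stores / target_store_cost are hypothetical illustrations,
not existing vectorizer code:]

  #include <cstdio>
  #include <map>

  /* Stand-in for the target cost hook's store handling
     (e.g. the aarch64 STP heuristic): one STP covers two stores.  */
  static unsigned int
  target_store_cost (unsigned int count)
  {
    return (count + 1) / 2 * 2;
  }

  int
  main ()
  {
    /* During SLP costing, only count the vector copies per scalar
       stmt (keyed here by a hypothetical stmt id) instead of costing
       each copy immediately with count 1.  */
    std::map<int, unsigned int> deferred_stores;
    deferred_stores[42] += 1;   /* first vector copy of scalar stmt 42 */
    deferred_stores[42] += 1;   /* second vector copy of the same stmt */

    /* At the end, issue one cost call per scalar stmt with the full
       count, so the target still sees that the copies belong together.  */
    unsigned int total = 0;
    for (const auto &p : deferred_stores)
      total += target_store_cost (p.second);

    std::printf ("total = %u\n", total);  /* 2, matching the pre-series cost */
    return 0;
  }

[Such a scheme only stays honest if, as noted above, count > 1 is passed
solely for loads/stores that are actually adjacent, which would need to
be documented with comments.]
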