On Mon, Sep 18, 2023 at 10:41 AM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Kewen Lin <li...@linux.ibm.com> writes:
> > This costing adjustment patch series exposes one issue in
> > aarch64 specific costing adjustment for STP sequence.  It
> > causes the below test cases to fail:
> >
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
> >   - gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
> >
> > Take the below function extracted from ldp_stp_15.c as
> > example:
> >
> > void
> > dup_8_int32_t (int32_t *x, int32_t val)
> > {
> >     for (int i = 0; i < 8; ++i)
> >           x[i] = val;
> > }
> >
> > Without my patch series, during slp1 it gets:
> >
> >   val_8(D) 2 times unaligned_store (misalign -1) costs 2 in body
> >   node 0x10008c85e38 1 times scalar_to_vec costs 1 in prologue
> >
> > then the final vector cost is 3.
> >
> > With my patch series, during slp1 it gets:
> >
> >   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
> >   val_8(D) 1 times unaligned_store (misalign -1) costs 1 in body
> >   node 0x10004cc5d88 1 times scalar_to_vec costs 1 in prologue
> >
> > but the final vector cost is 17.  The unaligned_store count is
> > actually unchanged, but the final vector costs become different,
> > it's because the below aarch64 special handling makes the
> > different costs:
> >
> >   /* Apply the heuristic described above m_stp_sequence_cost.  */
> >   if (m_stp_sequence_cost != ~0U)
> >     {
> >       uint64_t cost = aarch64_stp_sequence_cost (count, kind,
> >                                                stmt_info, vectype);
> >       m_stp_sequence_cost = MIN (m._stp_sequence_cost + cost, ~0U);
> >     }
> >
> > For the former, since the count is 2, function
> > aarch64_stp_sequence_cost returns 2 as "CEIL (count, 2) * 2".
> > While for the latter, it's separated into twice calls with
> > count 1, aarch64_stp_sequence_cost returns 2 for each time,
> > so it returns 4 in total.
> >
> > For this case, the stmt with scalar_to_vec also contributes
> > 4 to m_stp_sequence_cost, then the final m_stp_sequence_cost
> > are 6 (2+4) vs. 8 (4+4).
> >
> > Considering scalar_costs->m_stp_sequence_cost is 8 and below
> > checking and re-assigning:
> >
> >   else if (m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
> >     m_costs[vect_body] = 2 * scalar_costs->total_cost ();
> >
> > For the former, the body cost of vector isn't changed; but
> > for the latter, the body cost of vector is double of scalar
> > cost which is 8 for this case, then it becomes 16 which is
> > bigger than what we expect.
> >
> > I'm not sure why it adopts CEIL for the return value for
> > case unaligned_store in function aarch64_stp_sequence_cost,
> > but I tried to modify it with "return count;" (as it can
> > get back to previous cost), there is no failures exposed
> > in regression testing.  I expected that if the previous
> > unaligned_store count is even, this adjustment doesn't
> > change anything, if it's odd, the adjustment may reduce
> > it by one, but I'd guess it would be few.  Besides, as
> > the comments for m_stp_sequence_cost, the current
> > handlings seems temporary, maybe a tweak like this can be
> > accepted, so I posted this RFC/PATCH to request comments.
> > this one line change is considered.
>
> It's unfortunate that doing this didn't show up a regression.
> I guess it's not a change we explicitly added tests to guard against.
>
> But the point of the condition is to estimate how many single stores
> (STRs) and how many paired stores (STPs) would be generated.  As far
> as this heuristic goes, STP (storing two values) is as cheap as STR
> (storing only one value).  So the point of the CEIL is to count 1 store
> as having equal cost to 2, 3 as having equal cost to 4, etc.
>
> For a heuristic like that, costing a vector stmt once with count 2
> is different from costing 2 vector stmts with count 1.  The former
> makes it obvious that the 2 vector stmts are associated with the
> same scalar stmt, and are highly likely to be consecutive.  The latter
> (costing 2 stmts with count 1) could also happen for unrelated stmts.
>
> ISTM that costing once with count N provides strictly more information
> to targets than costing N time with count 1.  Is there no way we can
> keep the current behaviour?  E.g. rather than costing a stmt immediately
> within a loop, could we just increment a counter and cost once at the end?

I suppose we can.  But isn't the logic currently (or before the series) cheated
for variable-strided stores with ncopies > 1?  That is, while it sounds like
reasonable heuristics you can't really rely on this as the vectorizer doesn't
currently provide the info whether two vector loads/stores are adjacent?

Making sure we only pass count > 1 for adjacent load/store would be possible
though.  We should document this with comments though.

Richard.

>
> Thanks,
> Richard
>
> > gcc/ChangeLog:
> >
> >       * config/aarch64/aarch64.cc (aarch64_stp_sequence_cost): Return
> >       count directly instead of the adjusted value computed with CEIL.
> > ---
> >  gcc/config/aarch64/aarch64.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 37d414021ca..9fb4fbd883d 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -17051,7 +17051,7 @@ aarch64_stp_sequence_cost (unsigned int count, 
> > vect_cost_for_stmt kind,
> >         if (!aarch64_aligned_constant_offset_p (stmt_info, size))
> >           return count * 2;
> >       }
> > -      return CEIL (count, 2) * 2;
> > +      return count;
> >
> >      case scalar_store:
> >        if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))

Reply via email to