On Thu, Jun 20, 2019 at 12:43 AM Uros Bizjak <ubiz...@gmail.com> wrote:
>
> On Thu, Jun 20, 2019 at 9:40 AM Uros Bizjak <ubiz...@gmail.com> wrote:
> >
> > On Mon, Jun 17, 2019 at 6:27 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > processor_costs has costs of RTL expressions and costs of moves:
> > >
> > > 1. Costs of RTL expressions is computed as COSTS_N_INSNS which are used
> > > to generate RTL expressions with the lowest costs.  Costs of RTL memory
> > > operation can be very close to costs of fast instructions to indicate
> > > fast memory operations.
> > >
> > > 2. After RTL expressions have been generated, costs of moves are used by
> > > TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
> > > costs for register allocator.  Costs of load and store are higher than
> > > costs of register moves to reduce stack usages by register allocator.
> > >
> > > We should separate costs of RTL expressions from costs of moves so that
> > > they can be adjusted independently.  This patch moves costs of moves to
> > > the new used_by_ra field and duplicates costs of moves which are also
> > > used for costs of RTL expressions.
> >
> > Actually, I think that the current separation is OK. Before reload, we
> > actually don't know which register set will perform the move (not even
> > if float mode will be moved in integer registers), the only thing we
> > can estimate is the number of move instructions. The real cost of
> > register moves is later calculated by the register allocator, where
> > the register class is taken into account when calculating the cost.
>
> Forgot to say that due to the above reasoning, cost of moves should
> not be used in the calculation of costs of RTL expressions, as we are
> talking about two different cost functions. RTL expressions should
> know nothing about register classes.
>

Currently, costs of moves are also used to compute costs of RTL
expressions.  This patch:

https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html

includes:

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index e943d13..8409a5f 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4}, /* cost of loading integer registers
     in QImode, HImode and SImode.
     Relative to reg-reg move (2).  */
-  {6, 6, 6}, /* cost of storing integer registers */
+  {6, 6, 3}, /* cost of storing integer registers */
   2, /* cost of reg,reg fld/fst */
   {6, 6, 8}, /* cost of loading fp registers
     in SFmode, DFmode and XFmode */

It lowered the cost of an SImode store and made it cheaper than an
SSE<->integer register move, which caused a regression:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878

Since the cost of an SImode store is also used to compute scalar_store
in ix86_builtin_vectorization_cost, the change also altered loop costs in

void
foo (long p2, long *diag, long d, long i)
{
  long k;
  k = p2 < 3 ? p2 + p2 : p2 + 3;
  while (i < k)
    diag[i++] = d;
}

As a result, the loop is unrolled 4 times with -O3 -march=skylake,
instead of 3.

My patch separates costs of moves from costs of RTL expressions.  We have
a follow-up patch which restores the cost of the SImode store to 6 and leaves
the cost of scalar_store unchanged.  It keeps loop unrolling unchanged and
improves powf performance in glibc by 30%.  We are collecting SPEC CPU 2017
data now.
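
To make the coupling concrete, here is a small standalone sketch.  It is not
the GCC source: the struct and function names are hypothetical, the tables
hold only the integer store costs, and COSTS_N_INSNS is assumed to scale by 4
as it does in GCC's rtl.h.  It only illustrates why one shared field cannot
be tuned for the register allocator without also moving the vectorizer's
scalar_store cost, and how a separate used_by_ra copy removes that tie.

```c
/* Illustrative sketch only; every name below is hypothetical.  */
#define COSTS_N_INSNS(N) ((N) * 4)  /* assumed scaling, as in GCC's rtl.h */

/* One table read by both consumers (the current arrangement).  */
struct shared_costs {
  int int_store[3];  /* QImode, HImode, SImode; relative to reg-reg move (2) */
};

/* Two tables: one for the register allocator, one for RTL/vectorizer costs.  */
struct split_costs {
  struct { int int_store[3]; } used_by_ra;
  int int_store[3];
};

/* Register allocator view of an SImode store, shared table.  */
static int ra_store_shared (const struct shared_costs *c)
{
  return c->int_store[2];
}

/* Vectorizer view (scalar_store), shared table: rescale the same number.  */
static int vect_store_shared (const struct shared_costs *c)
{
  return COSTS_N_INSNS (c->int_store[2]) / 2;
}

/* Register allocator view, split table.  */
static int ra_store_split (const struct split_costs *c)
{
  return c->used_by_ra.int_store[2];
}

/* Vectorizer view, split table: reads its own copy.  */
static int vect_store_split (const struct split_costs *c)
{
  return COSTS_N_INSNS (c->int_store[2]) / 2;
}

/* Shared table with SImode store cost 3: both views see the low value.  */
static const struct shared_costs shared_3 = { {6, 6, 3} };
/* Shared table with SImode store cost 6: the vectorizer cost doubles too.  */
static const struct shared_costs shared_6 = { {6, 6, 6} };
/* Split table: RA sees 6, the vectorizer keeps 3.  */
static const struct split_costs split = { { {6, 6, 6} }, {6, 6, 3} };
```

With the shared table, raising the SImode entry from 3 to 6 moves the
vectorizer's number from 6 to 12 along with the allocator's.  With the split
table the allocator sees 6 while the vectorizer's number stays at 6.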

-- 
H.J.
