On Tue, Sep 3, 2019 at 9:57 AM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Mon, Sep 2, 2019 at 4:41 PM Uros Bizjak <ubiz...@gmail.com> wrote:
> >
> > On Mon, Sep 2, 2019 at 10:13 AM Hongtao Liu <crazy...@gmail.com> wrote:
> > >
> > > > which is not the case with core_cost (and similar with skylake_cost):
> > > >
> > > >   2, 2, 4,                /* cost of moving XMM,YMM,ZMM register */
> > > >   {6, 6, 6, 6, 12},            /* cost of loading SSE registers
> > > >                        in 32,64,128,256 and 512-bit */
> > > >   {6, 6, 6, 6, 12},            /* cost of storing SSE registers
> > > >                        in 32,64,128,256 and 512-bit */
> > > >   2, 2,                    /* SSE->integer and integer->SSE moves */
> > > >
> > > > We have the same cost of moving between integer registers (by default
> > > > set to 2), between SSE registers and between integer and SSE register
> > > > sets. I think that at least the cost of moves between regsets should
> > > > be substantially higher, rs6000 uses 3x cost of intra-regset moves;
> > > > that would translate to the value of 6. The value should be low enough
> > > > to keep the cost below the value that forces move through the memory.
> > > > Changing core register allocation cost of SSE <-> integer to:
> > > >
> > > > --cut here--
> > > > Index: config/i386/x86-tune-costs.h
> > > > ===================================================================
> > > > --- config/i386/x86-tune-costs.h        (revision 275281)
> > > > +++ config/i386/x86-tune-costs.h        (working copy)
> > > > @@ -2555,7 +2555,7 @@ struct processor_costs core_cost = {
> > > >                                            in 32,64,128,256 and 512-bit 
> > > > */
> > > >    {6, 6, 6, 6, 12},                    /* cost of storing SSE registers
> > > >                                            in 32,64,128,256 and 512-bit 
> > > > */
> > > > -  2, 2,                                        /* SSE->integer and
> > > > integer->SSE moves */
> > > > +  6, 6,                                        /* SSE->integer and
> > > > integer->SSE moves */
> > > >    /* End of register allocator costs.  */
> > > >    },
> > > >
> > > > --cut here--
> > > >
> > > > still produces direct move in gcc.target/i386/minmax-6.c
> > > >
> > > > I think that in addition to attached patch, values between 2 and 6
> > > > should be considered in benchmarking. Unfortunately, without access to
> > > > regressed SPEC tests, I can't analyse these changes by myself.
> > > >
> > > > Uros.
> > >
> > > Apply similar change to skylake_cost, on skylake workstation we got
> > > performance like:
> > > ---------------------------
> > > version                                                            |
> > > 548_exchange_r score
> > > gcc10_20180822:                                           |   10
> > > apply remove_max8                                       |   8.9
> > > also apply increase integer_tofrom_sse cost |   9.69
> > > -----------------------------
> > > Still 3% regression which is related to _gfortran_mminloc0_4_i4 in
> > > libgfortran.so.5.0.0.
> > >
> > > I found suspicious code as bellow, does it affect?
> >
> > Hard to say without access to the test, but I'm glad that changing the
> > knob has noticeable effect. I think that (as said by Alan) a fine-tune
> > of register pressure calculation will be needed to push this forward.
> >
> > Uros.
> >
> > > ------------------
> > > modified   gcc/config/i386/i386-features.c
> > > @@ -590,7 +590,7 @@ general_scalar_chain::compute_convert_gain ()
> > >    if (dump_file)
> > >      fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
> > >
> > > -  /* ???  What about integer to SSE?  */
> > > +  /* ???  What about integer to SSE?  */???
> > >    EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
> > >      cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
> > > ------------------
> > > --
> > > BR,
> > > Hongtao
>
> Note:
> Removing limit of cost would introduce lots of regressions in SPEC2017 as 
> follow
> --------------------------------
> 531.deepsjeng_r  -7.18%
> 548.exchange_r  -6.70%
> 557.xz_r -6.74%
> 508.namd_r -2.81%
> 527.cam4_r -6.48%
> 544.nab_r -3.99%
>
> Tested on skylake server.
> -------------------------------------
> How about  changing cost from 2 to 8 until we figure out a better number.

Certainly works for me.  Note the STV code uses the "other" sse_to_integer
number and the numbers in question here are those for the RA.  There's
a multitude of values used in the tables here, including some a lot larger.
So the overall bumping to 8 certainly was the wrong thing to do and instead
individual numbers should have been adjusted (didn't look at the history
of that bumping).  For example Pentium4 has quite high bases for move
costs, like xmm <-> xmm move costing 12 and SSE->integer costing 20
while the opposite 12.

So yes, we want to revert the patch by applying its effect to the
individual cost tables so we can revisit this for the still interesting
micro-architectures.

Richard.

> --
> BR,
> Hongtao

Reply via email to