On Mon, Jun 24, 2019 at 6:37 AM Richard Biener <rguent...@suse.de> wrote:
>
> On Thu, 20 Jun 2019, Jan Hubicka wrote:
>
> > > > Currently, costs of moves are also used for costs of RTL expressions.
> > > > This patch:
> > > >
> > > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > > >
> > > > includes:
> > > >
> > > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > > index e943d13..8409a5f 100644
> > > > --- a/gcc/config/i386/x86-tune-costs.h
> > > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > > >    {4, 4, 4},                /* cost of loading integer registers
> > > >                                  in QImode, HImode and SImode.
> > > >                                  Relative to reg-reg move (2).  */
> > > > -  {6, 6, 6},                /* cost of storing integer registers */
> > > > +  {6, 6, 3},                /* cost of storing integer registers */
> > > >    2,                        /* cost of reg,reg fld/fst */
> > > >    {6, 6, 8},                /* cost of loading fp registers
> > > >                                  in SFmode, DFmode and XFmode */
> >
> > Well, it seems that the patch was fixing things in the wrong spot - the
> > tables are intended to be mostly latency based.  I think we ought to
> > document divergences from them, including benchmarks where the change
> > helped.  Otherwise it is very hard to figure out why an entry does not
> > match reality.
> >
> > > > It lowered the cost for SImode store and made it cheaper than an
> > > > SSE<->integer register move.  It caused a regression:
> > > >
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > > >
> > > > Since the cost for SImode store is also used to compute scalar_store
> > > > in ix86_builtin_vectorization_cost, it changed loop costs in
> > > >
> > > > void
> > > > foo (long p2, long *diag, long d, long i)
> > > > {
> > > >   long k;
> > > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > > >   while (i < k)
> > > >     diag[i++] = d;
> > > > }
> > > >
> > > > As a result, the loop is unrolled 4 times with -O3 -march=skylake,
> > > > instead of 3.
> > > >
> > > > My patch separates costs of moves from costs of RTL expressions.  We
> > > > have a follow-up patch which restores the cost for SImode store back
> > > > to 6 and leaves the cost of scalar_store unchanged.  It keeps loop
> > > > unrolling unchanged and improves powf performance in glibc by 30%.
> > > > We are collecting SPEC CPU 2017 data now.
> >
> > I have seen the problem with scalar_store with AMD tuning as well.
> > It seems to make the SLP vectorizer happy about the idea of turning a
> > sequence of, say, integer stores into code which moves all the values
> > into an AVX register and then does one vector store.
> >
> > The cost model basically compares the cost of N scalar stores to
> > 1 vector store + vector construction, where vector construction is
> > N*sse_op + addss.
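
To put numbers on that comparison, here is a standalone sketch of the
arithmetic the cost model is doing.  The constants are placeholders
chosen for illustration, not the real skylake_cost tuning-table entries:

    /* Sketch of the SLP store-group costing described above: N scalar
       stores vs. one vector store plus a vec_construct modeled as
       N*sse_op + addss.  All constants are illustrative placeholders.  */
    #include <stdio.h>

    int
    main (void)
    {
      int n = 8;             /* elements in the store group  */
      int scalar_store = 3;  /* hypothetical per-element store cost  */
      int vector_store = 3;  /* one vector store  */
      int sse_op = 1;        /* simple SSE op, one per element insert  */
      int addss = 4;         /* combining the two 128-bit halves  */

      int scalar_cost = n * scalar_store;
      int slp_cost = vector_store + n * sse_op + addss;

      printf ("scalar: %d  SLP: %d -> %s\n", scalar_cost, slp_cost,
              slp_cost < scalar_cost ? "SLP wins" : "scalar wins");
      return 0;
    }

Changing the scalar store entry rescales scalar_cost while leaving
slp_cost mostly untouched, which is how a move-cost tweak leaks into
the vectorization decision.
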
> >
> > With testcase:
> >
> > short array[8];
> > test (short a,short b,short c,short d,short e,short f,short g,short h)
> > {
> >   array[0]=a;
> >   array[1]=b;
> >   array[2]=c;
> >   array[3]=d;
> >   array[4]=e;
> >   array[5]=f;
> >   array[6]=g;
> >   array[7]=h;
> > }
> > int iarray[8];
> > test2 (int a,int b,int c,int d,int e,int f,int g,int h)
> > {
> >   iarray[0]=a;
> >   iarray[1]=b;
> >   iarray[2]=c;
> >   iarray[3]=d;
> >   iarray[4]=e;
> >   iarray[5]=f;
> >   iarray[6]=g;
> >   iarray[7]=h;
> > }
> >
> > I get the following codegen:
> >
> > test:
> >     vmovd   %edi, %xmm0
> >     vmovd   %edx, %xmm2
> >     vmovd   %r8d, %xmm1
> >     vmovd   8(%rsp), %xmm3
> >     vpinsrw $1, 16(%rsp), %xmm3, %xmm3
> >     vpinsrw $1, %esi, %xmm0, %xmm0
> >     vpinsrw $1, %ecx, %xmm2, %xmm2
> >     vpinsrw $1, %r9d, %xmm1, %xmm1
> >     vpunpckldq      %xmm2, %xmm0, %xmm0
> >     vpunpckldq      %xmm3, %xmm1, %xmm1
> >     vpunpcklqdq     %xmm1, %xmm0, %xmm0
> >     vmovaps %xmm0, array(%rip)
> >     ret
> >
> > test2:
> >     vmovd   %r8d, %xmm5
> >     vmovd   %edx, %xmm6
> >     vmovd   %edi, %xmm7
> >     vpinsrd $1, %r9d, %xmm5, %xmm1
> >     vpinsrd $1, %ecx, %xmm6, %xmm3
> >     vpinsrd $1, %esi, %xmm7, %xmm0
> >     vpunpcklqdq     %xmm3, %xmm0, %xmm0
> >     vmovd   16(%rbp), %xmm4
> >     vpinsrd $1, 24(%rbp), %xmm4, %xmm2
> >     vpunpcklqdq     %xmm2, %xmm1, %xmm1
> >     vinserti128     $0x1, %xmm1, %ymm0, %ymm0
> >     vmovdqu %ymm0, iarray(%rip)
> >     vzeroupper
> >     ret
> >
> > which is about 20% slower on my Skylake notebook than the
> > non-SLP-vectorized variant.
> >
> > I wonder if the vec_construct costs should be made more realistic.
> > It is computed as:
> >
> >     case vec_construct:
> >       {
> >         /* N element inserts into SSE vectors.  */
> >         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> >         /* One vinserti128 for combining two SSE vectors for AVX256.  */
> >         if (GET_MODE_BITSIZE (mode) == 256)
> >           cost += ix86_vec_cost (mode, ix86_cost->addss);
> >         /* One vinserti64x4 and two vinserti128 for combining SSE
> >            and AVX256 vectors to AVX512.  */
> >         else if (GET_MODE_BITSIZE (mode) == 512)
> >           cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
> >         return cost;
> >       }
> >
> > So it expects 8 simple SSE operations + one SSE FP arithmetic
> > operation, while the code above has 8 inter-unit moves + 3 SSE integer
> > operations to shuffle things around, not to mention the increased
> > register pressure.
>
> But aren't the inter-unit moves a red herring?  Your testcase places
> the sources in integer registers, but usually for the case of
> vectorization we arrive here from strided loads, for which we could
> load the first value into an %xmm reg directly and have the later
> vpinsr instructions take a memory source?
>
> Yes, vec_construct cost isn't the full story in this case, which is
> why add_stmt special-cases strided loads/stores, adding some
> pessimization.
>
> > I would say that for integer constructs it is a common case that
> > things need to be moved from the integer unit to SSE.
>
> Is it?  For SLP vectorization probably yes.  The costing interface
> unfortunately is not giving much information here (well, add_stmt
> has access to the stmt_info ...).
>
> > Overall the problem is deeper, since the vectorizer really may need a
> > better idea about latencies and throughputs to estimate loop times
> > more realistically.
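
One way to make vec_construct more realistic for the integer case Honza
describes would be to charge the inter-unit moves explicitly.  A hedged
sketch, written as a self-contained function with hypothetical
parameters rather than the real ix86_cost hooks:

    /* Hypothetical vec_construct costing that charges GPR->XMM moves
       for elements living in integer registers; elements coming from
       memory are assumed foldable into vpinsr.  The parameters are
       illustrative, not actual ix86_cost entries.  */
    int
    vec_construct_cost_sketch (int nelts, int gpr_elts, int sse_op,
                               int inter_unit_move, int combine_cost)
    {
      int cost = nelts * sse_op;          /* one insert per element  */
      cost += gpr_elts * inter_unit_move; /* vmovd/vpinsr from a GPR  */
      cost += combine_cost;               /* e.g. one vinserti128 for AVX256  */
      return cost;
    }

For the 8-int testcase above, with all eight elements in GPRs, this
would charge 8 inter-unit moves on top of the inserts, matching the
generated code much more closely than the current 8*sse_op + addss.
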
>
> Indeed, but I hardly see how we can handle this in a sensible way,
> since we don't even understand performance corner-cases when analyzing
> them and looking at this info, but the HW still behaves in unexpected
> ways :/
>
> > One also may want to account somewhat for the fact that stores are
> > often not part of the hot path, and thus their latency is not too
> > critical, and, on the other hand, for the fact that vector stores
> > prevent later partial memory stalls...
> >

Costs of moves are closely tied to latency and should be used only by
the register allocator; we shouldn't use them as costs of RTL
expressions.  For the register allocator, register <-> register moves
are preferred over load and store unless they are slower than a
register -> memory -> register round trip.  For RTL costs, we may want
to make load and store cheap to improve RTL expansion, but we don't
want to change the load and store costs seen by the register allocator.
So the first step is to separate costs of moves from costs of RTL
expressions.
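
To make the intended separation concrete, here is a minimal sketch
using hypothetical type and field names; the actual patch may carve up
processor_costs differently:

    /* Hypothetical split of the cost tables.  The register allocator
       keeps its own latency-based move costs, while RTL expression
       costs come from an independent table that can be retuned without
       perturbing the RA or the vectorizer's scalar_store cost.  */
    struct regmove_costs            /* used only by the register allocator  */
    {
      int int_load[3];              /* QImode, HImode and SImode loads  */
      int int_store[3];             /* QImode, HImode and SImode stores  */
      int sse_to_integer;           /* SSE <-> integer register moves  */
    };

    struct insn_costs               /* used only for RTL expression costs  */
    {
      int int_load[3];              /* may be lowered to help RTL expansion  */
      int int_store[3];             /* ...without touching the RA's view  */
    };

    struct processor_costs
    {
      struct regmove_costs regmove; /* latency based, RA only  */
      struct insn_costs insn;       /* RTL costs, tunable freely  */
      /* ... remaining tuning knobs ...  */
    };

With such a split, the follow-up patch can restore the SImode store
move cost to 6 for the register allocator while leaving scalar_store,
and therefore the loop unrolling decisions, unchanged.

--
H.J.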