On Tue, Jul 23, 2019 at 2:57 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Mon, Jun 24, 2019 at 9:16 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Mon, Jun 24, 2019 at 6:37 AM Richard Biener <rguent...@suse.de> wrote:
> > >
> > > On Thu, 20 Jun 2019, Jan Hubicka wrote:
> > >
> > > > > > Currently, costs of moves are also used for costs of RTL
> > > > > > expressions.  This patch:
> > > > > >
> > > > > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > > > > >
> > > > > > includes:
> > > > > >
> > > > > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > > > > index e943d13..8409a5f 100644
> > > > > > --- a/gcc/config/i386/x86-tune-costs.h
> > > > > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > > > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > > > > >    {4, 4, 4}, /* cost of loading integer registers
> > > > > >      in QImode, HImode and SImode.
> > > > > >      Relative to reg-reg move (2).  */
> > > > > > -  {6, 6, 6}, /* cost of storing integer registers */
> > > > > > +  {6, 6, 3}, /* cost of storing integer registers */
> > > > > >    2, /* cost of reg,reg fld/fst */
> > > > > >    {6, 6, 8}, /* cost of loading fp registers
> > > > > >      in SFmode, DFmode and XFmode */
> > > >
> > > > Well, it seems that the patch was fixing things in the wrong spot -
> > > > the tables are intended to be mostly latency based.  I think we ought
> > > > to document divergences from latency, including the benchmarks where
> > > > the change helped.  Otherwise it is very hard to figure out why an
> > > > entry does not match reality.
> > > > > >
> > > > > > It lowered the cost of an SImode store and made it cheaper than an
> > > > > > SSE<->integer register move, which caused a regression:
> > > > > >
> > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > > > > >
> > > > > > Since the cost of an SImode store is also used to compute
> > > > > > scalar_store in ix86_builtin_vectorization_cost, it changed the
> > > > > > loop costs in
> > > > > >
> > > > > > void
> > > > > > foo (long p2, long *diag, long d, long i)
> > > > > > {
> > > > > >   long k;
> > > > > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > > > > >   while (i < k)
> > > > > >     diag[i++] = d;
> > > > > > }
> > > > > >
> > > > > > As a result, the loop is unrolled 4 times with -O3 -march=skylake
> > > > > > instead of 3 times.
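> > > > > >
> > > > > > (For reference, scalar_store is derived from the integer store
> > > > > > cost roughly as below - a paraphrased sketch of the relevant case
> > > > > > in ix86_builtin_vectorization_cost; the exact code may differ:
> > > > > >
> > > > > >       case scalar_store:
> > > > > >         /* The tables give load/store costs relative to a reg-reg
> > > > > >            move (2), so rescale to COSTS_N_INSNS units.  */
> > > > > >         return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0]
> > > > > >                               : ix86_cost->int_store[2]) / 2;
> > > > > >
> > > > > > so lowering int_store[2] from 6 to 3 halves scalar_store as
> > > > > > well.)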
> > > > > >
> > > > > > My patch separates the costs of moves from the costs of RTL
> > > > > > expressions.  We have a follow-up patch which restores the cost of
> > > > > > the SImode store to 6 and leaves the cost of scalar_store
> > > > > > unchanged.  It keeps loop unrolling unchanged and improves powf
> > > > > > performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > > > > > data now.
> > > >
> > > > I have seen the problem with scalar_store with AMD tuning as well.
> > > > It seems to make the SLP vectorizer happy about the idea of turning a
> > > > sequence of, say, integer stores into code which moves all the values
> > > > into an AVX register and then does one vector store.
> > > >
> > > > The cost model basically compares the cost of N scalar stores to
> > > > that of 1 vector store + vector construction, where vector
> > > > construction is N*sse_op + addss.
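> > > >
> > > > To make that concrete with made-up numbers (say scalar store = 3,
> > > > vector store = 3, sse_op = 1, addss = 4 and N = 8): the scalar side
> > > > costs 8 * 3 = 24, the vector side 3 + 8 * 1 + 4 = 15, so on paper
> > > > SLP wins even where the inter-unit moves make it a loss in practice.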
> > > >
> > > > With testcase:
> > > >
> > > > short array[8];
> > > >
> > > > void
> > > > test (short a, short b, short c, short d, short e, short f, short g,
> > > >       short h)
> > > > {
> > > >   array[0] = a;
> > > >   array[1] = b;
> > > >   array[2] = c;
> > > >   array[3] = d;
> > > >   array[4] = e;
> > > >   array[5] = f;
> > > >   array[6] = g;
> > > >   array[7] = h;
> > > > }
> > > > int iarray[8];
> > > >
> > > > void
> > > > test2 (int a, int b, int c, int d, int e, int f, int g, int h)
> > > > {
> > > >   iarray[0] = a;
> > > >   iarray[1] = b;
> > > >   iarray[2] = c;
> > > >   iarray[3] = d;
> > > >   iarray[4] = e;
> > > >   iarray[5] = f;
> > > >   iarray[6] = g;
> > > >   iarray[7] = h;
> > > > }
> > > >
> > > > I get the following codegen:
> > > >
> > > > test:
> > > >         vmovd   %edi, %xmm0
> > > >         vmovd   %edx, %xmm2
> > > >         vmovd   %r8d, %xmm1
> > > >         vmovd   8(%rsp), %xmm3
> > > >         vpinsrw $1, 16(%rsp), %xmm3, %xmm3
> > > >         vpinsrw $1, %esi, %xmm0, %xmm0
> > > >         vpinsrw $1, %ecx, %xmm2, %xmm2
> > > >         vpinsrw $1, %r9d, %xmm1, %xmm1
> > > >         vpunpckldq      %xmm2, %xmm0, %xmm0
> > > >         vpunpckldq      %xmm3, %xmm1, %xmm1
> > > >         vpunpcklqdq     %xmm1, %xmm0, %xmm0
> > > >         vmovaps %xmm0, array(%rip)
> > > >         ret
> > > >
> > > > test2:
> > > >         vmovd   %r8d, %xmm5
> > > >         vmovd   %edx, %xmm6
> > > >         vmovd   %edi, %xmm7
> > > >         vpinsrd $1, %r9d, %xmm5, %xmm1
> > > >         vpinsrd $1, %ecx, %xmm6, %xmm3
> > > >         vpinsrd $1, %esi, %xmm7, %xmm0
> > > >         vpunpcklqdq     %xmm3, %xmm0, %xmm0
> > > >         vmovd   16(%rbp), %xmm4
> > > >         vpinsrd $1, 24(%rbp), %xmm4, %xmm2
> > > >         vpunpcklqdq     %xmm2, %xmm1, %xmm1
> > > >         vinserti128     $0x1, %xmm1, %ymm0, %ymm0
> > > >         vmovdqu %ymm0, iarray(%rip)
> > > >         vzeroupper
> > > >         ret
> > > >
> > > > which is about 20% slower on my skylake notebook than the
> > > > non-SLP-vectorized variant.
> > > >
> > > > I wonder if the vec_construct cost should be made more realistic.
> > > > It is computed as:
> > > >
> > > >       case vec_construct:
> > > >         {
> > > >           /* N element inserts into SSE vectors.  */
> > > >           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> > > >           /* One vinserti128 for combining two SSE vectors for AVX256.  */
> > > >           if (GET_MODE_BITSIZE (mode) == 256)
> > > >             cost += ix86_vec_cost (mode, ix86_cost->addss);
> > > >           /* One vinserti64x4 and two vinserti128 for combining SSE
> > > >              and AVX256 vectors to AVX512.  */
> > > >           else if (GET_MODE_BITSIZE (mode) == 512)
> > > >             cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
> > > >           return cost;
> > > >
> > > > So it expects 8 simple SSE operations + one SSE FP arithmetic
> > > > operation, while the code above has 8 inter-unit moves + 3 SSE
> > > > integer operations to shuffle things around, not to mention the
> > > > increased register pressure.
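> > > >
> > > > (With the same illustrative values sse_op = 1 and addss = 4, the two
> > > > testcases above get vec_construct = 8 * 1 = 8 for the 128-bit V8HI
> > > > store in test and 8 * 1 + 4 = 12 for the 256-bit V8SI store in test2;
> > > > neither figure charges anything for the eight vmovd/vpinsr inter-unit
> > > > transfers the generated code actually performs.)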
> > >
> > > But aren't the inter-unit moves a red herring?  Your testcase places
> > > the sources in integer registers, but in the vectorization case we
> > > usually arrive here from strided loads, for which we could load the
> > > first value into a %xmm reg directly and give the later vpinsr
> > > instructions a memory source?
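> > >
> > > For instance (an illustrative sketch, not your testcase) something
> > > like
> > >
> > >   short src[64];
> > >   short dst[8];
> > >
> > >   void
> > >   test3 (void)
> > >   {
> > >     for (int i = 0; i < 8; i++)
> > >       dst[i] = src[i * 8];
> > >   }
> > >
> > > builds the vector from memory elements, so the lanes can be filled
> > > with memory-source vpinsrw and no GPR -> XMM transfers are needed.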
> > >
> > > Yes, the vec_construct cost isn't the full story in this case, which
> > > is why add_stmt special-cases strided loads/stores and adds some
> > > pessimization.
> > >
> > > > I would say that for integer constructs it is a common case that
> > > > things need to be moved from the integer unit to the SSE unit.
> > >
> > > Is it?  For SLP vectorization probably yes.  The costing interface
> > > unfortunately is not giving much information here (well, add_stmt
> > > has access to the stmt_info ...).
> > >
> > > > Overall the problem is deeper, since the vectorizer may really need
> > > > a better idea of latencies and throughputs to estimate loop times
> > > > more realistically.
> > >
> > > Indeed, but I hardly see how we can handle this in a sensible way,
> > > since we often don't understand the performance corner-cases even when
> > > analyzing them with this info in hand - the HW still behaves in
> > > unexpected ways :/
> > >
> > > > One may also want to account somewhat for the fact that stores are
> > > > often not on the hot path, and thus their latency is not too
> > > > critical, and, on the other hand, that vector stores prevent later
> > > > partial memory stalls...
> > > >
> >
> > Costs of moves are closely related to latency and should only be used
> > by the register allocator; we shouldn't use them for RTL costs.  For
> > the register allocator, register <-> register moves are preferred over
> > load and store unless they are slower than a register -> memory ->
> > register sequence.  For RTL costs, we may want to make load and store
> > cheap to improve RTL expansion, but we don't want to change the load
> > and store costs seen by the register allocator.  We need to separate
> > the costs of moves from the costs of RTL expressions first.
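> >
> > A hypothetical shape of that separation (field names illustrative, not
> > necessarily those in the actual patch):
> >
> >   struct processor_costs
> >   {
> >     /* Read only by the register allocator through the
> >        TARGET_REGISTER_MOVE_COST / TARGET_MEMORY_MOVE_COST hooks, so
> >        these can stay latency based.  */
> >     struct
> >     {
> >       int int_load[3];    /* QImode, HImode and SImode loads.  */
> >       int int_store[3];   /* QImode, HImode and SImode stores.  */
> >       int sse_to_integer; /* SSE -> integer register moves.  */
> >     } hard_register;
> >
> >     /* Used for RTL expression costs (rtx_costs, vectorizer hooks) and
> >        tunable for expansion without disturbing the allocator.  */
> >     int int_load[3];
> >     int int_store[3];
> >     int sse_op;
> >   };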
> >
>
> Here is the updated patch to improve register allocator and RTL
> expressions independently.
>
> Any comments?
>

PING:

https://gcc.gnu.org/ml/gcc-patches/2019-07/msg01542.html

-- 
H.J.
