Re: znver3 tuning part 1

Martin Liška Mon, 22 Mar 2021 02:18:36 -0700

On 3/22/21 9:43 AM, Kumar, Venkataramanan via Gcc-patches wrote:

[AMD Official Use Only - Internal Distribution Only]


Just brief note about this. You CC a public mailing list, thus the disclaimer 
does not
make much sense. Please take a look here: 
https://gcc.gnu.org/lists.html#policies


Hi Honza,

Thank you for working on this.

-----Original Message-----
From: Gcc-patches <gcc-patches-boun...@gcc.gnu.org> On Behalf Of Jan
Hubicka
Sent: Monday, March 15, 2021 3:33 PM
To: gcc-patches@gcc.gnu.org; mjam...@suse.cz
Subject: znver3 tuning part 1

[CAUTION: External Email]

Hi,
I plan to commit some retuning of znver3 codegen that is based on real
hardware benchmarks.  It turns out that there are not too many changes
necessary sinze Zen3 is quite smooth upgrade to Zen2.  In summary:

  - some instructions (like idiv) have shorter latencies.  Adjusting
    costs reduces code size a bit but seems within noise in benchmark
    (since our cost calculation is quite off anyway because it does not
    account register pressure and parallelism that does make huge
    difference here)
  - gather instructions are still microcoded but a lot faster than in
    znver1/znver2 and it turns out they are now beneficial for few tsmc
    benchmarks, so I plan to enable them.


Can we get a copy of this benchmark to try ?


Sure: https://github.com/UoB-HPC/TSVC_2

Cheers,
Martin

we need to check on bigger benchmarks like SPEC also.


    It seems we missed revisiting this for znver2 tuning.
    I think even for znver2 it may make sense to re-enable them, so I
    will benchmark this as well.
  - memcpy/memset expansion seems to work same way as for znver2,
    so I am keeping same changes.
  - instruction scheduler is already modified in trunk to some degree
    reflecting new units.  Problem with instruction scheduling is that
    it treats zen as in-order CPU and is unlikely going to fill all
    execution resources this way.
    We may want to try to model the out-of-order nature similar way as
    LLVM does, but at the other hand the current scheduling logic seems
    to do mostly fine (i.e. not worse than llvm's).  What matters is
    to schedule for long latencies and just after branch boundaries
    where simplified model seems to do just fine.


So we can keep the existing model for znver3 for GCC 11 ?

  - some move instruction latencies does not reflect reality
    (at least the published latencies by Agner Fog or AMD optimization
    manual that themseleves does not agree with each otehr).
    Adjusting tables however triggers regressions in ImageMagick and
    parest, so I am still looking if there is easy fix for this and if
    not, I will wait for next stage1 with these.
    Interesting property is that reg-reg moves are a zero latency.
    Since costs are officially relative to reg-reg move it makes it bit
    hard to define here :)
  - fmadd was optimized and it is now 4 cycles (was 5 and 6 cycles on
    znver2 and znver1 respectively) like on Intel. However there is still
    problem with extending the critical chain in matrix multiplication
    loop.  The difference seems to be that Intel implementation needs the
    accumulator value to be ready only 1 cycle after the execution
    started processing the multiplication.

    So there is still a performance regression on matmul and thus I am
    keeping the logic to break critical chains.


My observation is also same here.


This first patch is no-op and it only copies the cost tables.  I will adjust 
them one-
by-one for easier hunting of possible regressions.

Honza

2021-03-15  Jan Hubicka  <hubi...@ucw.cz>

         * config/i386/i386-options.c (processor_cost_table): Add znver3_cost.
         * config/i386/x86-tune-costs.h (znver3_cost): New gobal variable; copy
         of znver2_cost.

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index e93935f6f2c..7865bc110a3 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -743,7 +743,7 @@ static const struct processor_costs
*processor_cost_table[] =
    &btver2_cost,
    &znver1_cost,
    &znver2_cost,
-  &znver2_cost
+  &znver3_cost
  };

  /* Guarantee that the array is aligned with enum processor_type.  */ diff 
--git
a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index cc27c7911e3..e655e668c7a 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1688,6 +1688,140 @@ struct processor_costs znver2_cost = {
    "16",                                        /* Func alignment.  */
  };

+struct processor_costs znver3_cost = {
+  {
+  /* Start of register allocator costs.  integer->integer move cost is
+2. */
+
+  /* reg-reg moves are done by renaming and thus they are even cheaper than
+     1 cycle.  Because reg-reg move cost is 2 and following tables correspond
+     to doubles of latencies, we do not model this correctly.  It does not
+     seem to make practical difference to bump prices up even more.  */
+  6,                                   /* cost for loading QImode using
+                                          movzbl.  */
+  {6, 6, 6},                           /* cost of loading integer registers
+                                          in QImode, HImode and SImode.
+                                          Relative to reg-reg move (2).  */
+  {8, 8, 8},                           /* cost of storing integer
+                                          registers.  */
+  2,                                   /* cost of reg,reg fld/fst.  */
+  {6, 6, 16},                          /* cost of loading fp registers
+                                          in SFmode, DFmode and XFmode.  */
+  {8, 8, 16},                          /* cost of storing fp registers
+                                          in SFmode, DFmode and XFmode.  */
+  2,                                   /* cost of moving MMX register.  */
+  {6, 6},                              /* cost of loading MMX registers
+                                          in SImode and DImode.  */
+  {8, 8},                              /* cost of storing MMX registers
+                                          in SImode and DImode.  */
+  2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
+                                          register.  */
+  {6, 6, 6, 6, 12},                    /* cost of loading SSE registers
+                                          in 32,64,128,256 and 512-bit.  */
+  {8, 8, 8, 8, 16},                    /* cost of storing SSE registers
+                                          in 32,64,128,256 and 512-bit.  */
+  6, 6,                                        /* SSE->integer and integer->SSE
+                                          moves.  */
+  8, 8,                                /* mask->integer and integer->mask 
moves */
+  {6, 6, 6},                           /* cost of loading mask register
+                                          in QImode, HImode, SImode.  */
+  {8, 8, 8},                           /* cost if storing mask register
+                                          in QImode, HImode, SImode.  */
+  2,                                   /* cost of moving mask register.  */
+  /* End of register allocator costs.  */  },
+
+  COSTS_N_INSNS (1),                   /* cost of an add instruction.  */
+  COSTS_N_INSNS (1),                   /* cost of a lea instruction.  */
+  COSTS_N_INSNS (1),                   /* variable shift costs.  */
+  COSTS_N_INSNS (1),                   /* constant shift costs.  */
+  {COSTS_N_INSNS (3),                  /* cost of starting multiply for QI.  */
+   COSTS_N_INSNS (3),                  /*                               HI.  */
+   COSTS_N_INSNS (3),                  /*                               SI.  */
+   COSTS_N_INSNS (3),                  /*                               DI.  */
+   COSTS_N_INSNS (3)},                 /*                      other.  */
+  0,                                   /* cost of multiply per each bit
+                                          set.  */
+   /* Depending on parameters, idiv can get faster on ryzen.  This is upper
+      bound.  */
+  {COSTS_N_INSNS (16),                 /* cost of a divide/mod for QI.  */
+   COSTS_N_INSNS (22),                 /*                          HI.  */
+   COSTS_N_INSNS (30),                 /*                          SI.  */
+   COSTS_N_INSNS (45),                 /*                          DI.  */
+   COSTS_N_INSNS (45)},                        /*                          
other.  */
+  COSTS_N_INSNS (1),                   /* cost of movsx.  */
+  COSTS_N_INSNS (1),                   /* cost of movzx.  */
+  8,                                   /* "large" insn.  */
+  9,                                   /* MOVE_RATIO.  */
+  6,                                   /* CLEAR_RATIO */
+  {6, 6, 6},                           /* cost of loading integer registers
+                                          in QImode, HImode and SImode.
+                                          Relative to reg-reg move (2).  */
+  {8, 8, 8},                           /* cost of storing integer
+                                          registers.  */
+  {6, 6, 6, 6, 12},                    /* cost of loading SSE registers
+                                          in 32bit, 64bit, 128bit, 256bit and 
512bit */
+  {8, 8, 8, 8, 16},                    /* cost of storing SSE register
+                                          in 32bit, 64bit, 128bit, 256bit and 
512bit */
+  {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
+  {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
+  2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
+                                          register.  */
+  6,                                   /* cost of moving SSE register to 
integer.  */
+  /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
+     throughput 12.  Approx 9 uops do not depend on vector size and every load
+     is 7 uops.  */
+  18, 8,                               /* Gather load static, per_elt.  */
+  18, 10,                              /* Gather store static, per_elt.  */
+  32,                                  /* size of l1 cache.  */
+  512,                                 /* size of l2 cache.  */
+  64,                                  /* size of prefetch block.  */
+  /* New AMD processors never drop prefetches; if they cannot be performed
+     immediately, they are queued.  We set number of simultaneous prefetches
+     to a large constant to reflect this (it probably is not a good idea not
+     to limit number of prefetches at all, as their execution also takes some
+     time).  */
+  100,                                 /* number of parallel prefetches.  */
+  3,                                   /* Branch cost.  */
+  COSTS_N_INSNS (5),                   /* cost of FADD and FSUB insns.  */
+  COSTS_N_INSNS (5),                   /* cost of FMUL instruction.  */
+  /* Latency of fdiv is 8-15.  */
+  COSTS_N_INSNS (15),                  /* cost of FDIV instruction.  */
+  COSTS_N_INSNS (1),                   /* cost of FABS instruction.  */
+  COSTS_N_INSNS (1),                   /* cost of FCHS instruction.  */
+  /* Latency of fsqrt is 4-10.  */
+  COSTS_N_INSNS (10),                  /* cost of FSQRT instruction.  */
+
+  COSTS_N_INSNS (1),                   /* cost of cheap SSE instruction.  */
+  COSTS_N_INSNS (3),                   /* cost of ADDSS/SD SUBSS/SD insns.  */
+  COSTS_N_INSNS (3),                   /* cost of MULSS instruction.  */
+  COSTS_N_INSNS (3),                   /* cost of MULSD instruction.  */
+  COSTS_N_INSNS (5),                   /* cost of FMA SS instruction.  */
+  COSTS_N_INSNS (5),                   /* cost of FMA SD instruction.  */
+  COSTS_N_INSNS (10),                  /* cost of DIVSS instruction.  */
+  /* 9-13.  */
+  COSTS_N_INSNS (13),                  /* cost of DIVSD instruction.  */
+  COSTS_N_INSNS (10),                  /* cost of SQRTSS instruction.  */
+  COSTS_N_INSNS (15),                  /* cost of SQRTSD instruction.  */
+  /* Zen can execute 4 integer operations per cycle.  FP operations
+     take 3 cycles and it can execute 2 integer additions and 2
+     multiplications thus reassociation may make sense up to with of 6.
+     SPEC2k6 bencharks suggests
+     that 4 works better than 6 probably due to register pressure.
+
+     Integer vector operations are taken by FP unit and execute 3 vector
+     plus/minus operations per cycle but only one multiply.  This is adjusted
+     in ix86_reassociation_width.  */
+  4, 4, 3, 6,                          /* reassoc int, fp, vec_int, vec_fp.  */
+  znver2_memcpy,
+  znver2_memset,
+  COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
+  COSTS_N_INSNS (2),                   /* cond_not_taken_branch_cost.  */
+  "16",                                        /* Loop alignment.  */
+  "16",                                        /* Jump alignment.  */
+  "0:0:8",                             /* Label alignment.  */
+  "16",                                        /* Func alignment.  */
+};
+
  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
  static stringop_algs skylake_memcpy[2] =   {
    {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},



Regards,
Venkat.

Re: znver3 tuning part 1

Reply via email to