
Hi Honza,

Thank you for working on this.  

> -----Original Message-----
> From: Gcc-patches <gcc-patches-boun...@gcc.gnu.org> On Behalf Of Jan
> Hubicka
> Sent: Monday, March 15, 2021 3:33 PM
> To: gcc-patches@gcc.gnu.org; mjam...@suse.cz
> Subject: znver3 tuning part 1
> 
> Hi,
> I plan to commit some retuning of znver3 codegen that is based on real
> hardware benchmarks.  It turns out that not too many changes are
> necessary, since Zen 3 is quite a smooth upgrade from Zen 2.  In summary:
> 
>  - some instructions (like idiv) have shorter latencies.  Adjusting the
>    costs reduces code size a bit but the effect seems within noise in
>    benchmarks (our cost calculation is quite off anyway, because it does
>    not account for register pressure and parallelism, which make a huge
>    difference here)
>  - gather instructions are still microcoded but a lot faster than on
>    znver1/znver2, and it turns out they are now beneficial for a few TSVC
>    benchmarks, so I plan to enable them.

Could we get a copy of this benchmark to try?
We also need to check against bigger benchmarks like SPEC.
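
For a quick local check on our side (just my own sketch, not Honza's benchmark), a kernel with an indirect indexed load is the shape where the vectorizer has to use gathers; whether a vgather* instruction actually shows up then depends on the tuning change being discussed here:

/* Hypothetical test kernel (not from the patch): the indirect load
   data[idx[i]] can only be vectorized with a gather instruction.
   Compile with e.g. -O3 -march=znver3 and look for a vgather* insn in
   the generated assembly.  */
void
gather_copy (double *restrict out, const double *restrict data,
             const int *restrict idx, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = data[idx[i]];
}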

> 
>    It seems we missed revisiting this for znver2 tuning.
>    I think even for znver2 it may make sense to re-enable them, so I
>    will benchmark this as well.
>  - memcpy/memset expansion seems to work the same way as for znver2,
>    so I am keeping the same changes.
>  - the instruction scheduler in trunk has already been modified to some
>    degree to reflect the new units.  The problem with instruction
>    scheduling is that it treats Zen as an in-order CPU and is therefore
>    unlikely to fill all execution resources this way.
>    We may want to try to model the out-of-order nature in a similar way
>    to what LLVM does, but on the other hand the current scheduling logic
>    seems to do mostly fine (i.e. not worse than LLVM's).  What matters is
>    scheduling for long latencies and just after branch boundaries, where
>    the simplified model seems to do just fine.

So can we keep the existing model for znver3 in GCC 11?
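
Just to make sure we compare the same thing on our side, a small kernel like the one below (my own example, nothing official) is what I would use to eyeball the schedule GCC emits versus LLVM around a long-latency operation:

/* Hypothetical kernel: the divide has a long latency, so a good
   schedule fills that latency with the independent adds instead of
   stalling at the use.  Comparing -O2 -march=znver3 -S output against
   the LLVM equivalent is a cheap sanity check for the simplified
   in-order model.  */
double
longlat (double a, double b, double c, double d)
{
  double q = a / b;      /* long-latency operation */
  double t = c + d;      /* independent work that can overlap the divide */
  t = t + c;
  t = t + d;
  return q * t;          /* first use of the divide result */
}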

>  - some move instruction latencies do not reflect reality
>    (at least not the latencies published by Agner Fog or in the AMD
>    optimization manual, which themselves do not agree with each other).
>    Adjusting the tables, however, triggers regressions in ImageMagick and
>    parest, so I am still looking for an easy fix; if there is none,
>    I will wait for the next stage1 with these changes.
>    An interesting property is that reg-reg moves have zero latency.
>    Since costs are officially relative to a reg-reg move, that makes them
>    a bit hard to define here :)
>  - fmadd was optimized and is now 4 cycles (it was 5 and 6 cycles on
>    znver2 and znver1 respectively), like on Intel.  However, there is
>    still a problem with extending the critical chain in the matrix
>    multiplication loop.  The difference seems to be that the Intel
>    implementation needs the accumulator value to be ready only 1 cycle
>    after execution has started processing the multiplication.
> 
>    So there is still a performance regression on matmul and thus I am
>    keeping the logic to break critical chains.

My observation is the same here.
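
To illustrate the chain (my own sketch, not the exact matmul benchmark): with a single accumulator every fma waits for the previous one, so the loop runs at FMA latency per iteration; splitting the accumulator, which is roughly what the chain-breaking logic buys us, lets the fmas overlap:

/* Single accumulator: each fma depends on the previous one through
   "acc", so the loop is limited by the FMA latency (4 cycles on znver3).  */
double
dot_chain (const double *a, const double *b, int n)
{
  double acc = 0.0;
  for (int i = 0; i < n; i++)
    acc += a[i] * b[i];
  return acc;
}

/* Two independent accumulators: the dependence chain is broken and
   consecutive fmas can execute in parallel, hiding most of the latency.  */
double
dot_split (const double *a, const double *b, int n)
{
  double acc0 = 0.0, acc1 = 0.0;
  int i;
  for (i = 0; i + 1 < n; i += 2)
    {
      acc0 += a[i] * b[i];
      acc1 += a[i + 1] * b[i + 1];
    }
  for (; i < n; i++)
    acc0 += a[i] * b[i];
  return acc0 + acc1;
}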

> 
> This first patch is a no-op and only copies the cost tables.  I will adjust
> them one-by-one for easier hunting of possible regressions.
> 
> Honza
> 
> 2021-03-15  Jan Hubicka  <hubi...@ucw.cz>
> 
>         * config/i386/i386-options.c (processor_cost_table): Add znver3_cost.
>         * config/i386/x86-tune-costs.h (znver3_cost): New global variable; copy
>         of znver2_cost.
> 
> diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> index e93935f6f2c..7865bc110a3 100644
> --- a/gcc/config/i386/i386-options.c
> +++ b/gcc/config/i386/i386-options.c
> @@ -743,7 +743,7 @@ static const struct processor_costs *processor_cost_table[] =
>    &btver2_cost,
>    &znver1_cost,
>    &znver2_cost,
> -  &znver2_cost
> +  &znver3_cost
>  };
> 
>  /* Guarantee that the array is aligned with enum processor_type.  */
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index cc27c7911e3..e655e668c7a 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1688,6 +1688,140 @@ struct processor_costs znver2_cost = {
>    "16",                                        /* Func alignment.  */
>  };
> 
> +struct processor_costs znver3_cost = {
> +  {
> +  /* Start of register allocator costs.  integer->integer move cost is 2. */
> +
> +  /* reg-reg moves are done by renaming and thus they are even cheaper than
> +     1 cycle.  Because reg-reg move cost is 2 and following tables correspond
> +     to doubles of latencies, we do not model this correctly.  It does not
> +     seem to make practical difference to bump prices up even more.  */
> +  6,                                   /* cost for loading QImode using
> +                                          movzbl.  */
> +  {6, 6, 6},                           /* cost of loading integer registers
> +                                          in QImode, HImode and SImode.
> +                                          Relative to reg-reg move (2).  */
> +  {8, 8, 8},                           /* cost of storing integer
> +                                          registers.  */
> +  2,                                   /* cost of reg,reg fld/fst.  */
> +  {6, 6, 16},                          /* cost of loading fp registers
> +                                          in SFmode, DFmode and XFmode.  */
> +  {8, 8, 16},                          /* cost of storing fp registers
> +                                          in SFmode, DFmode and XFmode.  */
> +  2,                                   /* cost of moving MMX register.  */
> +  {6, 6},                              /* cost of loading MMX registers
> +                                          in SImode and DImode.  */
> +  {8, 8},                              /* cost of storing MMX registers
> +                                          in SImode and DImode.  */
> +  2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
> +                                          register.  */
> +  {6, 6, 6, 6, 12},                    /* cost of loading SSE registers
> +                                          in 32,64,128,256 and 512-bit.  */
> +  {8, 8, 8, 8, 16},                    /* cost of storing SSE registers
> +                                          in 32,64,128,256 and 512-bit.  */
> +  6, 6,                                        /* SSE->integer and integer->SSE
> +                                          moves.  */
> +  8, 8,                                /* mask->integer and integer->mask moves */
> +  {6, 6, 6},                           /* cost of loading mask register
> +                                          in QImode, HImode, SImode.  */
> +  {8, 8, 8},                           /* cost of storing mask register
> +                                          in QImode, HImode, SImode.  */
> +  2,                                   /* cost of moving mask register.  */
> +  /* End of register allocator costs.  */
> +  },
> +
> +  COSTS_N_INSNS (1),                   /* cost of an add instruction.  */
> +  COSTS_N_INSNS (1),                   /* cost of a lea instruction.  */
> +  COSTS_N_INSNS (1),                   /* variable shift costs.  */
> +  COSTS_N_INSNS (1),                   /* constant shift costs.  */
> +  {COSTS_N_INSNS (3),                  /* cost of starting multiply for QI.  */
> +   COSTS_N_INSNS (3),                  /*                               HI.  */
> +   COSTS_N_INSNS (3),                  /*                               SI.  */
> +   COSTS_N_INSNS (3),                  /*                               DI.  */
> +   COSTS_N_INSNS (3)},                 /*                      other.  */
> +  0,                                   /* cost of multiply per each bit
> +                                          set.  */
> +   /* Depending on parameters, idiv can get faster on ryzen.  This is upper
> +      bound.  */
> +  {COSTS_N_INSNS (16),                 /* cost of a divide/mod for QI.  */
> +   COSTS_N_INSNS (22),                 /*                          HI.  */
> +   COSTS_N_INSNS (30),                 /*                          SI.  */
> +   COSTS_N_INSNS (45),                 /*                          DI.  */
> +   COSTS_N_INSNS (45)},                /*                          other.  */
> +  COSTS_N_INSNS (1),                   /* cost of movsx.  */
> +  COSTS_N_INSNS (1),                   /* cost of movzx.  */
> +  8,                                   /* "large" insn.  */
> +  9,                                   /* MOVE_RATIO.  */
> +  6,                                   /* CLEAR_RATIO */
> +  {6, 6, 6},                           /* cost of loading integer registers
> +                                          in QImode, HImode and SImode.
> +                                          Relative to reg-reg move (2).  */
> +  {8, 8, 8},                           /* cost of storing integer
> +                                          registers.  */
> +  {6, 6, 6, 6, 12},                    /* cost of loading SSE registers
> +                                          in 32bit, 64bit, 128bit, 256bit and 512bit */
> +  {8, 8, 8, 8, 16},                    /* cost of storing SSE register
> +                                          in 32bit, 64bit, 128bit, 256bit and 512bit */
> +  {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
> +  {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> +  2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
> +                                          register.  */
> +  6,                                   /* cost of moving SSE register to integer.  */
> +  /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> +     throughput 12.  Approx 9 uops do not depend on vector size and every load
> +     is 7 uops.  */
> +  18, 8,                               /* Gather load static, per_elt.  */
> +  18, 10,                              /* Gather store static, per_elt.  */
> +  32,                                  /* size of l1 cache.  */
> +  512,                                 /* size of l2 cache.  */
> +  64,                                  /* size of prefetch block.  */
> +  /* New AMD processors never drop prefetches; if they cannot be performed
> +     immediately, they are queued.  We set number of simultaneous prefetches
> +     to a large constant to reflect this (it probably is not a good idea not
> +     to limit number of prefetches at all, as their execution also takes some
> +     time).  */
> +  100,                                 /* number of parallel prefetches.  */
> +  3,                                   /* Branch cost.  */
> +  COSTS_N_INSNS (5),                   /* cost of FADD and FSUB insns.  */
> +  COSTS_N_INSNS (5),                   /* cost of FMUL instruction.  */
> +  /* Latency of fdiv is 8-15.  */
> +  COSTS_N_INSNS (15),                  /* cost of FDIV instruction.  */
> +  COSTS_N_INSNS (1),                   /* cost of FABS instruction.  */
> +  COSTS_N_INSNS (1),                   /* cost of FCHS instruction.  */
> +  /* Latency of fsqrt is 4-10.  */
> +  COSTS_N_INSNS (10),                  /* cost of FSQRT instruction.  */
> +
> +  COSTS_N_INSNS (1),                   /* cost of cheap SSE instruction.  */
> +  COSTS_N_INSNS (3),                   /* cost of ADDSS/SD SUBSS/SD insns.  */
> +  COSTS_N_INSNS (3),                   /* cost of MULSS instruction.  */
> +  COSTS_N_INSNS (3),                   /* cost of MULSD instruction.  */
> +  COSTS_N_INSNS (5),                   /* cost of FMA SS instruction.  */
> +  COSTS_N_INSNS (5),                   /* cost of FMA SD instruction.  */
> +  COSTS_N_INSNS (10),                  /* cost of DIVSS instruction.  */
> +  /* 9-13.  */
> +  COSTS_N_INSNS (13),                  /* cost of DIVSD instruction.  */
> +  COSTS_N_INSNS (10),                  /* cost of SQRTSS instruction.  */
> +  COSTS_N_INSNS (15),                  /* cost of SQRTSD instruction.  */
> +  /* Zen can execute 4 integer operations per cycle.  FP operations
> +     take 3 cycles and it can execute 2 integer additions and 2
> +     multiplications thus reassociation may make sense up to a width of 6.
> +     SPEC2k6 benchmarks suggest
> +     that 4 works better than 6, probably due to register pressure.
> +
> +     Integer vector operations are taken by FP unit and execute 3 vector
> +     plus/minus operations per cycle but only one multiply.  This is adjusted
> +     in ix86_reassociation_width.  */
> +  4, 4, 3, 6,                          /* reassoc int, fp, vec_int, vec_fp.  */
> +  znver2_memcpy,
> +  znver2_memset,
> +  COSTS_N_INSNS (4),                   /* cond_taken_branch_cost.  */
> +  COSTS_N_INSNS (2),                   /* cond_not_taken_branch_cost.  */
> +  "16",                                        /* Loop alignment.  */
> +  "16",                                        /* Jump alignment.  */
> +  "0:0:8",                             /* Label alignment.  */
> +  "16",                                        /* Func alignment.  */
> +};
> +
>  /* skylake_cost should produce code tuned for Skylake family of CPUs.  */
>  static stringop_algs skylake_memcpy[2] =   {
>    {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},


Regards,
Venkat.
