On 11/6/24 12:11, Vineet Gupta wrote:
> changes since v1
>   * Changed target hook to --param
>   * squash addon patch for RISC-V opting-in, testcase here
>   * updated changelog with latest perf numbers

ping !

> ---
>
> sched1 computes ECC (Excess Change Cost) for each insn, which represents
> the register pressure attributed to the insn.
> Currently the pressure sensitive schduling algorithm deliberately ignores
> negative values (pressure reduction), making them 0 (neutral), leading
> to more spills. This happens due to the assumption that the compiler has
> a reasonably accurate processor pipeline scheduling model and thus tries
> to aggresively fill pipeline bubbles with spill slots.
>
> This however might not be true, as the model might not be available for
> certains uarches or even applicable especially for modern out-of-order cores.
>
> The existing heuristic induces spill frenzy on RISC-V, noticably so on
> SPEC2017 507.Cactu. If insn scheduling is disabled completely, the
> total dynamic icounts for this workload are reduced in half from
> ~2.5 trillion insns to ~1.3 (w/ -fno-schedule-insns).
>
> This patch adds --param=cycle-accurate-model={0,1} to gate the spill
> behavior.
>
>  - The default (1) preserves existing spill behavior.
>
>  - targets/uarches sensitive to spilling can override the param to (0)
>    to get the reverse effect. RISC-V backend does so too.
>
> The actual perf numbers are very promising.
>
> (1) On RISC-V BPI-F3 in-order CPU, -Ofast -march=rv64gcv_zba_zbb_zbs:
>
>   Before:
>   ------
>   Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
>
>       4,917,712.97 msec task-clock:u                     #    1.000 CPUs 
> utilized
>              5,314      context-switches:u               #    1.081 /sec
>                  3      cpu-migrations:u                 #    0.001 /sec
>            204,784      page-faults:u                    #   41.642 /sec
>  7,868,291,222,513      cycles:u                         #    1.600 GHz
>  2,615,069,866,153      instructions:u                   #    0.33  insn per 
> cycle
>     10,799,381,890      branches:u                       #    2.196 M/sec
>         15,714,572      branch-misses:u                  #    0.15% of all 
> branches
>
>   After:
>   -----
>   Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
>
>       4,552,979.58 msec task-clock:u                     #    0.998 CPUs 
> utilized
>            205,020      context-switches:u               #   45.030 /sec
>                  2      cpu-migrations:u                 #    0.000 /sec
>            204,221      page-faults:u                    #   44.854 /sec
>  7,285,176,204,764      cycles:u        (7.4% faster)    #    1.600 GHz
>  2,145,284,345,397      instructions:u (17.96% fewer)    #    0.29  insn per 
> cycle
>     10,799,382,011      branches:u                       #    2.372 M/sec
>         16,235,628      branch-misses:u                  #    0.15% of all 
> branches
>
> (2) Wilco reported 20% perf gains on aarch64 Neoverse V2 runs.
>
> gcc/ChangeLog:
>       PR target/11472
>       * params.opt (--param=cycle-accurate-model=): New opt.
>       * doc/invoke.texi (cycle-accurate-model): Document.
>       * haifa-sched.cc (model_excess_group_cost): Return negative
>       delta if param_cycle_accurate_model is 0.
>       (model_excess_cost): Ceil negative baseECC to 0 only if
>       param_cycle_accurate_model is 1.
>       Dump the actual ECC value.
>       * config/riscv/riscv.cc (riscv_option_override): Set param
>       to 0.
>
> gcc/testsuite/ChangeLog:
>       PR target/114729
>       * gcc.target/riscv/riscv.exp: Enable new tests to build.
>       * gcc.target/riscv/sched1-spills/spill1.cpp: Add new test.
>
> Signed-off-by: Vineet Gupta <vine...@rivosinc.com>
> ---
>  gcc/config/riscv/riscv.cc                     |  4 +++
>  gcc/doc/invoke.texi                           |  7 ++++
>  gcc/haifa-sched.cc                            | 32 ++++++++++++++-----
>  gcc/params.opt                                |  4 +++
>  gcc/testsuite/gcc.target/riscv/riscv.exp      |  2 ++
>  .../gcc.target/riscv/sched1-spills/spill1.cpp | 32 +++++++++++++++++++
>  6 files changed, 73 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index 2e9ac280c8f2..75fcadfc3b58 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -10549,6 +10549,10 @@ riscv_option_override (void)
>                      param_sched_pressure_algorithm,
>                      SCHED_PRESSURE_MODEL);
>  
> +  SET_OPTION_IF_UNSET (&global_options, &global_options_set,
> +                    param_cycle_accurate_model,
> +                    0);
> +
>    /* Function to allocate machine-dependent function status.  */
>    init_machine_status = &riscv_init_machine_status;
>  
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 7146163d66d0..c1e07e258b25 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -17084,6 +17084,13 @@ With @option{--param=openacc-privatization=quiet}, 
> don't diagnose.
>  This is the current default.
>  With @option{--param=openacc-privatization=noisy}, do diagnose.
>  
> +@item cycle-accurate-model
> +Specifies whether GCC should assume that the scheduling description is mostly
> +a cycle-accurate model of the target processor, where the code is intended to
> +run on, in the absence of cache misses.  Nonzero means that the selected 
> scheduling
> +model is accuate and likely describes an in-order processor, and that 
> scheduling
> +will aggressively spill to try and fill any pipeline bubbles.
> +
>  @end table
>  
>  The following choices of @var{name} are available on AArch64 targets:
> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
> index 02c893ec5cd3..cd4b6baddcd2 100644
> --- a/gcc/haifa-sched.cc
> +++ b/gcc/haifa-sched.cc
> @@ -2398,11 +2398,18 @@ model_excess_group_cost (struct model_pressure_group 
> *group,
>    int pressure, cl;
>  
>    cl = ira_pressure_classes[pci];
> -  if (delta < 0 && point >= group->limits[pci].point)
> +  if (delta < 0)
>      {
> -      pressure = MAX (group->limits[pci].orig_pressure,
> -                   curr_reg_pressure[cl] + delta);
> -      return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]);
> +      if (point >= group->limits[pci].point)
> +     {
> +       pressure = MAX (group->limits[pci].orig_pressure,
> +                       curr_reg_pressure[cl] + delta);
> +       return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]);
> +     }
> +      /* if target prefers fewer spills, return the -ve delta indicating
> +      pressure reduction.  */
> +      else if (!param_cycle_accurate_model)
> +       return delta;
>      }
>  
>    if (delta > 0)
> @@ -2453,7 +2460,7 @@ model_excess_cost (rtx_insn *insn, bool print_p)
>      }
>  
>    if (print_p)
> -    fprintf (sched_dump, "\n");
> +    fprintf (sched_dump, " ECC %d\n", cost);
>  
>    return cost;
>  }
> @@ -2489,8 +2496,9 @@ model_set_excess_costs (rtx_insn **insns, int count)
>    bool print_p;
>  
>    /* Record the baseECC value for each instruction in the model schedule,
> -     except that negative costs are converted to zero ones now rather than
> -     later.  Do not assign a cost to debug instructions, since they must
> +     except that for targets which prefer wider schedules (more spills)
> +     negative costs are converted to zero ones now rather than later.
> +     Do not assign a cost to debug instructions, since they must
>       not change code-generation decisions.  Experiments suggest we also
>       get better results by not assigning a cost to instructions from
>       a different block.
> @@ -2512,7 +2520,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>           print_p = true;
>         }
>       cost = model_excess_cost (insns[i], print_p);
> -     if (cost <= 0)
> +     if (param_cycle_accurate_model && cost <= 0)
>         {
>           priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
>           priority_base = MAX (priority_base, priority);
> @@ -2523,6 +2531,14 @@ model_set_excess_costs (rtx_insn **insns, int count)
>    if (print_p)
>      fprintf (sched_dump, MODEL_BAR);
>  
> +  /* Typically in-order cores have a good pipeline scheduling model and the
> +     algorithm would try to use that to minimize bubbles, favoring spills.
> +     MAX (baseECC, 0) below changes negative baseECC (pressure reduction)
> +     to 0 (pressure neutral) thus tending to more spills.
> +     Otherwise return.  */
> +  if (!param_cycle_accurate_model)
> +    return;
> +
>    /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
>       instruction.  */
>    for (i = 0; i < count; i++)
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 7c572774df24..6d73993cd089 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -66,6 +66,10 @@ Enable asan stack protection.
>  Common Joined UInteger Var(param_asan_use_after_return) Init(1) 
> IntegerRange(0, 1) Param Optimization
>  Enable asan detection of use-after-return bugs.
>  
> +-param=cycle-accurate-model
> +Common Joined UInteger Var(param_cycle_accurate_model) Init(1) 
> IntegerRange(0, 1) Param Optimization
> +Whether the scheduling description is mostly a cycle-accurate model of the 
> target processor and is likely to be spill aggressively to fill any pipeline 
> bubbles.
> +
>  -param=hwasan-instrument-stack=
>  Common Joined UInteger Var(param_hwasan_instrument_stack) Init(1) 
> IntegerRange(0, 1) Param Optimization
>  Enable hwasan instrumentation of statically sized stack-allocated variables.
> diff --git a/gcc/testsuite/gcc.target/riscv/riscv.exp 
> b/gcc/testsuite/gcc.target/riscv/riscv.exp
> index 187eb6640470..3cbbf63b9d0a 100644
> --- a/gcc/testsuite/gcc.target/riscv/riscv.exp
> +++ b/gcc/testsuite/gcc.target/riscv/riscv.exp
> @@ -38,6 +38,8 @@ dg-init
>  # Main loop.
>  gcc-dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/*.\[cS\]]] \
>       "" $DEFAULT_CFLAGS
> +gcc-dg-runtest [lsort [glob -nocomplain 
> $srcdir/$subdir/sched1-spills/*.{\[cS\],cpp}]] \
> +     "" $DEFAULT_CFLAGS
>  
>  # All done.
>  dg-finish
> diff --git a/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp 
> b/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
> new file mode 100644
> index 000000000000..8060ec245281
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
> @@ -0,0 +1,32 @@
> +/* { dg-options "-O2 -march=rv64gc -mabi=lp64d -save-temps -fverbose-asm" } 
> */
> +/* { dg-skip-if "" { *-*-* } { "-O0" "O1" "-Og" "-Os" "-Oz" } } */
> +
> +/* Reduced from SPEC2017 Cactu ML_BSSN_Advect.cpp
> +   by comparing -fschedule-insn and -fno-schedule-insns builds.
> +   Shows up one extra spill (pair of spill markers "sfp") in verbose asm
> +   output which the patch fixes.  */
> +
> +void s();
> +double b, c, d, e, f, g, h, k, l, m, n, o, p, q, t, u, v;
> +int *j;
> +double *r, *w;
> +long x;
> +void y() {
> +  double *a((double *)s);
> +  for (;;)
> +    for (; j[1];)
> +      for (int i = 1; i < j[0]; i++) {
> +        k = l;
> +        m = n;
> +        o = p = q;
> +        r[0] = t;
> +        a[0] = u;
> +        x = g;
> +        e = f;
> +        v = w[x];
> +        b = c;
> +        d = h;
> +        }
> +}
> +
> +/* { dg-final { scan-assembler-not "%sfp" } } */

Reply via email to