sched1 computes ECC (Excess Change Cost) for each insn, which represents
the register pressure attributed to the insn.
Currently the pressure-sensitive scheduling algorithm deliberately ignores
negative ECC values (pressure reduction), making them 0 (neutral), leading
to more spills. This happens due to the assumption that the compiler has
a reasonably accurate processor pipeline scheduling model and thus tries
to aggressively fill pipeline bubbles with spill slots.

This however might not be true, as the model might not be available for
certain uarches, or might not even be applicable, especially for modern
out-of-order cores.

The existing heuristic induces spill frenzy on RISC-V, noticeably so on
SPEC2017 507.Cactu. If insn scheduling is disabled completely, the
total dynamic icount for this workload is roughly halved, from
~2.5 trillion insns to ~1.3 trillion (w/ -fno-schedule-insns).

This patch adds --param=cycle-accurate-model={0,1} to gate the spill
behavior.

 - The default (1) preserves existing spill behavior.

 - Targets/uarches sensitive to spilling can override the param to (0)
   so that negative ECC values are kept and fewer spills result. The
   RISC-V backend does so.
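
As a rough illustration of the gating described above, here is a
standalone sketch. It is not the haifa-sched.cc code: the helper name
effective_ecc is hypothetical, and the real logic also folds in insn
priority.

```c
#include <stdbool.h>

/* Hypothetical model of the gate.  CYCLE_ACCURATE_MODEL stands in for
   --param=cycle-accurate-model (param_cycle_accurate_model in GCC).  */
int
effective_ecc (int base_ecc, bool cycle_accurate_model)
{
  /* With a trusted pipeline model, a pressure-reducing insn (negative
     cost) is treated as pressure neutral.  */
  if (cycle_accurate_model && base_ecc < 0)
    return 0;

  /* Otherwise the negative value is kept, so the scheduler can prefer
     insns that reduce register pressure.  */
  return base_ecc;
}
```

Under this sketch, effective_ecc (-3, true) yields 0 while
effective_ecc (-3, false) yields -3; positive costs pass through
unchanged in both modes.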

The actual perf numbers are very promising.

(1) On RISC-V BPI-F3 in-order CPU, -Ofast -march=rv64gcv_zba_zbb_zbs:

  Before:
  ------
  Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':

      4,917,712.97 msec task-clock:u                     #    1.000 CPUs utilized
             5,314      context-switches:u               #    1.081 /sec
                 3      cpu-migrations:u                 #    0.001 /sec
           204,784      page-faults:u                    #   41.642 /sec
 7,868,291,222,513      cycles:u                         #    1.600 GHz
 2,615,069,866,153      instructions:u                   #    0.33  insn per cycle
    10,799,381,890      branches:u                       #    2.196 M/sec
        15,714,572      branch-misses:u                  #    0.15% of all branches

  After:
  -----
  Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':

      4,552,979.58 msec task-clock:u                     #    0.998 CPUs utilized
           205,020      context-switches:u               #   45.030 /sec
                 2      cpu-migrations:u                 #    0.000 /sec
           204,221      page-faults:u                    #   44.854 /sec
 7,285,176,204,764      cycles:u        (7.4% faster)    #    1.600 GHz
 2,145,284,345,397      instructions:u (17.96% fewer)    #    0.29  insn per cycle
    10,799,382,011      branches:u                       #    2.372 M/sec
        16,235,628      branch-misses:u                  #    0.15% of all branches

(2) Wilco reported 20% perf gains on aarch64 Neoverse V2 runs.
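
The percentages annotated on the BPI-F3 counters in (1) follow from the
raw before/after values. A minimal sketch of that arithmetic (the helper
name percent_reduction is hypothetical and not part of the patch):

```c
/* Percentage by which AFTER is lower than BEFORE.  */
double
percent_reduction (double before, double after)
{
  return (before - after) / before * 100.0;
}
```

Calling it with the cycle counts (7,868,291,222,513 and
7,285,176,204,764) gives roughly 7.4, and with the instruction counts
(2,615,069,866,153 and 2,145,284,345,397) roughly 17.96.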

gcc/ChangeLog:
        PR target/114729
        * params.opt (--param=cycle-accurate-model=): New opt.
        * doc/invoke.texi (cycle-accurate-model): Document.
        * haifa-sched.cc (model_excess_group_cost): Return negative
        delta if param_cycle_accurate_model is 0.
        (model_excess_cost): Dump the actual ECC value.
        (model_set_excess_costs): Clamp negative baseECC to 0 only if
        param_cycle_accurate_model is 1.
        * config/riscv/riscv.cc (riscv_option_override): Set param
        to 0.

gcc/testsuite/ChangeLog:
        PR target/114729
        * gcc.target/riscv/riscv.exp: Enable new tests to build.
        * gcc.target/riscv/sched1-spills/spill1.cpp: Add new test.

Signed-off-by: Vineet Gupta <vine...@rivosinc.com>
---
 gcc/config/riscv/riscv.cc                     |  4 +++
 gcc/doc/invoke.texi                           | 10 ++++++
 gcc/haifa-sched.cc                            | 32 ++++++++++++++-----
 gcc/params.opt                                |  4 +++
 gcc/testsuite/gcc.target/riscv/riscv.exp      |  2 ++
 .../gcc.target/riscv/sched1-spills/spill1.cpp | 32 +++++++++++++++++++
 6 files changed, 76 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 0a6c00926b31..b9ddf7caee10 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -10616,6 +10616,10 @@ riscv_option_override (void)
                       param_sched_pressure_algorithm,
                       SCHED_PRESSURE_MODEL);
 
+  SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+                      param_cycle_accurate_model,
+                      0);
+
   /* Function to allocate machine-dependent function status.  */
   init_machine_status = &riscv_init_machine_status;
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index e3c2adc25077..d2409a41d50a 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -17211,6 +17211,16 @@ With @option{--param=openacc-privatization=quiet}, don't diagnose.
 This is the current default.
 With @option{--param=openacc-privatization=noisy}, do diagnose.
 
+@item cycle-accurate-model
+Specifies whether GCC should assume that the scheduling description is mostly
+a cycle-accurate model of the target processor on which the code is intended
+to run, in the absence of cache misses.  Nonzero means that the selected
+scheduling model is accurate and likely describes an in-order processor,
+and that scheduling will aggressively spill to try and fill any pipeline
+bubbles.  This is the current default.  Zero means that the scheduling
+description might not be available or accurate, or might not be applicable
+at all, such as for modern out-of-order processors.
+
 @end table
 
 The following choices of @var{name} are available on AArch64 targets:
diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
index 02c893ec5cd3..cd4b6baddcd2 100644
--- a/gcc/haifa-sched.cc
+++ b/gcc/haifa-sched.cc
@@ -2398,11 +2398,18 @@ model_excess_group_cost (struct model_pressure_group *group,
   int pressure, cl;
 
   cl = ira_pressure_classes[pci];
-  if (delta < 0 && point >= group->limits[pci].point)
+  if (delta < 0)
     {
-      pressure = MAX (group->limits[pci].orig_pressure,
-                     curr_reg_pressure[cl] + delta);
-      return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]);
+      if (point >= group->limits[pci].point)
+       {
+         pressure = MAX (group->limits[pci].orig_pressure,
+                         curr_reg_pressure[cl] + delta);
+         return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]);
+       }
+      /* If the target prefers fewer spills, return the negative delta
+        indicating pressure reduction.  */
+      else if (!param_cycle_accurate_model)
+       return delta;
     }
 
   if (delta > 0)
@@ -2453,7 +2460,7 @@ model_excess_cost (rtx_insn *insn, bool print_p)
     }
 
   if (print_p)
-    fprintf (sched_dump, "\n");
+    fprintf (sched_dump, " ECC %d\n", cost);
 
   return cost;
 }
@@ -2489,8 +2496,9 @@ model_set_excess_costs (rtx_insn **insns, int count)
   bool print_p;
 
   /* Record the baseECC value for each instruction in the model schedule,
-     except that negative costs are converted to zero ones now rather than
-     later.  Do not assign a cost to debug instructions, since they must
+     except that for targets which prefer wider schedules (more spills)
+     negative costs are converted to zero ones now rather than later.
+     Do not assign a cost to debug instructions, since they must
      not change code-generation decisions.  Experiments suggest we also
      get better results by not assigning a cost to instructions from
      a different block.
@@ -2512,7 +2520,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
            print_p = true;
          }
        cost = model_excess_cost (insns[i], print_p);
-       if (cost <= 0)
+       if (param_cycle_accurate_model && cost <= 0)
          {
            priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
            priority_base = MAX (priority_base, priority);
@@ -2523,6 +2531,14 @@ model_set_excess_costs (rtx_insn **insns, int count)
   if (print_p)
     fprintf (sched_dump, MODEL_BAR);
 
+  /* Typically in-order cores have a good pipeline scheduling model and the
+     algorithm would try to use that to minimize bubbles, favoring spills.
+     MAX (baseECC, 0) below changes negative baseECC (pressure reduction)
+     to 0 (pressure neutral) thus tending to more spills.
+     Otherwise return.  */
+  if (!param_cycle_accurate_model)
+    return;
+
   /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
      instruction.  */
   for (i = 0; i < count; i++)
diff --git a/gcc/params.opt b/gcc/params.opt
index 6efadd8ca163..f5cc71d0f493 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -66,6 +66,10 @@ Enable asan stack protection.
 Common Joined UInteger Var(param_asan_use_after_return) Init(1) IntegerRange(0, 1) Param Optimization
 Enable asan detection of use-after-return bugs.
 
+-param=cycle-accurate-model=
+Common Joined UInteger Var(param_cycle_accurate_model) Init(1) IntegerRange(0, 1) Param Optimization
+Whether the scheduling description is mostly a cycle-accurate model of the target processor, in which case the scheduler is likely to spill aggressively to fill any pipeline bubbles.
+
 -param=hwasan-instrument-stack=
 Common Joined UInteger Var(param_hwasan_instrument_stack) Init(1) IntegerRange(0, 1) Param Optimization
 Enable hwasan instrumentation of statically sized stack-allocated variables.
diff --git a/gcc/testsuite/gcc.target/riscv/riscv.exp b/gcc/testsuite/gcc.target/riscv/riscv.exp
index 3620ece47582..ce84081183f8 100644
--- a/gcc/testsuite/gcc.target/riscv/riscv.exp
+++ b/gcc/testsuite/gcc.target/riscv/riscv.exp
@@ -38,6 +38,8 @@ dg-init
 # Main loop.
 gcc-dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/*.\[cS\]]] \
        "" $DEFAULT_CFLAGS
+gcc-dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/sched1-spills/*.{\[cS\],cpp}]] \
+       "" $DEFAULT_CFLAGS
 
 # Saturation alu
 foreach opt {
diff --git a/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp b/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
new file mode 100644
index 000000000000..8060ec245281
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
@@ -0,0 +1,32 @@
+/* { dg-options "-O2 -march=rv64gc -mabi=lp64d -save-temps -fverbose-asm" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-O1" "-Og" "-Os" "-Oz" } } */
+
+/* Reduced from SPEC2017 Cactu ML_BSSN_Advect.cpp
+   by comparing -fschedule-insns and -fno-schedule-insns builds.
+   Shows up one extra spill (pair of spill markers "sfp") in verbose asm
+   output which the patch fixes.  */
+
+void s();
+double b, c, d, e, f, g, h, k, l, m, n, o, p, q, t, u, v;
+int *j;
+double *r, *w;
+long x;
+void y() {
+  double *a((double *)s);
+  for (;;)
+    for (; j[1];)
+      for (int i = 1; i < j[0]; i++) {
+        k = l;
+        m = n;
+        o = p = q;
+        r[0] = t;
+        a[0] = u;
+        x = g;
+        e = f;
+        v = w[x];
+        b = c;
+        d = h;
+        }
+}
+
+/* { dg-final { scan-assembler-not "%sfp" } } */
-- 
2.43.0
