On 11/6/24 12:11, Vineet Gupta wrote: > changes since v1 > * Changed target hook to --param > * squash addon patch for RISC-V opting-in, testcase here > * updated changelog with latest perf numbers
ping ! > --- > > sched1 computes ECC (Excess Change Cost) for each insn, which represents > the register pressure attributed to the insn. > Currently the pressure sensitive schduling algorithm deliberately ignores > negative values (pressure reduction), making them 0 (neutral), leading > to more spills. This happens due to the assumption that the compiler has > a reasonably accurate processor pipeline scheduling model and thus tries > to aggresively fill pipeline bubbles with spill slots. > > This however might not be true, as the model might not be available for > certains uarches or even applicable especially for modern out-of-order cores. > > The existing heuristic induces spill frenzy on RISC-V, noticably so on > SPEC2017 507.Cactu. If insn scheduling is disabled completely, the > total dynamic icounts for this workload are reduced in half from > ~2.5 trillion insns to ~1.3 (w/ -fno-schedule-insns). > > This patch adds --param=cycle-accurate-model={0,1} to gate the spill > behavior. > > - The default (1) preserves existing spill behavior. > > - targets/uarches sensitive to spilling can override the param to (0) > to get the reverse effect. RISC-V backend does so too. > > The actual perf numbers are very promising. > > (1) On RISC-V BPI-F3 in-order CPU, -Ofast -march=rv64gcv_zba_zbb_zbs: > > Before: > ------ > Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par': > > 4,917,712.97 msec task-clock:u # 1.000 CPUs > utilized > 5,314 context-switches:u # 1.081 /sec > 3 cpu-migrations:u # 0.001 /sec > 204,784 page-faults:u # 41.642 /sec > 7,868,291,222,513 cycles:u # 1.600 GHz > 2,615,069,866,153 instructions:u # 0.33 insn per > cycle > 10,799,381,890 branches:u # 2.196 M/sec > 15,714,572 branch-misses:u # 0.15% of all > branches > > After: > ----- > Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par': > > 4,552,979.58 msec task-clock:u # 0.998 CPUs > utilized > 205,020 context-switches:u # 45.030 /sec > 2 cpu-migrations:u # 0.000 /sec > 204,221 page-faults:u # 44.854 /sec > 7,285,176,204,764 cycles:u (7.4% faster) # 1.600 GHz > 2,145,284,345,397 instructions:u (17.96% fewer) # 0.29 insn per > cycle > 10,799,382,011 branches:u # 2.372 M/sec > 16,235,628 branch-misses:u # 0.15% of all > branches > > (2) Wilco reported 20% perf gains on aarch64 Neoverse V2 runs. > > gcc/ChangeLog: > PR target/11472 > * params.opt (--param=cycle-accurate-model=): New opt. > * doc/invoke.texi (cycle-accurate-model): Document. > * haifa-sched.cc (model_excess_group_cost): Return negative > delta if param_cycle_accurate_model is 0. > (model_excess_cost): Ceil negative baseECC to 0 only if > param_cycle_accurate_model is 1. > Dump the actual ECC value. > * config/riscv/riscv.cc (riscv_option_override): Set param > to 0. > > gcc/testsuite/ChangeLog: > PR target/114729 > * gcc.target/riscv/riscv.exp: Enable new tests to build. > * gcc.target/riscv/sched1-spills/spill1.cpp: Add new test. > > Signed-off-by: Vineet Gupta <vine...@rivosinc.com> > --- > gcc/config/riscv/riscv.cc | 4 +++ > gcc/doc/invoke.texi | 7 ++++ > gcc/haifa-sched.cc | 32 ++++++++++++++----- > gcc/params.opt | 4 +++ > gcc/testsuite/gcc.target/riscv/riscv.exp | 2 ++ > .../gcc.target/riscv/sched1-spills/spill1.cpp | 32 +++++++++++++++++++ > 6 files changed, 73 insertions(+), 8 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp > > diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc > index 2e9ac280c8f2..75fcadfc3b58 100644 > --- a/gcc/config/riscv/riscv.cc > +++ b/gcc/config/riscv/riscv.cc > @@ -10549,6 +10549,10 @@ riscv_option_override (void) > param_sched_pressure_algorithm, > SCHED_PRESSURE_MODEL); > > + SET_OPTION_IF_UNSET (&global_options, &global_options_set, > + param_cycle_accurate_model, > + 0); > + > /* Function to allocate machine-dependent function status. */ > init_machine_status = &riscv_init_machine_status; > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > index 7146163d66d0..c1e07e258b25 100644 > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -17084,6 +17084,13 @@ With @option{--param=openacc-privatization=quiet}, > don't diagnose. > This is the current default. > With @option{--param=openacc-privatization=noisy}, do diagnose. > > +@item cycle-accurate-model > +Specifies whether GCC should assume that the scheduling description is mostly > +a cycle-accurate model of the target processor, where the code is intended to > +run on, in the absence of cache misses. Nonzero means that the selected > scheduling > +model is accuate and likely describes an in-order processor, and that > scheduling > +will aggressively spill to try and fill any pipeline bubbles. > + > @end table > > The following choices of @var{name} are available on AArch64 targets: > diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc > index 02c893ec5cd3..cd4b6baddcd2 100644 > --- a/gcc/haifa-sched.cc > +++ b/gcc/haifa-sched.cc > @@ -2398,11 +2398,18 @@ model_excess_group_cost (struct model_pressure_group > *group, > int pressure, cl; > > cl = ira_pressure_classes[pci]; > - if (delta < 0 && point >= group->limits[pci].point) > + if (delta < 0) > { > - pressure = MAX (group->limits[pci].orig_pressure, > - curr_reg_pressure[cl] + delta); > - return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]); > + if (point >= group->limits[pci].point) > + { > + pressure = MAX (group->limits[pci].orig_pressure, > + curr_reg_pressure[cl] + delta); > + return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]); > + } > + /* if target prefers fewer spills, return the -ve delta indicating > + pressure reduction. */ > + else if (!param_cycle_accurate_model) > + return delta; > } > > if (delta > 0) > @@ -2453,7 +2460,7 @@ model_excess_cost (rtx_insn *insn, bool print_p) > } > > if (print_p) > - fprintf (sched_dump, "\n"); > + fprintf (sched_dump, " ECC %d\n", cost); > > return cost; > } > @@ -2489,8 +2496,9 @@ model_set_excess_costs (rtx_insn **insns, int count) > bool print_p; > > /* Record the baseECC value for each instruction in the model schedule, > - except that negative costs are converted to zero ones now rather than > - later. Do not assign a cost to debug instructions, since they must > + except that for targets which prefer wider schedules (more spills) > + negative costs are converted to zero ones now rather than later. > + Do not assign a cost to debug instructions, since they must > not change code-generation decisions. Experiments suggest we also > get better results by not assigning a cost to instructions from > a different block. > @@ -2512,7 +2520,7 @@ model_set_excess_costs (rtx_insn **insns, int count) > print_p = true; > } > cost = model_excess_cost (insns[i], print_p); > - if (cost <= 0) > + if (param_cycle_accurate_model && cost <= 0) > { > priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost; > priority_base = MAX (priority_base, priority); > @@ -2523,6 +2531,14 @@ model_set_excess_costs (rtx_insn **insns, int count) > if (print_p) > fprintf (sched_dump, MODEL_BAR); > > + /* Typically in-order cores have a good pipeline scheduling model and the > + algorithm would try to use that to minimize bubbles, favoring spills. > + MAX (baseECC, 0) below changes negative baseECC (pressure reduction) > + to 0 (pressure neutral) thus tending to more spills. > + Otherwise return. */ > + if (!param_cycle_accurate_model) > + return; > + > /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each > instruction. */ > for (i = 0; i < count; i++) > diff --git a/gcc/params.opt b/gcc/params.opt > index 7c572774df24..6d73993cd089 100644 > --- a/gcc/params.opt > +++ b/gcc/params.opt > @@ -66,6 +66,10 @@ Enable asan stack protection. > Common Joined UInteger Var(param_asan_use_after_return) Init(1) > IntegerRange(0, 1) Param Optimization > Enable asan detection of use-after-return bugs. > > +-param=cycle-accurate-model > +Common Joined UInteger Var(param_cycle_accurate_model) Init(1) > IntegerRange(0, 1) Param Optimization > +Whether the scheduling description is mostly a cycle-accurate model of the > target processor and is likely to be spill aggressively to fill any pipeline > bubbles. > + > -param=hwasan-instrument-stack= > Common Joined UInteger Var(param_hwasan_instrument_stack) Init(1) > IntegerRange(0, 1) Param Optimization > Enable hwasan instrumentation of statically sized stack-allocated variables. > diff --git a/gcc/testsuite/gcc.target/riscv/riscv.exp > b/gcc/testsuite/gcc.target/riscv/riscv.exp > index 187eb6640470..3cbbf63b9d0a 100644 > --- a/gcc/testsuite/gcc.target/riscv/riscv.exp > +++ b/gcc/testsuite/gcc.target/riscv/riscv.exp > @@ -38,6 +38,8 @@ dg-init > # Main loop. > gcc-dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/*.\[cS\]]] \ > "" $DEFAULT_CFLAGS > +gcc-dg-runtest [lsort [glob -nocomplain > $srcdir/$subdir/sched1-spills/*.{\[cS\],cpp}]] \ > + "" $DEFAULT_CFLAGS > > # All done. > dg-finish > diff --git a/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp > b/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp > new file mode 100644 > index 000000000000..8060ec245281 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp > @@ -0,0 +1,32 @@ > +/* { dg-options "-O2 -march=rv64gc -mabi=lp64d -save-temps -fverbose-asm" } > */ > +/* { dg-skip-if "" { *-*-* } { "-O0" "O1" "-Og" "-Os" "-Oz" } } */ > + > +/* Reduced from SPEC2017 Cactu ML_BSSN_Advect.cpp > + by comparing -fschedule-insn and -fno-schedule-insns builds. > + Shows up one extra spill (pair of spill markers "sfp") in verbose asm > + output which the patch fixes. */ > + > +void s(); > +double b, c, d, e, f, g, h, k, l, m, n, o, p, q, t, u, v; > +int *j; > +double *r, *w; > +long x; > +void y() { > + double *a((double *)s); > + for (;;) > + for (; j[1];) > + for (int i = 1; i < j[0]; i++) { > + k = l; > + m = n; > + o = p = q; > + r[0] = t; > + a[0] = u; > + x = g; > + e = f; > + v = w[x]; > + b = c; > + d = h; > + } > +} > + > +/* { dg-final { scan-assembler-not "%sfp" } } */