On 8/7/24 11:47 AM, Richard Sandiford wrote:
I should probably start by saying that the "model" heuristic is now
pretty old and was originally tuned for an in-order AArch32 core.
The aim wasn't to *minimise* spilling, but to strike a better balance
between parallelising with spills vs. sequentialising.  At the time,
scheduling without taking register pressure into account would overly
parallelise things, whereas the original -fsched-pressure would overly
serialise (i.e. was too conservative).

There were specific workloads in, er, a formerly popular embedded
benchmark that benefitted significantly from *some* spilling.

This comment probably sums up the trade-off best:

    This pressure cost is deliberately timid.  The intention has been
    to choose a heuristic that rarely interferes with the normal list
    scheduler in cases where that scheduler would produce good code.
    We simply want to curb some of its worst excesses.

Because it was tuned for an in-order core, it was operating in an
environment where instruction latencies were meaningful and realistic.
So it still deferred to those to quite a big extent.  This is almost
certainly too conservative for out-of-order cores.

What's interesting here is that the increased spilling roughly doubles the number of dynamic instructions we have to execute for the benchmark.  While a good uarch design can hide a lot of that overhead, it's still crazy bad.

Note that in rank_for_schedule we check pressure state before we check
priority.  So it's a bit unclear why RFS_PRIORITY was selected when
comparing insns 55 and 54.  I guess that would tend to indicate that the
ECC wasn't enough to make a difference:

Indeed.  The ECC delta would have to be neutral for the pressure check
to be skipped.  The order of things checked is what I enumerated above.

I'm currently pursuing a different trail, which comes from the
observation that the initial model setup concludes that pressure is 28,
so with 27 allocatable regs we are bound to spill one.
More on that after I find something concrete.

...I think for OoO cores, this:

    baseECC (X) could itself be used as the ECC value described above.
    However, this is often too conservative, in the sense that it
    tends to make high-priority instructions that increase pressure
    wait too long in cases where introducing a spill would be better.
    For this reason the final ECC is a priority-adjusted form of
    baseECC (X).  Specifically, we calculate:

      P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
      baseP = MAX { P (X) | baseECC (X) <= 0 }

    Then:

      ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)

    Thus an instruction's effect on pressure is ignored if it has a high
    enough priority relative to the ones that don't increase pressure.
    Negative values of baseECC (X) do not increase the priority of X
    itself, but they do make it harder for other instructions to
    increase the pressure further.

is probably not appropriate.  We should probably just use the baseECC,
as suggested by the first sentence in the comment.  It looks like the hack:

diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
index 1bc610f9a5f..9601e929a88 100644
--- a/gcc/haifa-sched.cc
+++ b/gcc/haifa-sched.cc
@@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
            print_p = true;
          }
        cost = model_excess_cost (insns[i], print_p);
-       if (cost <= 0)
+       if (cost <= 0 && 0)
          {
            priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
            priority_base = MAX (priority_base, priority);
@@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
/* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
       instruction.  */
+  if (0)
    for (i = 0; i < count; i++)
      {
        cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);

fixes things for me.  Perhaps we should replace these && 0s
with a query for an out-of-order core?

I haven't benchmarked this. :)  And I'm looking at the code for the
first time in many years, so I'm certainly forgetting details.
Well, I think we're probably too focused on OoO vs. in-order.  The badness we're seeing would likely trigger on the in-order RISC-V implementations out there as well.  Vineet, just to be sure, which core are we scheduling for in your SPEC tests?

I'm sure Vineet will correct me if I'm wrong, but I think this is all about spilling of address computations.  One of the second-order questions I've had is whether or not it's a cascading effect in the actual benchmark.  i.e., we spill so that we can compute some address.  Does that in turn force something else to need to spill, and so on until we finally settle down?  If so, can we characterize when that happens and perhaps make spill avoidance more aggressive when it looks like we're spilling address computations that are likely to have this kind of cascading effect?

Jeff

Thanks,
Richard
