On Thu, Aug 8, 2024 at 12:17 AM Vineet Gupta <vine...@rivosinc.com> wrote:
>
> On 8/7/24 12:28, Jeff Law wrote:
> > On 8/7/24 11:47 AM, Richard Sandiford wrote:
> >> I should probably start by saying that the "model" heuristic is now
> >> pretty old and was originally tuned for an in-order AArch32 core.
> >> The aim wasn't to *minimise* spilling, but to strike a better balance
> >> between parallelising with spills vs. sequentialising.  At the time,
> >> scheduling without taking register pressure into account would overly
> >> parallelise things, whereas the original -fsched-pressure would overly
> >> serialise (i.e. was too conservative).
> >>
> >> There were specific workloads in, er, a formerly popular embedded
> >> benchmark that benefitted significantly from *some* spilling.
> >>
> >> This comment probably sums up the trade-off best:
> >>
> >>     This pressure cost is deliberately timid.  The intention has been
> >>     to choose a heuristic that rarely interferes with the normal list
> >>     scheduler in cases where that scheduler would produce good code.
> >>     We simply want to curb some of its worst excesses.
> >>
> >> Because it was tuned for an in-order core, it was operating in an
> >> environment where instruction latencies were meaningful and realistic.
> >> So it still deferred to those to quite a big extent.  This is almost
> >> certainly too conservative for out-of-order cores.
> > What's interesting here is that the increased spilling roughly doubles
> > the number of dynamic instructions we have to execute for the benchmark.
> >   While a good uarch design can hide a lot of that overhead, it's still
> > crazy bad.
>
> [snip...]
>
> >> ...I think for OoO cores, this:
> >>
> >>     baseECC (X) could itself be used as the ECC value described above.
> >>     However, this is often too conservative, in the sense that it
> >>     tends to make high-priority instructions that increase pressure
> >>     wait too long in cases where introducing a spill would be better.
> >>     For this reason the final ECC is a priority-adjusted form of
> >>     baseECC (X).  Specifically, we calculate:
> >>
> >>       P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
> >>       baseP = MAX { P (X) | baseECC (X) <= 0 }
> >>
> >>     Then:
> >>
> >>       ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)
> >>
> >>     Thus an instruction's effect on pressure is ignored if it has a high
> >>     enough priority relative to the ones that don't increase pressure.
> >>     Negative values of baseECC (X) do not increase the priority of X
> >>     itself, but they do make it harder for other instructions to
> >>     increase the pressure further.
> >>
> >> is probably not appropriate.  We should probably just use the baseECC,
> >> as suggested by the first sentence in the comment.  It looks like the hack:
> >>
> >> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
> >> index 1bc610f9a5f..9601e929a88 100644
> >> --- a/gcc/haifa-sched.cc
> >> +++ b/gcc/haifa-sched.cc
> >> @@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
> >>          print_p = true;
> >>        }
> >>      cost = model_excess_cost (insns[i], print_p);
> >> -    if (cost <= 0)
> >> +    if (cost <= 0 && 0)
> >>        {
> >>          priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
> >>          priority_base = MAX (priority_base, priority);
> >> @@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
> >>
> >>     /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
> >>        instruction.  */
> >> +  if (0)
> >>     for (i = 0; i < count; i++)
> >>       {
> >>         cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);
> >>
> >> fixes things for me.  Perhaps we should replace these && 0s
> >> with a query for an out-of-order core?
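
To make the priority adjustment above concrete, here is a tiny standalone
sketch (made-up priorities and costs, just the arithmetic from the quoted
comment, not GCC code):

  /* Illustrative only: A and B don't increase pressure (baseECC == 0),
     C and D do (baseECC == 4); D has a high priority.  */
  #include <stdio.h>

  #define MAX(a, b) ((a) > (b) ? (a) : (b))
  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  int
  main (void)
  {
    /* P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X).  */
    int p_a = 10 - 1 - 0;           /* 9 */
    int p_b = 8 - 1 - 0;            /* 7 */
    /* baseP = MAX { P (X) | baseECC (X) <= 0 }.  */
    int base_p = MAX (p_a, p_b);    /* 9 */

    /* Low-priority C keeps its full pressure cost.  */
    int p_c = 2 - 1 - 4;                          /* -3 */
    int ecc_c = MAX (MIN (base_p - p_c, 4), 0);   /* MIN (12, 4) -> 4 */

    /* High-priority D has its pressure cost ignored entirely.  */
    int p_d = 20 - 1 - 4;                         /* 15 */
    int ecc_d = MAX (MIN (base_p - p_d, 4), 0);   /* MIN (-6, 4) -> 0 */

    printf ("ECC(C) = %d, ECC(D) = %d\n", ecc_c, ecc_d);
    return 0;
  }

With plain baseECC both C and D would be charged 4; the adjustment lets the
high-priority D increase pressure for free, which is presumably the behaviour
that hurts here and which the && 0 hack above switches off.
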
>
> Yes, removing this heuristic does improve things, but unfortunately it seems
> there's more in sched1 that needs unraveling - Jeff is right after all :-)
>
>                                                                     | upstream | -fno-schedule-insns |   Patch |
> _ZL24ML_BSSN_Dissipation_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |   55,702 |              43,132 |  45,788 |
> _ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv      |  144,278 |              59,204 | 132,588 |
> _ZL24ML_BSSN_constraints_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |  321,476 |             138,074 | 253,206 |
> _ZL16ML_BSSN_RHS_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv         |  483,794 |             179,694 | 360,286 |

Note that even on x86 we spill like crazy in these functions - we are
dealing with >16 memory streams here, so with the tendency to hoist
loads and sink stores some spilling is inevitable.

So this isn't a good testcase to tune the scheduler with but rather a
bad outlier.
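
(For anyone who hasn't looked at these kernels: the shape is roughly the toy
sketch below - not the actual code, the identifiers are made up - where once
the loads are hoisted to the top of an iteration and the stores sunk to the
bottom, far more values are live at once than any register file holds, so
some spilling happens no matter what sched1 decides.)

  /* Toy stand-in for a kernel with many independent memory streams.  */
  void
  toy_kernel (double *restrict out0, double *restrict out1,
              const double *restrict in, int n)
  {
    for (int i = 0; i < n; i++)
      {
        /* Aggressive scheduling hoists all of these loads...  */
        double a0 = in[i],         a1 = in[i + n],     a2 = in[i + 2 * n];
        double a3 = in[i + 3 * n], a4 = in[i + 4 * n], a5 = in[i + 5 * n];
        double a6 = in[i + 6 * n], a7 = in[i + 7 * n], a8 = in[i + 8 * n];
        double a9 = in[i + 9 * n];
        /* ... and every a0..a9 (plus partial results) stays live until the
           stores below; scale this to the real kernel's dozens of streams
           and the FP register budget is exceeded by construction.  */
        out0[i] = a0 * a1 + a2 * a3 + a4 * a5 + a6 * a7 + a8 * a9;
        out1[i] = a0 * a9 + a1 * a8 + a2 * a7 + a3 * a6 + a4 * a5;
      }
  }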

>
>
> >>
> >> I haven't benchmarked this. :)  And I'm looking at the code for the
> >> first time in many years, so I'm certainly forgetting details.
> > Well, I think we're probably too focused on ooo vs in-order.  The
> > badness we're seeing I think would likely trigger on the in-order risc-v
> > implementations out there as well.  Vineet, just to be sure, what core
> > are we scheduling for in your spec tests?
>
> I'm not specifying anything, so it has to be the default mtune of rocket,
> which is not ooo (this is all upstream, so not with the rivos cpu tune
> param).  But see below....
>
>
> > I'm sure Vineet will correct me if I'm wrong, but I think this is all
> > about spilling of address computations.  One of the 2nd-order questions
> > I've had is whether or not it's a cascading effect in the actual benchmark,
> > i.e. we spill so that we can compute some address.  Does that in turn
> > force something else to spill, and so on until we finally settle down?
> > If so, can we characterize when that happens and perhaps make spill
> > avoidance more aggressive when it looks like we're spilling address
> > computations that are likely to have this kind of cascading effect?
>
> The original cactu spills are very likely due to address computations, but
> the reduced test has nothing to do with these specifics.  sched1 is simply
> spilling a reg (alu+store) because the ECC computation heuristics are
> dampening away the cost of reducing reg pressure - so IMHO I agree with
> Richard that it at least needs to be made toggleable behavior if not removed
> outright.  Whether we do this under the guise of ooo vs. in-order or as a new
> sched1 spill-reduction heuristic is a different question.
>
> Granted, I'm not sure I quite understand how that original heuristic was
> helping in-order cores:
>
>     "... parallelising with spills vs. sequentialising ..."
>
> I understand this is all paraphrased, but even semantically how could
> spilling aid with parallelizing - especially on an in-order core?
>
> Thx,
> -Vineet
