On Thu, Aug 8, 2024 at 12:17 AM Vineet Gupta <vine...@rivosinc.com> wrote:
>
> On 8/7/24 12:28, Jeff Law wrote:
> > On 8/7/24 11:47 AM, Richard Sandiford wrote:
> >> I should probably start by saying that the "model" heuristic is now
> >> pretty old and was originally tuned for an in-order AArch32 core.
> >> The aim wasn't to *minimise* spilling, but to strike a better balance
> >> between parallelising with spills vs. sequentialising. At the time,
> >> scheduling without taking register pressure into account would overly
> >> parallelise things, whereas the original -fsched-pressure would overly
> >> serialise (i.e. was too conservative).
> >>
> >> There were specific workloads in, er, a formerly popular embedded
> >> benchmark that benefitted significantly from *some* spilling.
> >>
> >> This comment probably sums up the trade-off best:
> >>
> >>   This pressure cost is deliberately timid. The intention has been
> >>   to choose a heuristic that rarely interferes with the normal list
> >>   scheduler in cases where that scheduler would produce good code.
> >>   We simply want to curb some of its worst excesses.
> >>
> >> Because it was tuned for an in-order core, it was operating in an
> >> environment where instruction latencies were meaningful and realistic.
> >> So it still deferred to those to quite a big extent. This is almost
> >> certainly too conservative for out-of-order cores.
> >
> > What's interesting here is that the increased spilling roughly doubles
> > the number of dynamic instructions we have to execute for the benchmark.
> > While a good uarch design can hide a lot of that overhead, it's still
> > crazy bad.
>
> [snip...]
>
> >> ...I think for OoO cores, this:
> >>
> >>   baseECC (X) could itself be used as the ECC value described above.
> >>   However, this is often too conservative, in the sense that it
> >>   tends to make high-priority instructions that increase pressure
> >>   wait too long in cases where introducing a spill would be better.
> >>   For this reason the final ECC is a priority-adjusted form of
> >>   baseECC (X). Specifically, we calculate:
> >>
> >>     P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
> >>     baseP = MAX { P (X) | baseECC (X) <= 0 }
> >>
> >>   Then:
> >>
> >>     ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)
> >>
> >>   Thus an instruction's effect on pressure is ignored if it has a high
> >>   enough priority relative to the ones that don't increase pressure.
> >>   Negative values of baseECC (X) do not increase the priority of X
> >>   itself, but they do make it harder for other instructions to
> >>   increase the pressure further.
> >>
> >> is probably not appropriate. We should probably just use the baseECC,
> >> as suggested by the first sentence in the comment. It looks like the hack:
> >>
> >> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
> >> index 1bc610f9a5f..9601e929a88 100644
> >> --- a/gcc/haifa-sched.cc
> >> +++ b/gcc/haifa-sched.cc
> >> @@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
> >>              print_p = true;
> >>            }
> >>          cost = model_excess_cost (insns[i], print_p);
> >> -        if (cost <= 0)
> >> +        if (cost <= 0 && 0)
> >>            {
> >>              priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
> >>              priority_base = MAX (priority_base, priority);
> >> @@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
> >>
> >>    /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
> >>       instruction.  */
> >> +  if (0)
> >>    for (i = 0; i < count; i++)
> >>      {
> >>        cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);
> >>
> >> fixes things for me. Perhaps we should replace these && 0s
> >> with a query for an out-of-order core?
>
> Yes, removing this heuristic does improve things, but unfortunately it seems
> there's more in sched1 that needs unraveling - Jeff is right after all :-)
>
> |                                                                      | upstream | -fno-schedule-insns |   Patch |
> | _ZL24ML_BSSN_Dissipation_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |   55,702 |              43,132 |  45,788 |
> | _ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv      |  144,278 |              59,204 | 132,588 |
> | _ZL24ML_BSSN_constraints_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |  321,476 |             138,074 | 253,206 |
> | _ZL16ML_BSSN_RHS_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv         |  483,794 |             179,694 | 360,286 |
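(As an aside on the heuristic itself: the priority adjustment Richard quotes is
just a small calculation, so here is a stand-alone toy of it - not GCC code, and
the priorities, delays and baseECC values are invented - showing why an insn
that increases pressure but has a high enough priority ends up with an ECC of 0,
i.e. its pressure cost is ignored:

/* Stand-alone toy of the P/baseP/ECC arithmetic quoted above.  Not GCC
   code; the numbers are made up for illustration.  */

#include <algorithm>
#include <cstdio>
#include <vector>

struct toy_insn
{
  int priority;   /* stands in for INSN_PRIORITY (X) */
  int delay;      /* stands in for insn_delay (X) */
  int base_ecc;   /* stands in for baseECC (X) */
};

static void
toy_set_excess_costs (const std::vector<toy_insn> &insns)
{
  /* baseP = MAX { P (X) | baseECC (X) <= 0 }  */
  int base_p = 0;
  for (const toy_insn &x : insns)
    if (x.base_ecc <= 0)
      base_p = std::max (base_p, x.priority - x.delay - x.base_ecc);

  for (const toy_insn &x : insns)
    {
      /* P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)  */
      int p = x.priority - x.delay - x.base_ecc;
      /* ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)  */
      int ecc = std::max (std::min (base_p - p, x.base_ecc), 0);
      std::printf ("priority %2d  baseECC %2d  ->  ECC %d\n",
                   x.priority, x.base_ecc, ecc);
    }
}

int
main ()
{
  /* Insn A is pressure-neutral (baseECC 0, priority 10); insn B increases
     pressure (baseECC 4) but has the higher priority 20.  B's P exceeds
     baseP, so baseP - P (B) is negative and its ECC collapses to 0: the
     pressure cost is ignored, which is the behaviour the && 0 hack above
     disables.  Drop B's priority to 12 and its ECC becomes 2 instead.  */
  toy_set_excess_costs ({{10, 1, 0}, {20, 1, 4}});
  return 0;
}

The real model_set_excess_costs has more to it; this only shows the arithmetic
in the quoted comment.)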
Note that even on x86 we spill like crazy in these functions - we are dealing
with >16 memory streams here, so with the tendency to hoist loads and sink
stores some spilling is inevitable. So this isn't a good testcase to tune the
scheduler with, but rather a bad outlier.

> >
> >> I haven't benchmarked this. :) And I'm looking at the code for the
> >> first time in many years, so I'm certainly forgetting details.
> >
> > Well, I think we're probably too focused on ooo vs in-order. The
> > badness we're seeing I think would likely trigger on the in-order risc-v
> > implementations out there as well. Vineet, just to be sure, what core
> > are we scheduling for in your spec tests?
>
> I'm not specifying anything, so it has to be the default mtune of rocket,
> which is not ooo (this is all upstream, so not with the rivos cpu tune param).
> But see below....
>
> > I'm sure Vineet will correct me if I'm wrong, but I think this is all
> > about spilling of address computations. One of the 2nd order questions
> > I've had is whether or not it's a cascading effect in the actual benchmark.
> > ie, we spill so that we can compute some address. Does that in turn
> > force something else to then need to spill, and so on until we finally
> > settle down? If so, can we characterize when that happens and perhaps
> > make spill avoidance more aggressive when it looks like we're spilling
> > address computations that are likely to have this kind of cascading effect?
>
> The original cactu spills are very likely due to address computations, but
> the reduced test has nothing to do with these specifics. sched1 is simply
> spilling a reg (alu+store) because the ECC computation heuristics are
> dampening away the cost of reducing reg pressure - so IMHO I agree with
> Richard that it at least needs to be made toggleable behavior, if not removed
> outright. Whether we do this under the guise of ooo vs. in-order or as a new
> "sched1 reduce spills" heuristic is a different question.
>
> Granted, I'm not sure I quite understand how that original heuristic was
> helping in-order cores:
>
> "... parallelising with spills vs. sequentialising ..."
>
> I understand this is all paraphrased, but even semantically how could
> spilling aid with parallelizing - especially on an in-order core?
>
> Thx,
> -Vineet
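On that last question, my (possibly wrong) reading of "parallelising with
spills" is something like the contrived sketch below - the load latency and the
register budget mentioned in the comment are invented and have nothing to do
with the cactu functions above:

/* Contrived illustration only; latencies and register budget are made up.  */

#include <cstdio>

long
sum_two_chains (const long *a, const long *b, int n)
{
  long s0 = 0, s1 = 0;
  for (int i = 0; i < n; i++)
    {
      /* The two accumulations are independent.  A scheduler willing to
         raise pressure can interleave the A and B loads (and hoist the
         next iteration's loads above this iteration's adds), so on an
         in-order core each add finds its operand ready; the price is
         more simultaneously-live values, which may exceed the register
         budget and force a spill/reload.  A scheduler that refuses to
         exceed the budget keeps the two chains back to back, so every
         add stalls for the full load latency.  Which of the two costs
         more depends on the core - that is the balance the model
         heuristic was apparently tuned to strike.  */
      s0 += a[i];
      s1 += b[i];
    }
  return s0 + s1;
}

int
main ()
{
  long a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40};
  std::printf ("%ld\n", sum_two_chains (a, b, 4));
  return 0;
}

In other words the spill buys instruction-level parallelism that an in-order
pipe cannot recover by itself, whereas an out-of-order core can usually hide
the load latency anyway - which is presumably why the heuristic looks too timid
there.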