Vineet Gupta <vine...@rivosinc.com> writes:
> On 8/7/24 12:28, Jeff Law wrote:
>> On 8/7/24 11:47 AM, Richard Sandiford wrote:
>>> I should probably start by saying that the "model" heuristic is now
>>> pretty old and was originally tuned for an in-order AArch32 core.
>>> The aim wasn't to *minimise* spilling, but to strike a better balance
>>> between parallelising with spills vs. sequentialising.  At the time,
>>> scheduling without taking register pressure into account would overly
>>> parallelise things, whereas the original -fsched-pressure would overly
>>> serialise (i.e. was too conservative).
>>>
>>> There were specific workloads in, er, a formerly popular embedded
>>> benchmark that benefitted significantly from *some* spilling.
>>>
>>> This comment probably sums up the trade-off best:
>>>
>>>      This pressure cost is deliberately timid.  The intention has been
>>>      to choose a heuristic that rarely interferes with the normal list
>>>      scheduler in cases where that scheduler would produce good code.
>>>      We simply want to curb some of its worst excesses.
>>>
>>> Because it was tuned for an in-order core, it was operating in an
>>> environment where instruction latencies were meaningful and realistic.
>>> So it still deferred to those to quite a big extent.  This is almost
>>> certainly too conservative for out-of-order cores.
>> What's interesting here is that the increased spilling roughly doubles
>> the number of dynamic instructions we have to execute for the benchmark.
>> While a good uarch design can hide a lot of that overhead, it's still
>> crazy bad.
>
> [snip...]
>
>>> ...I think for OoO cores, this:
>>>
>>>      baseECC (X) could itself be used as the ECC value described above.
>>>      However, this is often too conservative, in the sense that it
>>>      tends to make high-priority instructions that increase pressure
>>>      wait too long in cases where introducing a spill would be better.
>>>      For this reason the final ECC is a priority-adjusted form of
>>>      baseECC (X).  Specifically, we calculate:
>>>
>>>         P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
>>>         baseP = MAX { P (X) | baseECC (X) <= 0 }
>>>
>>>      Then:
>>>
>>>         ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)
>>>
>>>      Thus an instruction's effect on pressure is ignored if it has a high
>>>      enough priority relative to the ones that don't increase pressure.
>>>      Negative values of baseECC (X) do not increase the priority of X
>>>      itself, but they do make it harder for other instructions to
>>>      increase the pressure further.
>>>
>>> is probably not appropriate.  We should probably just use the baseECC,
>>> as suggested by the first sentence in the comment.  It looks like the hack:
>>>
>>> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
>>> index 1bc610f9a5f..9601e929a88 100644
>>> --- a/gcc/haifa-sched.cc
>>> +++ b/gcc/haifa-sched.cc
>>> @@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>>>            print_p = true;
>>>          }
>>>        cost = model_excess_cost (insns[i], print_p);
>>> -      if (cost <= 0)
>>> +      if (cost <= 0 && 0)
>>>          {
>>>            priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
>>>            priority_base = MAX (priority_base, priority);
>>> @@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>>>
>>>    /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
>>>       instruction.  */
>>> +  if (0)
>>>    for (i = 0; i < count; i++)
>>>      {
>>>        cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);
>>>
>>> fixes things for me.  Perhaps we should replace these && 0s
>>> with a query for an out-of-order core?
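As an aside, in case it helps to see what that adjustment does in
isolation, it boils down to something like the standalone sketch below.
The structures and numbers are made up for illustration; this is not
the real haifa-sched.cc code, just the quoted formula pulled out so it
can be experimented with:

    /* Standalone sketch of the priority adjustment quoted above, with
       made-up structures and numbers.  */

    #include <stdio.h>

    #define MAX(A, B) ((A) > (B) ? (A) : (B))
    #define MIN(A, B) ((A) < (B) ? (A) : (B))

    struct toy_insn
    {
      const char *name;
      int priority;   /* stands in for INSN_PRIORITY (X) */
      int delay;      /* stands in for insn_delay (X) */
      int base_ecc;   /* baseECC (X): cost of the pressure change */
    };

    /* Compute ECC (X) for each of the COUNT instructions in INSNS.  */
    static void
    set_excess_costs (const struct toy_insn *insns, int count, int *ecc)
    {
      /* baseP = MAX { P (X) | baseECC (X) <= 0 }  */
      int base_p = 0;
      for (int i = 0; i < count; i++)
        if (insns[i].base_ecc <= 0)
          base_p = MAX (base_p, (insns[i].priority - insns[i].delay
                                 - insns[i].base_ecc));

      /* ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)  */
      for (int i = 0; i < count; i++)
        {
          int p = insns[i].priority - insns[i].delay - insns[i].base_ecc;
          ecc[i] = MAX (MIN (base_p - p, insns[i].base_ecc), 0);
        }
    }

    int
    main (void)
    {
      /* Two instructions that don't increase pressure, one high-priority
         instruction that does, and one low-priority one that does.  */
      struct toy_insn insns[] = {
        { "neutral",       10, 1,  0 },
        { "reduces",        8, 1, -2 },
        { "inc-high-prio", 14, 1,  4 },
        { "inc-low-prio",   9, 1,  4 },
      };
      int ecc[4];

      set_excess_costs (insns, 4, ecc);
      for (int i = 0; i < 4; i++)
        printf ("%-15s baseECC %2d -> ECC %d\n",
                insns[i].name, insns[i].base_ecc, ecc[i]);
      return 0;
    }

The high-priority pressure-increasing instruction ends up with ECC 0
(its pressure cost is waived entirely), while the low-priority one keeps
the full cost - which is the leniency being discussed here.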
> Yes, removing this heuristic does improve things, but unfortunately it seems
> there's more in sched1 that needs unraveling - Jeff is right after all :-)
>
> |                                                                     | upstream | -fno-schedule-insns | Patch   |
> | _ZL24ML_BSSN_Dissipation_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |   55,702 |              43,132 |  45,788 |
> | _ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv      |  144,278 |              59,204 | 132,588 |
> | _ZL24ML_BSSN_constraints_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |  321,476 |             138,074 | 253,206 |
> | _ZL16ML_BSSN_RHS_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv         |  483,794 |             179,694 | 360,286 |
>
>>> I haven't benchmarked this. :)  And I'm looking at the code for the
>>> first time in many years, so I'm certainly forgetting details.
>> Well, I think we're probably too focused on ooo vs in-order.  The
>> badness we're seeing I think would likely trigger on the in-order risc-v
>> implementations out there as well.  Vineet, just to be sure, what core
>> are we scheduling for in your spec tests?
>
> I'm not specifying anything, so it has to be the default mtune of rocket,
> which is not ooo (this is all upstream, so not with the rivos cpu tune param).

I think the question is more: which kind of core is the code being run on?
(Rather than: what is GCC tuning for?)  What we're trying to measure is how
spilling/not spilling affects the recorded cycle count.  I don't think
instruction counts are a good proxy for performance on in-order cores,
especially when it comes to spilling.

> But see below....
>
>> I'm sure Vineet will correct me if I'm wrong but I think this is all
>> about spilling of address computations.  One of the 2nd order questions
>> I've had is whether or not it's a cascading effect in the actual benchmark.
>> ie, we spill so that we can compute some address.  Does that in turn
>> force something else to then need to spill, and so on until we finally
>> settle down.  If so, can we characterize when that happens and perhaps
>> make spill avoidance more aggressive when it looks like we're spilling
>> address computations that are likely to have this kind of cascading effect?
>
> The original cactu spills are very likely due to address computations, but
> the reduced test has nothing to do with these specifics.  sched1 is simply
> spilling a reg (alu+store) because the ECC computation heuristics are
> dampening away the cost of reducing reg pressure - so IMHO I agree with
> Richard that it at least needs to be made toggleable behavior, if not
> removed outright.  Whether we do this under the guise of ooo vs. in-order
> or a new "sched1 reduce spills" heuristic is a different question.
>
> Granted, I'm not sure I quite understand how that original heuristic was
> helping in-order cores:
>
>    "... parallelising with spills vs. sequentialising ..."
>
> I understand this is all paraphrased, but even semantically, how could
> spilling aid with parallelizing - especially on an in-order core?
>
> Thx,
> -Vineet

With -fno-sched-pressure, the scheduler will tend to parallelise
aggressively, such as by placing loads as soon as the pipeline model
allows.  Aggressively parallelising is what causes the spills without
-fsched-pressure.  The idea is to strike a balance between parallelising
(potentially more spills, potentially fewer in-order pipeline stalls)
and serialising (potentially fewer spills, potentially more in-order
pipeline stalls).

See https://gcc.gnu.org/legacy-ml/gcc-patches/2011-12/msg01684.html
for a write-up of how this approach came about.

I wrote the code at a previous employer, so I don't have access to my
notes.  I probably couldn't share the details of the specific workloads
anyway, due to the nature of the benchmark.  But conceptually, I think
the reason spilling can help for in-order cores is that scheduling tends
to prioritise loads and deprioritise stores, meaning the in-order
load/store units tend to be underutilised in the middle of a loop.

E.g. for the sake of a simple example, take a contrived situation in
which the core can issue:

- one vector load/store a cycle
- one vector ALU operation a cycle

Let's assume:

- loads have a latency of 3 and ALU ops a latency of 2
- good store-load forwarding
- there are only 2 architected vector registers(!)

Like I say, it's all for the sake of a simple example only.
Let's assume that the input code is:

   I1: R1 = *PTR1
   I2: R2 = op(R1)
   I3: R3 = op(R2)
   I4: R4 = op(R3)
   I5: R5 = op(R3)
   I6: R6 = op(R4,R5)
   I7: R7 = *PTR2
   I8: R8 = op(R7)
   I9: R9 = op(R6,R8)
   IA: *PTR3 = R9

This doesn't spill, and would issue as follows (ALU slot in the first
column, load/store slot in the second):

   -- I1
   -- --
   -- --
   I2 --
   -- --
   I3 --
   -- --
   I4 --
   I5 --
   -- --
   I6 I7
   -- --
   -- --
   I8 --
   -- --
   I9 --
   -- --
   -- IA

(18 cycles).
If we allow a spill, we can get:

   -- I1
   -- I7
   -- --
   I2 --
   I8 --
   I3 --
   -- XX
   I4 --
   I5 --
   -- --
   I6 YY
   -- --
   -- --
   I9 --
   -- --
   -- IA

(16 cycles, XX = spill, YY = reload).  Spilling has then improved
performance by ~11%.

Thanks,
Richard
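P.S. In case anyone wants to double-check the two cycle counts, below is
a quick throwaway simulator of the toy machine: strictly in-order issue,
one load/store and one ALU op per cycle, load latency 3, ALU latency 2,
and a reload gated only on the issue of the matching spill store (good
forwarding).  It doesn't model the 2-register limit itself - the two
instruction orders already encode that - and all the names and encodings
are made up for this example rather than taken from GCC.

    /* Throwaway checker for the cycle counts of the two schedules above.  */

    #include <stdio.h>

    #define MAX(A, B) ((A) > (B) ? (A) : (B))

    enum unit { LS, ALU };

    struct toy_insn
    {
      const char *name;
      enum unit unit;
      int latency;   /* cycles from issue until the result is usable */
      int deps[2];   /* indices of producers, -1 for unused slots */
    };

    /* Issue INSNS strictly in program order, at most one instruction per
       unit per cycle, and print the issue cycle of each one.  */
    static void
    simulate (const char *title, const struct toy_insn *insns, int count)
    {
      int issue[32];
      int unit_busy[2] = { 0, 0 };
      int prev = 0;

      printf ("%s:\n", title);
      for (int i = 0; i < count; i++)
        {
          /* In-order: never issue before the previous instruction.  */
          int cycle = MAX (prev, 1);

          /* Structural hazard: one instruction per unit per cycle.  */
          if (cycle <= unit_busy[insns[i].unit])
            cycle = unit_busy[insns[i].unit] + 1;

          /* Data hazards.  */
          for (int d = 0; d < 2; d++)
            if (insns[i].deps[d] >= 0)
              cycle = MAX (cycle, (issue[insns[i].deps[d]]
                                   + insns[insns[i].deps[d]].latency));

          issue[i] = cycle;
          unit_busy[insns[i].unit] = cycle;
          prev = cycle;
          printf ("  cycle %2d: %s\n", cycle, insns[i].name);
        }
      printf ("  -> %d cycles\n\n", issue[count - 1]);
    }

    int
    main (void)
    {
      /* The original, serialised sequence (no spill).  */
      struct toy_insn nospill[] = {
        { "I1", LS,  3, { -1, -1 } },
        { "I2", ALU, 2, {  0, -1 } },
        { "I3", ALU, 2, {  1, -1 } },
        { "I4", ALU, 2, {  2, -1 } },
        { "I5", ALU, 2, {  2, -1 } },
        { "I6", ALU, 2, {  3,  4 } },
        { "I7", LS,  3, { -1, -1 } },
        { "I8", ALU, 2, {  6, -1 } },
        { "I9", ALU, 2, {  5,  7 } },
        { "IA", LS,  0, {  8, -1 } },   /* store: latency not used */
      };

      /* The parallelised sequence, spilling (XX) and reloading (YY)
         the result of I8.  */
      struct toy_insn spill[] = {
        { "I1", LS,  3, { -1, -1 } },
        { "I7", LS,  3, { -1, -1 } },
        { "I2", ALU, 2, {  0, -1 } },
        { "I8", ALU, 2, {  1, -1 } },
        { "I3", ALU, 2, {  2, -1 } },
        { "XX", LS,  0, {  3, -1 } },   /* spill store */
        { "I4", ALU, 2, {  4, -1 } },
        { "I5", ALU, 2, {  4, -1 } },
        { "I6", ALU, 2, {  6,  7 } },
        { "YY", LS,  3, {  5, -1 } },   /* reload, forwarded from XX */
        { "I9", ALU, 2, {  8,  9 } },
        { "IA", LS,  0, { 10, -1 } },
      };

      simulate ("no spill", nospill, 10);   /* prints 18 cycles */
      simulate ("spill", spill, 12);        /* prints 16 cycles */
      return 0;
    }

It prints 18 cycles for the serialised sequence and 16 for the spilled
one, matching the schedules above.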