Hi Richard, On 8/7/24 10:47, Richard Sandiford wrote: > I should probably start by saying that the "model" heuristic is now > pretty old and was originally tuned for an in-order AArch32 core. > The aim wasn't to *minimise* spilling, but to strike a better balance > between parallelising with spills vs. sequentialising. At the time, > scheduling without taking register pressure into account would overly > parallelise things, whereas the original -fsched-pressure would overly > serialise (i.e. was too conservative). > > There were specific workloads in, er, a formerly popular embedded > benchmark that benefitted significantly from *some* spilling. > > This comment probably sums up the trade-off best: > > This pressure cost is deliberately timid. The intention has been > to choose a heuristic that rarely interferes with the normal list > scheduler in cases where that scheduler would produce good code. > We simply want to curb some of its worst excesses. > > Because it was tuned for an in-order core, it was operating in an > environment where instruction latencies were meaningful and realistic. > So it still deferred to those to quite a big extent. This is almost > certainly too conservative for out-of-order cores. > > So... > > Vineet Gupta <vine...@rivosinc.com> writes: >> On 8/5/24 21:31, Jeff Law wrote: >>> On 8/5/24 5:35 PM, Vineet Gupta wrote: >>>> Hi Richard, >>>> >>>> I'm reaching out for some insight on PR/114729. Apologies in advance for >>>> the long email. >>>> >>>> On RISC-V we are hitting sched1 pathology on SPEC2017 Cactu where >>>> codegen spills are overwhelming the execution: disabling sched1 shaves >>>> off 1.3 trillion dynamic icounts which is about half of total execution >>>> (in user mode qemu run). >>>> >>>> I have a reduced test down to a single spill (w/ sched1 and non w/o). >>>> The full sched1 verbose log for test is attached for reference (so is >>>> the test case itself). >>>> >>>> The gist of the issue (from schedule pov) is annotated insn below: 46, >>>> 54, 55 >>>> >>>> ld a5,%lo(u)(s0) >>>> fld fa5,%lo(f)(t6) >>>> fcvt.l.d a4,fa4,rtz >>>> srli a0,a5,16 # insn 46 (ideal order 1) >>>> srli a1,a5,32 # insn 55 (ideal order 3) >>>> sh a5,%lo(_Z1sv)(a3) # insn 44 >>>> slli a4,a4,3 >>>> srli a5,a5,48 >>>> fsd fa5,%lo(e)(t0) >>>> add a4,s9,a4 >>>> sh a0,%lo(_Z1sv+2)(a3) # insn 54 (ideal order 2) >>>> sh a1,%lo(_Z1sv+4)(a3) # insn 64 >>>> sh a5,%lo(_Z1sv+6)(a3) >>>> >>>> If insn 54 were scheduled ahead of insn 55, the corresponding reg need >>>> not be allocated (and consequently not spilled) in the outer loop. >>>> There are no uses of a0 after insn 54. >>> So what does the ready queue look like at the start of whatever cycle >>> insn 46 fires on? I would expect insn 46, 55, 44, the slli & fsd just >>> based on data dependencies. >> Dump just before insn 46 is scheduled (most of them are from tail end of >> BB due to it processing insns from end) >> >> ;; Ready list after ready_sort: >> 81:50(cost=0:prio=7:delay=2:idx=12) >> 80:49(cost=0:prio=6:delay=1:idx=23) 65:44(cost=1:prio=7:idx=20) >> 55:42(cost=1:prio=7:idx=18) 94:58(cost=0:prio=2:idx=0) >> 92:56(cost=0:prio=5:idx=28) 88:54(cost=0:prio=5:idx=26) >> 44:39(cost=0:prio=6:idx=15) 46:40(cost=0:prio=7:idx=16) >> ;; 13--> b 0: i 46 r180=zxt(r170,0x10,0x10) >> :alu:@GR_REGS+1(1)@FP_REGS+0(0) >> >> 46 is at the head hence it is scheduled. >> >>>> Looking into sched1 logs (#8916): schedule_insn () is called for insn >>>> 46 and insn 54 added to ready q. >>> Presumably those happen on different cycles? I would not expect 45 to >>> enter the ready queue on some cycle N. Then on N+M insn 54 should enter >>> the ready queue, prsumably with M == 1. >> Nope they are in same cycle since 54 is in SD_LIST_FORW of insn 46. >> >> advance = schedule_insn(insn = 46) >> for (sd_it = sd_iterator_start (insn, SD_LIST_FORW); >> sd_iterator_cond (&sd_it, &dep);) >> try_ready >> fix_tick_ready >> change_queue_index(insn = 54, delay=QUEUE_READY) >> >>>> Then we have ready_sort -> ready_sort_real () -> model_set_excess_costs >>>> () called for insns in ready q. >>>> >>>> For insn 54, model_excess_cost () there is a reg dead, hence the -1 in >>>> print below. However its model_excess_group_cost () is still 0, >>>> disregarding the delta -1. >>>> >>>> ;; | 17 54 | 6 +1 | GR_REGS:[-1 base cost 0] FP_REGS:[0 >>>> base cost 0] >>>> >>>> Per comments around model_set_excess_costs () this seems intentional / >>>> as designed "... negative costs are converted to zero ones ...". >>>> >>>> The subsequent qsort () w/ numerous gyrations of rank_for_schedule() >>>> ends up moving 55 to top. >>> Presumably due to the number of uses and the types of uses. >> Its the 3 things in following order: >> >> RFS_PRESSURE_DELAY (low wins): ECC(tmp) + insn_delay(tmp) - ECC (tmp2) >> - insn_delay (tmp2) >> RFS_PRIORITY (high wins): INSN_PRIORITY (tmp2) - INSN_PRIORITY (tmp) >> RFS_PRESSURE_INDEX (low wins): model_index (tmp) - model_index (tmp2) >> >>>> ;; Ready list (t = 14): >>>> 81:50(cost=0:prio=7:delay=1:idx=12) >>>> 94:58(cost=0:prio=2:idx=0) >>>> 92:56(cost=0:prio=5:idx=28) >>>> 88:54(cost=0:prio=5:idx=26) >>>> 80:49(cost=0:prio=6:idx=23) >>>> 54:41(cost=0:prio=6:idx=17) >>>> 44:39(cost=0:prio=6:idx=15) >>>> 65:44(cost=0:prio=7:idx=20) >>>> 55:42(cost=0:prio=7:idx=18) >>> Hard to know what to make of this, most of the insns referenced in the >>> queue don't refer back to anything you've shown above. >> Yes, as most of them are from towards end of BB. These are just ready, >> but the one at the last is the sorted one which will be scheduled in the >> insn stream. >> >> >>>> This does change scheduling to move insn 44 ahead of insn 54 and in the >>>> next cycle, ideal insn 54 is picked (ahead of insn 55) avoiding the spill. >>>> >>>> So it seems it all comes down to ready_sort () -> rank_for_schedule() >>>> for deciding the winner (added prints in rfs_result) >>>> >>>> ... >>>> ;; --1-- RFS_PRIORITY winner 55 looser 54 >>>> ;; --2-- RFS_PRIORITY winner 55 looser 54 >>> Which is correct. From a critical path standpoint it's more important >>> to get insn 55 issued than it is to get insn 54 issued. But...... >>> >>>> ... >>>> ;; --1-- RFS_PRIORITY winner 55 looser 44 >>>> ;; --2-- RFS_PRIORITY winner 55 looser 44 >>>> ;; --1-- RFS_PRESSURE_INDEX winner 55 looser 65 >>>> ;; --1-- RFS_PRESSURE_INDEX winner 55 looser 65 >>>> ;; --2-- RFS_PRESSURE_INDEX winner 55 looser 65 >>>> >>>> Note that priority of insn 55 (prio 7, alu insn) will always be higher >>>> than insn 54 (prio 6, a store) since insn 55 SD_LIST_FWD has a >>>> store/sink (insn 64, also with a prio 6). >>>> And since ECC (model_set_excess_costs) calculations also include >>>> priority, there's a bias towards priority in the ranking. >>>> Meaning all else being constant, priority will invariably cause a win. >>> Right, that's right bias in general. But with the priorities only >>> differing by 1 unit, I would have expected the increase in register >>> pressure to be enough to overcome that bias or at least make them equal. >> The actual reg pressure delta is subsumed in the ECC value which tends >> to convert negative biases to zero. >> >>>> I did find a potential issue where it seems to be disregarding prio tmp > >>>> prio tmp2 (although this doesn't fix the issue) >>> Unsure ;-) >> Yeah sorry that was stupid. >> >> >>> Note that in rank_for_schedule we check pressure state before we check >>> priority. So it's a bit unclear why RFS_PRIORITY was selected when >>> comparing insns 55 and 54. I guess that would tend to indicate that the >>> ECC wasn't enough to make a difference: >> Indeed. ECC delta would have to be neutral for it to be skipped. The >> order of things checked is what I enumerated above. >> >> I'm currently pursuing a different trail which comes form observation >> that initial model setup concludes that pressure is 28 so with 27 >> allocable regs we are bound to spill one. >> More on that after I find something concrete. > ...I think for OoO cores, this: > > baseECC (X) could itself be used as the ECC value described above. > However, this is often too conservative, in the sense that it > tends to make high-priority instructions that increase pressure > wait too long in cases where introducing a spill would be better. > For this reason the final ECC is a priority-adjusted form of > baseECC (X). Specifically, we calculate: > > P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X) > baseP = MAX { P (X) | baseECC (X) <= 0 } > > Then: > > ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0) > > Thus an instruction's effect on pressure is ignored if it has a high > enough priority relative to the ones that don't increase pressure. > Negative values of baseECC (X) do not increase the priority of X > itself, but they do make it harder for other instructions to > increase the pressure further. > > is probably not appropriate. We should probably just use the baseECC, > as suggested by the first sentence in the comment. It looks like the hack: > > diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc > index 1bc610f9a5f..9601e929a88 100644 > --- a/gcc/haifa-sched.cc > +++ b/gcc/haifa-sched.cc > @@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count) > print_p = true; > } > cost = model_excess_cost (insns[i], print_p); > - if (cost <= 0) > + if (cost <= 0 && 0) > { > priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost; > priority_base = MAX (priority_base, priority); > @@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count) > > /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each > instruction. */ > + if (0) > for (i = 0; i < count; i++) > { > cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]); > > fixes things for me. Perhaps we should replace these && 0s > with a query for an out-of-order core? > > I haven't benchmarked this. :) And I'm looking at the code for the > first time in many years, so I'm certainly forgetting details.
Back to the original issue, I reran reduce with the hack and see the issue still. I've updated the PR but here's the gist essentially: -fno-schedule-insns | -fschedule-insns | li t3,2 | li t5,2 li t1,1 | li t4,1 ---xxx--- | sd s10,8(sp) # %sfp .L2: | .L2: ---xxx--- | ld a5,8(sp) # %sfp beq s10,zero,.L9 | beq a5,zero,.L9 mv a1,s6 | mv a1,a6 beq s6,zero,.L7 | beq a6,zero,.L7 mulw a2,s5,s6 | mulw a2,a0,a6 mv a0,s5 | mv s10,a0 slli a6,s5,3 | slli s11,a0,3 slli a3,a2,3 | slli a3,a2,3 .L6: | .L6: fcvt.d.w fa5,a2 # 56 | fcvt.d.w fa5,a2 # 56 add a5,s4,a3 # 63 | add a4,s5,a3 # 63 fsd fa5,%lo(e)(t6) # 57 | add a5,s6,a3 # 58 fld fa4,0(a5) # 64 | fsd fa5,%lo(e)(t2) # 57 fld fa5,%lo(o)(t5) | fld fa3,0(a4) # 64 add a5,s3,a3 # 58 | fld fa4,0(a5) # 60 fmul.d fa4,fa4,fa5 | fld fa5,%lo(o)(t0) fld fa5,0(a5) # 60 | fcvt.l.d a4,fa3,rtz schedule_insn () is called as follows: ;; 0--> b 0: i 56 r210=flt(r185#0) :alu: GR_REGS+0(0) FP_REGS+1(1):model 0 ;; 1--> b 0: i 63 r215=r141+r183 :alu: GR_REGS+1(1) FP_REGS+0(0) ;; 2--> b 0: i 58 r211=r138+r183 :alu:@GR_REGS+1(1)@FP_REGS+0(0) The prints between insn 63 and insn 58 show the issue ;; point uid prio delay class : del ECC-gpr del ECC-fpr ECC-tot ;;| 1 57 | 12 +2 | GR_REGS:[0 base cost 0] FP_REGS:[-1 base cost 0]ECC 0 ;;| 10 61 | 4 +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0] ECC 0 ;;| 8 58 | 5 +1 | GR_REGS:[1 base cost 0] FP_REGS:[0 base cost 0] ECC 0 ;; --1-- RFS_PRESSURE_DELAY winner 58 looser 57 ;; --2-- RFS_PRESSURE_DELAY winner 58 looser 57 (1) insn 57 delta is -1 (reduces pressure), while insn 58 delta is 1 (increases pressure) yet ECC is 0 for both, meaning for pressure considerations they are treated the same. This ECC "dilution" happens in model_spill_cost (). (2). insn 57 is a store (so prio 2) while insn 58 is an add (so prio 1). rank_for_schedule (): RFS_PRESSURE_DELAY chooses lower of (ECC + insn_delay) and thus picks 58 over 57 causing the issue. Thx, -Vineet