I should probably start by saying that the "model" heuristic is now
pretty old and was originally tuned for an in-order AArch32 core.
The aim wasn't to *minimise* spilling, but to strike a better balance
between parallelising at the cost of some spilling and serialising to
avoid it.  At the time, scheduling without taking register pressure
into account would overly parallelise things, whereas the original
-fsched-pressure would overly serialise (i.e. was too conservative).

There were specific workloads in, er, a formerly popular embedded
benchmark that benefitted significantly from *some* spilling.

This comment probably sums up the trade-off best:

   This pressure cost is deliberately timid.  The intention has been
   to choose a heuristic that rarely interferes with the normal list
   scheduler in cases where that scheduler would produce good code.
   We simply want to curb some of its worst excesses.

Because it was tuned for an in-order core, it was operating in an
environment where instruction latencies were meaningful and realistic,
so it still deferred to them to quite a large extent.  This is almost
certainly too conservative for out-of-order cores.

So...

Vineet Gupta <vine...@rivosinc.com> writes:
> On 8/5/24 21:31, Jeff Law wrote:
>> On 8/5/24 5:35 PM, Vineet Gupta wrote:
>>> Hi Richard,
>>>
>>> I'm reaching out for some insight on PR/114729. Apologies in advance for
>>> the long email.
>>>
>>> On RISC-V we are hitting a sched1 pathology on SPEC2017 Cactu where
>>> codegen spills are overwhelming the execution: disabling sched1 shaves
>>> off 1.3 trillion dynamic icounts, which is about half of the total
>>> execution (in a user-mode qemu run).
>>>
>>> I have reduced the test down to a single spill (w/ sched1, and none w/o).
>>> The full sched1 verbose log for the test is attached for reference (as is
>>> the test case itself).
>>>
>>> The gist of the issue (from the scheduling pov) is in the annotated insns
>>> below: 46, 54, 55
>>>
>>>      ld    a5,%lo(u)(s0)
>>>      fld    fa5,%lo(f)(t6)
>>>      fcvt.l.d a4,fa4,rtz
>>>      srli    a0,a5,16                  # insn 46  (ideal order 1)
>>>      srli    a1,a5,32                  # insn 55  (ideal order 3)
>>>      sh    a5,%lo(_Z1sv)(a3)           # insn 44
>>>      slli    a4,a4,3
>>>      srli    a5,a5,48
>>>      fsd    fa5,%lo(e)(t0)
>>>      add    a4,s9,a4
>>>      sh    a0,%lo(_Z1sv+2)(a3)        # insn 54  (ideal order 2)
>>>      sh    a1,%lo(_Z1sv+4)(a3)        # insn 64
>>>      sh    a5,%lo(_Z1sv+6)(a3)
>>>
>>> If insn 54 were scheduled ahead of insn 55, the corresponding reg need
>>> not be allocated (and consequently not spilled) in the outer loop.
>>> There are no uses of a0 after insn 54.
>> So what does the ready queue look like at the start of whatever cycle 
>> insn 46 fires on?  I would expect insn 46, 55, 44, the slli & fsd just 
>> based on data dependencies.
>
> Dump just before insn 46 is scheduled (most of these are from the tail end
> of the BB since it processes insns from the end)
>
> ;;        Ready list after ready_sort:
> 81:50(cost=0:prio=7:delay=2:idx=12) 
> 80:49(cost=0:prio=6:delay=1:idx=23)  65:44(cost=1:prio=7:idx=20) 
> 55:42(cost=1:prio=7:idx=18)  94:58(cost=0:prio=2:idx=0) 
> 92:56(cost=0:prio=5:idx=28)  88:54(cost=0:prio=5:idx=26) 
> 44:39(cost=0:prio=6:idx=15)  46:40(cost=0:prio=7:idx=16)
> ;;     13--> b  0: i  46 r180=zxt(r170,0x10,0x10)               
> :alu:@GR_REGS+1(1)@FP_REGS+0(0)
>
> 46 is at the head hence it is scheduled.
>
>>> Looking into the sched1 logs (#8916): schedule_insn () is called for insn
>>> 46 and insn 54 is added to the ready queue.
>> Presumably those happen on different cycles?   I would not expect 45 to 
>> enter the ready queue on some cycle N.  Then on N+M insn 54 should enter 
>> the ready queue, presumably with M == 1.
>
> Nope, they are in the same cycle since 54 is in SD_LIST_FORW of insn 46.
>
>   advance = schedule_insn(insn = 46)
>        for (sd_it = sd_iterator_start (insn, SD_LIST_FORW);
>             sd_iterator_cond (&sd_it, &dep);)
>             try_ready
>               fix_tick_ready
>                  change_queue_index(insn = 54, delay=QUEUE_READY) 
>
>>> Then we have ready_sort -> ready_sort_real () -> model_set_excess_costs
>>> () called for insns in ready q.
>>>
>>> For insn 54, in model_excess_cost () there is a reg dead, hence the -1 in
>>> the print below. However its model_excess_group_cost () is still 0,
>>> disregarding the -1 delta.
>>>
>>> ;;        |  17   54 |    6  +1 | GR_REGS:[-1 base cost 0] FP_REGS:[0
>>> base cost 0]
>>>
>>> Per comments around model_set_excess_costs () this seems intentional /
>>> as designed "... negative costs are converted to zero ones ...".
>>>
>>> The subsequent qsort () w/ numerous gyrations of rank_for_schedule ()
>>> ends up moving 55 to the top.
>> Presumably due to the number of uses and the types of uses.
>
> It's these 3 things, checked in the following order:
>
> RFS_PRESSURE_DELAY (low wins): ECC (tmp) + insn_delay (tmp) - ECC (tmp2) - insn_delay (tmp2)
> RFS_PRIORITY (high wins): INSN_PRIORITY (tmp2) - INSN_PRIORITY (tmp)
> RFS_PRESSURE_INDEX (low wins): model_index (tmp) - model_index (tmp2)
>
>>> ;;    Ready list (t =  14):
>>> 81:50(cost=0:prio=7:delay=1:idx=12)
>>> 94:58(cost=0:prio=2:idx=0)
>>> 92:56(cost=0:prio=5:idx=28)
>>> 88:54(cost=0:prio=5:idx=26)
>>> 80:49(cost=0:prio=6:idx=23)
>>> 54:41(cost=0:prio=6:idx=17)
>>> 44:39(cost=0:prio=6:idx=15)
>>> 65:44(cost=0:prio=7:idx=20)
>>> 55:42(cost=0:prio=7:idx=18)
>> Hard to know what to make of this, most of the insns referenced in the 
>> queue don't refer back to anything you've shown above.
>
> Yes, as most of them are from towards the end of the BB. These are just
> ready; the one at the end is the sorted winner which will be scheduled
> into the insn stream.
>
>
>>
>>> This does change the scheduling to move insn 44 ahead of insn 54, and in
>>> the next cycle the ideal insn 54 is picked (ahead of insn 55), avoiding
>>> the spill.
>>>
>>> So it seems it all comes down to ready_sort () -> rank_for_schedule()
>>> for deciding the winner (added prints in rfs_result)
>>>
>>> ...
>>> ;; --1-- RFS_PRIORITY winner 55 looser 54
>>> ;; --2-- RFS_PRIORITY winner 55 looser 54
>> Which is correct.  From a critical path standpoint it's more important 
>> to get insn 55 issued than it is to get insn 54 issued. But......
>>
>>> ...
>>> ;; --1-- RFS_PRIORITY winner 55 looser 44
>>> ;; --2-- RFS_PRIORITY winner 55 looser 44
>>> ;; --1-- RFS_PRESSURE_INDEX winner 55 looser 65
>>> ;; --1-- RFS_PRESSURE_INDEX winner 55 looser 65
>>> ;; --2-- RFS_PRESSURE_INDEX winner 55 looser 65
>>>
>>> Note that the priority of insn 55 (prio 7, an alu insn) will always be
>>> higher than that of insn 54 (prio 6, a store) since insn 55's SD_LIST_FORW
>>> has a store/sink (insn 64, also with prio 6).
>>> And since the ECC (model_set_excess_costs) calculations also include
>>> priority, there's a bias towards priority in the ranking.
>>> Meaning, all else being constant, priority will invariably cause a win.
>> Right, that's the right bias in general.  But with the priorities only 
>> differing by 1 unit, I would have expected the increase in register 
>> pressure to be enough to overcome that bias or at least make them equal.
>
> The actual reg pressure delta is subsumed in the ECC value, which tends
> to convert negative biases to zero.
>
>>
>>> I did find a potential issue where it seems to be disregarding prio tmp > 
>>> prio tmp2 (although this doesn't fix the issue)
>> Unsure ;-)
>
> Yeah sorry that was stupid.
>
>
>> Note that in rank_for_schedule we check pressure state before we check 
>> priority.  So it's a bit unclear why RFS_PRIORITY was selected when 
>> comparing insns 55 and 54.  I guess that would tend to indicate that the 
>> ECC wasn't enough to make a difference:
>
> Indeed. ECC delta would have to be neutral for it to be skipped. The
> order of things checked is what I enumerated above.
>
> I'm currently pursuing a different trail which comes from the observation
> that the initial model setup concludes that pressure is 28, so with 27
> allocatable regs we are bound to spill one.
> More on that after I find something concrete.

...I think for OoO cores, this:

   baseECC (X) could itself be used as the ECC value described above.
   However, this is often too conservative, in the sense that it
   tends to make high-priority instructions that increase pressure
   wait too long in cases where introducing a spill would be better.
   For this reason the final ECC is a priority-adjusted form of
   baseECC (X).  Specifically, we calculate:

     P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
     baseP = MAX { P (X) | baseECC (X) <= 0 }

   Then:

     ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)

   Thus an instruction's effect on pressure is ignored if it has a high
   enough priority relative to the ones that don't increase pressure.
   Negative values of baseECC (X) do not increase the priority of X
   itself, but they do make it harder for other instructions to
   increase the pressure further.

is probably not appropriate.  We should probably just use the baseECC,
as suggested by the first sentence in the comment.  It looks like the hack:

diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
index 1bc610f9a5f..9601e929a88 100644
--- a/gcc/haifa-sched.cc
+++ b/gcc/haifa-sched.cc
@@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
            print_p = true;
          }
        cost = model_excess_cost (insns[i], print_p);
-       if (cost <= 0)
+       if (cost <= 0 && 0)
          {
            priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
            priority_base = MAX (priority_base, priority);
@@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
 
   /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
      instruction.  */
+  if (0)
   for (i = 0; i < count; i++)
     {
       cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);

fixes things for me.  Perhaps we should replace these && 0s
with a query for an out-of-order core?
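
To make the comment's formula concrete with made-up numbers (the
priorities mirror insns 54 and 55 above; the baseECC values are
invented): suppose X has INSN_PRIORITY 6, insn_delay 0 and baseECC -1,
while Y has INSN_PRIORITY 7, insn_delay 0 and baseECC 0.  Then
P (X) = 6 - 0 - (-1) = 7 and P (Y) = 7 - 0 - 0 = 7, so baseP = 7 and
both ECC values clamp to 0.  X's pressure benefit becomes invisible to
rank_for_schedule and the higher-priority Y wins.

Very roughly, the kind of thing I mean is sketched below.  sched_ooo_core_p
is purely a stand-in for whatever query we would end up adding (a target
hook, a tune flag, or a param -- nothing like it exists today), and the
sketch is untested:

   /* In model_set_excess_costs: for out-of-order targets, keep the raw
      baseECC (including negative values) and skip the priority-based
      adjustment entirely.  */
   bool adjust_p = !sched_ooo_core_p ();	/* hypothetical query */

   cost = model_excess_cost (insns[i], print_p);
   if (cost <= 0 && adjust_p)
     {
       priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
       priority_base = MAX (priority_base, priority);
       cost = 0;
     }
   INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]) = cost;
   ...
   /* Only apply the MAX (baseECC, 0) / baseP adjustment when scheduling
      for an in-order core.  */
   if (adjust_p)
     for (i = 0; i < count; i++)
       {
         ...
       }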

I haven't benchmarked this. :)  And I'm looking at the code for the
first time in many years, so I'm certainly forgetting details.

Thanks,
Richard
