Hi Richard,

On 8/7/24 10:47, Richard Sandiford wrote:
> I should probably start by saying that the "model" heuristic is now
> pretty old and was originally tuned for an in-order AArch32 core.
> The aim wasn't to *minimise* spilling, but to strike a better balance
> between parallelising with spills vs. sequentialising.  At the time,
> scheduling without taking register pressure into account would overly
> parallelise things, whereas the original -fsched-pressure would overly
> serialise (i.e. was too conservative).
>
> There were specific workloads in, er, a formerly popular embedded
> benchmark that benefitted significantly from *some* spilling.
>
> This comment probably sums up the trade-off best:
>
>    This pressure cost is deliberately timid.  The intention has been
>    to choose a heuristic that rarely interferes with the normal list
>    scheduler in cases where that scheduler would produce good code.
>    We simply want to curb some of its worst excesses.
>
> Because it was tuned for an in-order core, it was operating in an
> environment where instruction latencies were meaningful and realistic.
> So it still deferred to those to quite a big extent.  This is almost
> certainly too conservative for out-of-order cores.
>
> So...
>
> Vineet Gupta <vine...@rivosinc.com> writes:
>> On 8/5/24 21:31, Jeff Law wrote:
>>> On 8/5/24 5:35 PM, Vineet Gupta wrote:
>>>> Hi Richard,
>>>>
>>>> I'm reaching out for some insight on PR/114729. Apologies in advance for
>>>> the long email.
>>>>
>>>> On RISC-V we are hitting sched1 pathology on SPEC2017 Cactu where
>>>> codegen spills are overwhelming the execution: disabling sched1 shaves
>>>> off 1.3 trillion dynamic icounts which is about half of total execution
>>>> (in user mode qemu run).
>>>>
>>>> I have a reduced test down to a single spill (w/ sched1 and non w/o).
>>>> The full sched1 verbose log for test is attached for reference (so is
>>>> the test case itself).
>>>>
>>>> The gist of the issue (from schedule pov) is annotated insn below: 46,
>>>> 54, 55
>>>>
>>>>      ld    a5,%lo(u)(s0)
>>>>      fld    fa5,%lo(f)(t6)
>>>>      fcvt.l.d a4,fa4,rtz
>>>>      srli    a0,a5,16                  # insn 46  (ideal order 1)
>>>>      srli    a1,a5,32                  # insn 55  (ideal order 3)
>>>>      sh    a5,%lo(_Z1sv)(a3)           # insn 44
>>>>      slli    a4,a4,3
>>>>      srli    a5,a5,48
>>>>      fsd    fa5,%lo(e)(t0)
>>>>      add    a4,s9,a4
>>>>      sh    a0,%lo(_Z1sv+2)(a3)        # insn 54  (ideal order 2)
>>>>      sh    a1,%lo(_Z1sv+4)(a3)        # insn 64
>>>>      sh    a5,%lo(_Z1sv+6)(a3)
>>>>
>>>> If insn 54 were scheduled ahead of insn 55, the corresponding reg need
>>>> not be allocated (and consequently not spilled) in the outer loop.
>>>> There are no uses of a0 after insn 54.
>>> So what does the ready queue look like at the start of whatever cycle 
>>> insn 46 fires on?  I would expect insn 46, 55, 44, the slli & fsd just 
>>> based on data dependencies.
>> Dump just before insn 46 is scheduled (most of them are from tail end of
>> BB due to it processing insns from end)
>>
>> ;;        Ready list after ready_sort:
>> 81:50(cost=0:prio=7:delay=2:idx=12) 
>> 80:49(cost=0:prio=6:delay=1:idx=23)  65:44(cost=1:prio=7:idx=20) 
>> 55:42(cost=1:prio=7:idx=18)  94:58(cost=0:prio=2:idx=0) 
>> 92:56(cost=0:prio=5:idx=28)  88:54(cost=0:prio=5:idx=26) 
>> 44:39(cost=0:prio=6:idx=15)  46:40(cost=0:prio=7:idx=16)
>> ;;     13--> b  0: i  46 r180=zxt(r170,0x10,0x10)               
>> :alu:@GR_REGS+1(1)@FP_REGS+0(0)
>>
>> 46 is at the head hence it is scheduled.
>>
>>>> Looking into sched1 logs (#8916):  schedule_insn () is called for insn
>>>> 46 and insn 54 added to ready q.
>>> Presumably those happen on different cycles?   I would not expect 45 to 
>>> enter the ready queue on some cycle N.  Then on N+M insn 54 should enter 
>>> the ready queue, prsumably with M == 1.
>> Nope they are in same cycle since 54 is in SD_LIST_FORW of insn 46.
>>
>>   advance = schedule_insn(insn = 46)
>>        for (sd_it = sd_iterator_start (insn, SD_LIST_FORW);
>>             sd_iterator_cond (&sd_it, &dep);)
>>             try_ready
>>               fix_tick_ready
>>                  change_queue_index(insn = 54, delay=QUEUE_READY) 
>>
>>>> Then we have ready_sort -> ready_sort_real () -> model_set_excess_costs
>>>> () called for insns in ready q.
>>>>
>>>> For insn 54, model_excess_cost () there is a reg dead, hence the -1 in
>>>> print below. However its model_excess_group_cost () is still 0,
>>>> disregarding the delta -1.
>>>>
>>>> ;;        |  17   54 |    6  +1 | GR_REGS:[-1 base cost 0] FP_REGS:[0
>>>> base cost 0]
>>>>
>>>> Per comments around model_set_excess_costs () this seems intentional /
>>>> as designed "... negative costs are converted to zero ones ...".
>>>>
>>>> The subsequent qsort () w/ numerous gyrations of rank_for_schedule()
>>>> ends up moving 55 to top.
>>> Presumably due to the number of uses and the types of uses.
>> Its the 3 things in following order:
>>
>> RFS_PRESSURE_DELAY  (low wins): ECC(tmp) + insn_delay(tmp) - ECC (tmp2)
>> - insn_delay (tmp2)
>> RFS_PRIORITY (high wins): INSN_PRIORITY (tmp2) - INSN_PRIORITY (tmp)
>> RFS_PRESSURE_INDEX (low wins): model_index (tmp) - model_index (tmp2)
>>
>>>> ;;    Ready list (t =  14):
>>>> 81:50(cost=0:prio=7:delay=1:idx=12)
>>>> 94:58(cost=0:prio=2:idx=0)
>>>> 92:56(cost=0:prio=5:idx=28)
>>>> 88:54(cost=0:prio=5:idx=26)
>>>> 80:49(cost=0:prio=6:idx=23)
>>>> 54:41(cost=0:prio=6:idx=17)
>>>> 44:39(cost=0:prio=6:idx=15)
>>>> 65:44(cost=0:prio=7:idx=20)
>>>> 55:42(cost=0:prio=7:idx=18)
>>> Hard to know what to make of this, most of the insns referenced in the 
>>> queue don't refer back to anything you've shown above.
>> Yes, as most of them are from towards end of BB. These are just ready,
>> but the one at the last is the sorted one which will be scheduled in the
>> insn stream.
>>
>>
>>>> This does change scheduling to move insn 44 ahead of insn 54 and in the
>>>> next cycle, ideal insn 54 is picked (ahead of insn 55) avoiding the spill.
>>>>
>>>> So it seems it all comes down to ready_sort () -> rank_for_schedule()
>>>> for deciding the winner (added prints in rfs_result)
>>>>
>>>> ...
>>>> ;; --1-- RFS_PRIORITY winner 55 looser 54
>>>> ;; --2-- RFS_PRIORITY winner 55 looser 54
>>> Which is correct.  From a critical path standpoint it's more important 
>>> to get insn 55 issued than it is to get insn 54 issued. But......
>>>
>>>> ...
>>>> ;; --1-- RFS_PRIORITY winner 55 looser 44
>>>> ;; --2-- RFS_PRIORITY winner 55 looser 44
>>>> ;; --1-- RFS_PRESSURE_INDEX winner 55 looser 65
>>>> ;; --1-- RFS_PRESSURE_INDEX winner 55 looser 65
>>>> ;; --2-- RFS_PRESSURE_INDEX winner 55 looser 65
>>>>
>>>> Note that priority of insn 55 (prio 7, alu insn) will always be higher
>>>> than insn 54 (prio 6, a store) since insn 55 SD_LIST_FWD has a
>>>> store/sink (insn 64, also with a prio 6).
>>>> And since ECC (model_set_excess_costs) calculations also include
>>>> priority, there's a bias towards priority in the ranking.
>>>> Meaning all else being constant, priority will invariably cause a win.
>>> Right, that's right bias in general.  But with the priorities only 
>>> differing by 1 unit, I would have expected the increase in register 
>>> pressure to be enough to overcome that bias or at least make them equal.
>> The actual reg pressure delta is subsumed in the ECC value which tends
>> to convert negative biases to zero.
>>
>>>> I did find a potential issue where it seems to be disregarding prio tmp > 
>>>> prio tmp2 (although this doesn't fix the issue)
>>> Unsure ;-)
>> Yeah sorry that was stupid.
>>
>>
>>> Note that in rank_for_schedule we check pressure state before we check 
>>> priority.  So it's a bit unclear why RFS_PRIORITY was selected when 
>>> comparing insns 55 and 54.  I guess that would tend to indicate that the 
>>> ECC wasn't enough to make a difference:
>> Indeed. ECC delta would have to be neutral for it to be skipped. The
>> order of things checked is what I enumerated above.
>>
>> I'm currently pursuing a different trail which comes form observation
>> that initial model setup concludes that pressure is 28 so with 27
>> allocable regs we are bound to spill one.
>> More on that after I find something concrete.
> ...I think for OoO cores, this:
>
>    baseECC (X) could itself be used as the ECC value described above.
>    However, this is often too conservative, in the sense that it
>    tends to make high-priority instructions that increase pressure
>    wait too long in cases where introducing a spill would be better.
>    For this reason the final ECC is a priority-adjusted form of
>    baseECC (X).  Specifically, we calculate:
>
>      P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
>      baseP = MAX { P (X) | baseECC (X) <= 0 }
>
>    Then:
>
>      ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)
>
>    Thus an instruction's effect on pressure is ignored if it has a high
>    enough priority relative to the ones that don't increase pressure.
>    Negative values of baseECC (X) do not increase the priority of X
>    itself, but they do make it harder for other instructions to
>    increase the pressure further.
>
> is probably not appropriate.  We should probably just use the baseECC,
> as suggested by the first sentence in the comment.  It looks like the hack:
>
> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
> index 1bc610f9a5f..9601e929a88 100644
> --- a/gcc/haifa-sched.cc
> +++ b/gcc/haifa-sched.cc
> @@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>           print_p = true;
>         }
>       cost = model_excess_cost (insns[i], print_p);
> -     if (cost <= 0)
> +     if (cost <= 0 && 0)
>         {
>           priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
>           priority_base = MAX (priority_base, priority);
> @@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>  
>    /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
>       instruction.  */
> +  if (0)
>    for (i = 0; i < count; i++)
>      {
>        cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);
>
> fixes things for me.  Perhaps we should replace these && 0s
> with a query for an out-of-order core?
>
> I haven't benchmarked this. :)  And I'm looking at the code for the
> first time in many years, so I'm certainly forgetting details.

Back to the original issue, I reran reduce with the hack and see the issue 
still.
I've updated the PR but here's the gist essentially:

        -fno-schedule-insns             |      -fschedule-insns
                                        |
        li      t3,2                    |       li      t5,2
        li      t1,1                    |       li      t4,1
              ---xxx---                 |       sd      s10,8(sp)       # %sfp
.L2:                                    | .L2:
              ---xxx---                 |       ld      a5,8(sp)        # %sfp
        beq     s10,zero,.L9            |       beq     a5,zero,.L9             
        mv      a1,s6                   |       mv      a1,a6                   
        beq     s6,zero,.L7             |       beq     a6,zero,.L7             
        mulw    a2,s5,s6                |       mulw    a2,a0,a6                
        mv      a0,s5                   |       mv      s10,a0                  
        slli    a6,s5,3                 |       slli    s11,a0,3                
        slli    a3,a2,3                 |       slli    a3,a2,3                 
.L6:                                    | .L6:
        fcvt.d.w fa5,a2         # 56    |       fcvt.d.w fa5,a2         # 56
        add     a5,s4,a3        # 63    |       add     a4,s5,a3        # 63
        fsd     fa5,%lo(e)(t6)  # 57    |       add     a5,s6,a3        # 58
        fld     fa4,0(a5)       # 64    |       fsd     fa5,%lo(e)(t2)  # 57
        fld     fa5,%lo(o)(t5)          |       fld     fa3,0(a4)       # 64
        add     a5,s3,a3        # 58    |       fld     fa4,0(a5)       # 60
        fmul.d  fa4,fa4,fa5             |       fld     fa5,%lo(o)(t0)  
        fld     fa5,0(a5)       # 60    |       fcvt.l.d a4,fa3,rtz

schedule_insn () is called as follows:

;; 0--> b  0: i  56 r210=flt(r185#0) :alu: GR_REGS+0(0) FP_REGS+1(1):model 0
;; 1--> b  0: i  63 r215=r141+r183   :alu: GR_REGS+1(1) FP_REGS+0(0)  
;; 2--> b  0: i  58 r211=r138+r183   :alu:@GR_REGS+1(1)@FP_REGS+0(0)


The prints between insn 63 and insn 58 show the issue

;;  point uid  prio delay  class : del   ECC-gpr          del  ECC-fpr    
ECC-tot
;;|   1   57 |   12  +2 | GR_REGS:[0 base cost 0] FP_REGS:[-1 base cost 0]ECC 0
;;|  10   61 |    4  +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0] ECC 0
;;|   8   58 |    5  +1 | GR_REGS:[1 base cost 0] FP_REGS:[0 base cost 0] ECC 0

;; --1-- RFS_PRESSURE_DELAY winner 58 looser 57
;; --2-- RFS_PRESSURE_DELAY winner 58 looser 57

(1) insn 57 delta is -1 (reduces pressure), while insn 58 delta is 1 (increases 
pressure) yet ECC is 0 for both, meaning for pressure considerations
they are treated the same. This ECC "dilution" happens in model_spill_cost ().

(2). insn 57 is a store (so prio 2) while insn 58 is an add (so prio 1).

rank_for_schedule (): RFS_PRESSURE_DELAY chooses lower of (ECC + insn_delay) 
and thus picks 58 over 57 causing the issue.

Thx,
-Vineet

Reply via email to