On 8/7/24 12:28, Jeff Law wrote:
> On 8/7/24 11:47 AM, Richard Sandiford wrote:
>> I should probably start by saying that the "model" heuristic is now
>> pretty old and was originally tuned for an in-order AArch32 core.
>> The aim wasn't to *minimise* spilling, but to strike a better balance
>> between parallelising with spills vs. sequentialising.  At the time,
>> scheduling without taking register pressure into account would overly
>> parallelise things, whereas the original -fsched-pressure would overly
>> serialise (i.e. was too conservative).
>>
>> There were specific workloads in, er, a formerly popular embedded
>> benchmark that benefitted significantly from *some* spilling.
>>
>> This comment probably sums up the trade-off best:
>>
>>     This pressure cost is deliberately timid.  The intention has been
>>     to choose a heuristic that rarely interferes with the normal list
>>     scheduler in cases where that scheduler would produce good code.
>>     We simply want to curb some of its worst excesses.
>>
>> Because it was tuned for an in-order core, it was operating in an
>> environment where instruction latencies were meaningful and realistic.
>> So it still deferred to those to quite a big extent.  This is almost
>> certainly too conservative for out-of-order cores.
> What's interesting here is that the increased spilling roughly doubles
> the number of dynamic instructions we have to execute for the
> benchmark.  While a good uarch design can hide a lot of that overhead,
> it's still crazy bad.

[snip...]

>> ...I think for OoO cores, this:
>>
>>     baseECC (X) could itself be used as the ECC value described above.
>>     However, this is often too conservative, in the sense that it
>>     tends to make high-priority instructions that increase pressure
>>     wait too long in cases where introducing a spill would be better.
>>     For this reason the final ECC is a priority-adjusted form of
>>     baseECC (X).  Specifically, we calculate:
>>
>>       P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
>>       baseP = MAX { P (X) | baseECC (X) <= 0 }
>>
>>     Then:
>>
>>       ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)
>>
>>     Thus an instruction's effect on pressure is ignored if it has a high
>>     enough priority relative to the ones that don't increase pressure.
>>     Negative values of baseECC (X) do not increase the priority of X
>>     itself, but they do make it harder for other instructions to
>>     increase the pressure further.
>>
>> is probably not appropriate.  We should probably just use the baseECC,
>> as suggested by the first sentence in the comment.  It looks like the hack:
>>
>> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
>> index 1bc610f9a5f..9601e929a88 100644
>> --- a/gcc/haifa-sched.cc
>> +++ b/gcc/haifa-sched.cc
>> @@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>>          print_p = true;
>>        }
>>      cost = model_excess_cost (insns[i], print_p);
>> -    if (cost <= 0)
>> +    if (cost <= 0 && 0)
>>        {
>>          priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
>>          priority_base = MAX (priority_base, priority);
>> @@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>>   
>>     /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
>>        instruction.  */
>> +  if (0)
>>     for (i = 0; i < count; i++)
>>       {
>>         cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);
>>
>> fixes things for me.  Perhaps we should replace these && 0s
>> with a query for an out-of-order core?
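
To make the quoted formula concrete, here is a tiny standalone model of
it (my reconstruction with made-up insns and numbers, not the actual
haifa-sched.cc code):

  #include <algorithm>
  #include <climits>
  #include <cstdio>

  struct insn { const char *name; int priority, delay, base_ecc; };

  int
  main ()
  {
    /* Three ready insns; only C increases pressure (baseECC = 4).  */
    insn ready[] = { {"A", 10, 1, 0}, {"B", 8, 1, 0}, {"C", 12, 1, 4} };

    /* baseP = MAX { P (X) | baseECC (X) <= 0 }, where
       P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X).  */
    int base_p = INT_MIN;
    for (auto &x : ready)
      if (x.base_ecc <= 0)
        base_p = std::max (base_p, x.priority - x.delay - x.base_ecc);

    for (auto &x : ready)
      {
        int p = x.priority - x.delay - x.base_ecc;
        /* ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0).  */
        int ecc = std::max (std::min (base_p - p, x.base_ecc), 0);
        std::printf ("%s: baseECC = %d, ECC = %d\n", x.name, x.base_ecc, ecc);
      }
  }

With baseP = 9 and P (C) = 7, C's cost is damped from baseECC = 4 down
to ECC = 2: the higher C's priority relative to the pressure-neutral
insns, the more of its pressure cost gets ignored.  Dropping the
adjustment, as Richard suggests, would charge the full MAX (baseECC, 0).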

Yes, removing this heuristic does improve things, but unfortunately it
seems there's more in sched1 that needs unraveling - Jeff is right
after all :-)

                                                                    |  upstream | -fno-schedule-insns |  Patch  |
_ZL24ML_BSSN_Dissipation_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |    55,702 |              43,132 |  45,788 |
_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv      |   144,278 |              59,204 | 132,588 |
_ZL24ML_BSSN_constraints_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |   321,476 |             138,074 | 253,206 |
_ZL16ML_BSSN_RHS_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv         |   483,794 |             179,694 | 360,286 |



>>
>> I haven't benchmarked this. :)  And I'm looking at the code for the
>> first time in many years, so I'm certainly forgetting details.
> Well, I think we're probably too focused on ooo vs in-order.  The 
> badness we're seeing I think would likely trigger on the in-order risc-v 
> implementations out there as well.  Vineet, just to be sure, what core 
> are we scheduling for in your spec tests?

I'm not specifying anything, so it has to be the default mtune of
rocket, which is not OoO (this is all upstream, so not with the Rivos
CPU tune param).  But see below...


> I'm sure Vineet will correct me if I'm wrong, but I think this is all
> about spilling of address computations.  One of the 2nd order questions
> I've had is whether or not it's a cascading effect in the actual
> benchmark.  I.e., we spill so that we can compute some address.  Does
> that in turn force something else to then need to spill, and so on
> until we finally settle down?  If so, can we characterize when that
> happens and perhaps make spill avoidance more aggressive when it looks
> like we're spilling address computations that are likely to have this
> kind of cascading effect?

The original cactu spills are very likely due to address computations,
but the reduced test has nothing to do with those specifics.  sched1 is
simply spilling a reg (ALU + store) because the ECC computation
heuristics are damping away the cost of reducing reg pressure - so IMHO
I agree with Richard that this at least needs to be made toggleable
behavior, if not removed outright; a sketch of one possible toggle is
below.  Whether we do that under the guise of OoO vs. in-order, or as a
new sched1 spill-reduction heuristic, is a different question.
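
One possible shape for the toggle (entirely hypothetical - the flag
name and its placement are mine, not an agreed interface):

  #include <algorithm>

  /* Would come from a --param or the CPU tune description; default
     off for OoO cores per Richard's suggestion.  Hypothetical name.  */
  static bool sched_pressure_priority_adjust = false;

  /* Final ECC for one insn, given its baseECC, its P (X) and the
     ready list's baseP.  */
  static int
  final_ecc (int base_ecc, int p, int base_p)
  {
    if (!sched_pressure_priority_adjust)
      /* Charge the pressure cost directly, as the first sentence of
         the quoted comment suggests.  */
      return std::max (base_ecc, 0);
    /* Current behavior: priority-adjusted cost.  */
    return std::max (std::min (base_p - p, base_ecc), 0);
  }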

Granted, I'm not sure I quite understand how that original heuristic
was helping in-order cores:

    "... parallelising with spills vs. sequentialising ..."

I understand this is all paraphrased, but even semantically, how could
spilling aid parallelizing - especially on an in-order core?
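
My best guess at the intended meaning, as a toy example (entirely made
up: in-order core, long-latency FP multiply, only enough free registers
for one chain at a time):

  /* Two independent dependency chains.  */
  double chain_a (double x, double y, double z) { return (x * y) * z; }
  double chain_b (double p, double q, double r) { return (p * q) * r; }

  /* Keeping pressure minimal serialises the chains, fully exposing
     each multiply's latency:

       t0 = x*y; <stall>; t0 = t0*z; t1 = p*q; <stall>; t1 = t1*r;

     Spilling one unrelated live value frees a register, so the chains
     can interleave and one multiply's latency hides the other's:

       t0 = x*y; t1 = p*q; <spill>; t0 = t0*z; t1 = t1*r;

     One extra store, several fewer stall cycles.  */

If that reading is right, the spill buys ILP, but I may well be missing
the point.

Thx,
-Vineet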
