Vineet Gupta <vine...@rivosinc.com> writes:
> On 8/7/24 12:28, Jeff Law wrote:
>> On 8/7/24 11:47 AM, Richard Sandiford wrote:
>>> I should probably start by saying that the "model" heuristic is now
>>> pretty old and was originally tuned for an in-order AArch32 core.
>>> The aim wasn't to *minimise* spilling, but to strike a better balance
>>> between parallelising with spills vs. sequentialising.  At the time,
>>> scheduling without taking register pressure into account would overly
>>> parallelise things, whereas the original -fsched-pressure would overly
>>> serialise (i.e. was too conservative).
>>>
>>> There were specific workloads in, er, a formerly popular embedded
>>> benchmark that benefitted significantly from *some* spilling.
>>>
>>> This comment probably sums up the trade-off best:
>>>
>>>     This pressure cost is deliberately timid.  The intention has been
>>>     to choose a heuristic that rarely interferes with the normal list
>>>     scheduler in cases where that scheduler would produce good code.
>>>     We simply want to curb some of its worst excesses.
>>>
>>> Because it was tuned for an in-order core, it was operating in an
>>> environment where instruction latencies were meaningful and realistic.
>>> So it still deferred to those to quite a big extent.  This is almost
>>> certainly too conservative for out-of-order cores.
>> What's interesting here is that the increased spilling roughly doubles 
>> the number of dynamic instructions we have to execute for the benchmark. 
>>   While a good uarch design can hide a lot of that overhead, it's still 
>> crazy bad.
>
> [snip...]
>
>>> ...I think for OoO cores, this:
>>>
>>>     baseECC (X) could itself be used as the ECC value described above.
>>>     However, this is often too conservative, in the sense that it
>>>     tends to make high-priority instructions that increase pressure
>>>     wait too long in cases where introducing a spill would be better.
>>>     For this reason the final ECC is a priority-adjusted form of
>>>     baseECC (X).  Specifically, we calculate:
>>>
>>>       P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X)
>>>       baseP = MAX { P (X) | baseECC (X) <= 0 }
>>>
>>>     Then:
>>>
>>>       ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0)
>>>
>>>     Thus an instruction's effect on pressure is ignored if it has a high
>>>     enough priority relative to the ones that don't increase pressure.
>>>     Negative values of baseECC (X) do not increase the priority of X
>>>     itself, but they do make it harder for other instructions to
>>>     increase the pressure further.
>>>
>>> is probably not appropriate.  We should probably just use the baseECC,
>>> as suggested by the first sentence in the comment.  It looks like the hack:
>>>
>>> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
>>> index 1bc610f9a5f..9601e929a88 100644
>>> --- a/gcc/haifa-sched.cc
>>> +++ b/gcc/haifa-sched.cc
>>> @@ -2512,7 +2512,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>>>         print_p = true;
>>>       }
>>>     cost = model_excess_cost (insns[i], print_p);
>>> -   if (cost <= 0)
>>> +   if (cost <= 0 && 0)
>>>       {
>>>         priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
>>>         priority_base = MAX (priority_base, priority);
>>> @@ -2525,6 +2525,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
>>>   
>>>     /* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
>>>        instruction.  */
>>> +  if (0)
>>>     for (i = 0; i < count; i++)
>>>       {
>>>         cost = INSN_REG_PRESSURE_EXCESS_COST_CHANGE (insns[i]);
>>>
>>> fixes things for me.  Perhaps we should replace these && 0s
>>> with a query for an out-of-order core?
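
To make the quoted formulas concrete, here is a minimal, self-contained
sketch (illustrative only, not the GCC code; the instruction names,
priorities, delays and baseECC values are all invented) of how the priority
adjustment forgives the pressure cost of sufficiently high-priority
instructions:

  /* Toy model of the priority-adjusted ECC described above;
     every value here is made up for illustration.  */
  #include <algorithm>
  #include <climits>
  #include <cstdio>
  #include <vector>

  struct toy_insn
  {
    const char *name;
    int priority;   /* stand-in for INSN_PRIORITY (X) */
    int delay;      /* stand-in for insn_delay (X) */
    int base_ecc;   /* stand-in for baseECC (X) */
  };

  int
  main ()
  {
    std::vector<toy_insn> insns = {
      { "A", 10, 1, 0 },  /* does not increase excess pressure */
      { "B", 14, 1, 4 },  /* high priority, increases pressure */
      { "C",  6, 1, 3 }   /* low priority, increases pressure */
    };

    /* baseP = MAX { P (X) | baseECC (X) <= 0 }, where
       P (X) = INSN_PRIORITY (X) - insn_delay (X) - baseECC (X).  */
    int base_p = INT_MIN;
    for (const toy_insn &x : insns)
      if (x.base_ecc <= 0)
        base_p = std::max (base_p, x.priority - x.delay - x.base_ecc);

    /* ECC (X) = MAX (MIN (baseP - P (X), baseECC (X)), 0).  */
    for (const toy_insn &x : insns)
      {
        int p = x.priority - x.delay - x.base_ecc;
        int ecc = std::max (std::min (base_p - p, x.base_ecc), 0);
        std::printf ("%s: P = %d, baseECC = %d -> ECC = %d\n",
                     x.name, p, x.base_ecc, ecc);
      }
    return 0;
  }

With these numbers, baseP is 9 (from A).  B also has P = 9, so its pressure
cost of 4 is forgiven entirely (ECC = 0), while C has P = 2 and pays its
full cost of 3.  It is this forgiveness that the && 0 hack above disables,
so that the raw baseECC is used instead.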
>
> Yes, removing this heuristic does improve things, but unfortunately it seems
> there's more in sched1 that needs unraveling - Jeff is right after all :-)
>
>                                                                     | upstream | -fno-schedule-insns |   Patch |
> _ZL24ML_BSSN_Dissipation_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |   55,702 |              43,132 |  45,788 |
> _ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv      |  144,278 |              59,204 | 132,588 |
> _ZL24ML_BSSN_constraints_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv |  321,476 |             138,074 | 253,206 |
> _ZL16ML_BSSN_RHS_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd.lto_priv         |  483,794 |             179,694 | 360,286 |
>
>
>
>>>
>>> I haven't benchmarked this. :)  And I'm looking at the code for the
>>> first time in many years, so I'm certainly forgetting details.
>> Well, I think we're probably too focused on ooo vs in-order.  The 
>> badness we're seeing I think would likely trigger on the in-order risc-v 
>> implementations out there as well.  Vineet, just to be sure, what core 
>> are we scheduling for in your spec tests?
>
> I'm not specifying anything, so it has to be the default mtune of rocket, which
> is not OoO (this is all upstream, so not with the Rivos CPU tune param).

I think the question is more: which kind of core is the code being run on?
(Rather than: what is GCC tuning for?)  What we're trying to measure is
how spilling/not spilling affects the recorded cycle count.

I don't think instruction counts are a good proxy for performance on
in-order cores, especially when it comes to spilling.

> But see below....
>
>
>> I'm sure Vineet will correct me if I'm wrong but I think this is all 
>> about spilling of address computations.  One of the 2nd order questions 
>> I've had is whether or not it's a cascading effect in actual benchmark. 
>> ie, we spill so that we can compute some address.  Does that in turn 
>> force something else to then need to spill, and so-on until we finally 
>> settle down.  If so can we characterize when that happens and perhaps 
>> make spill avoidance more aggressive when it looks like we're spilling 
>> address computations that are likely to have this kind of cascading effect.
>
> The original cactu spills are very likely due to address computations, but the
> reduced test has nothing to do with those specifics. sched1 is simply
> spilling a reg (ALU + store) because the ECC computation heuristics are
> dampening away the cost of reducing reg pressure - so IMHO I agree with Richard
> that it at least needs to be made toggleable behavior, if not removed outright.
> Whether we do this under the guise of OoO vs. in-order, or as a new
> "reduce spills in sched1" heuristic, is a different question.
>
> Granted, I'm not sure I quite understand how that original heuristic was
> helping in-order cores:
>
>     "... parallelising with spills vs. sequentialising ..."
>
> I understand this is all paraphrased, but even semantically, how could spilling
> aid with parallelizing - especially on an in-order core?
>
> Thx,
> -Vineet

With -fno-sched-pressure, the scheduler will tend to parallelise aggressively,
such as by placing loads as soon as the pipeline model allows.  Aggressively
parallelising is what causes the spills without -fsched-pressure.  The idea
is to strike a balance between parallelising (potentially more spills,
potentially fewer in-order pipeline stalls) and serialising (potentially
fewer spills, potentially more in-order pipeline stalls).

See https://gcc.gnu.org/legacy-ml/gcc-patches/2011-12/msg01684.html
for a write-up of how this approach came about.

I wrote the code at a previous employer, so don't have access to my notes.
I probably couldn't share the details of the specific workloads anyway,
due to the nature of the benchmark.

But conceptually, I think the reason spilling can help for in-order cores
is that scheduling tends to prioritise loads and deprioritise stores,
meaning the in-order load/store units tend to be underutilised in the
middle of a loop.

E.g. for the sake of a simple example, take a contrived situation in
which the core can issue:

- one vector load/store a cycle
- one vector ALU operation a cycle

Let's assume:

- loads have a latency of 3 and ALU ops a latency of 2
- good store-load forwarding
- there are only 2 architected vector registers(!)

Like I say, it's all for the sake of a simple example only.

Let's assume that the input code is:

  I1: R1 = *PTR1
  I2: R2 = op(R1)
  I3: R3 = op(R2)
  I4: R4 = op(R3)
  I5: R5 = op(R3)
  I6: R6 = op(R4,R5)
  I7: R7 = *PTR2
  I8: R8 = op(R7)
  I9: R9 = op(R6,R8)
  IA: *PTR3 = R9

This doesn't spill, and would issue as follows (left column: ALU slot,
right column: load/store slot, one line per cycle):

  -- I1
  -- --
  -- --
  I2 --
  -- --
  I3 --
  -- --
  I4 --
  I5 --
  -- --
  I6 I7
  -- --
  -- --
  I8 --
  -- --
  I9 --
  -- --
  -- IA

(18 cycles).  If we allow a spill, we can get:

  -- I1
  -- I7
  -- --
  I2 --
  I8 --
  I3 --
  -- XX
  I4 --
  I5 --
  -- --
  I6 YY
  -- --
  -- --
  I9 --
  -- --
  -- IA

(16 cycles, XX = spill, YY = reload).  Spilling has then improved
performance by ~11%.
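
For concreteness, one spilled sequence that is consistent with the second
schedule (this is just my reading of the example, not real compiler output)
spills R8, the only value that has to live across the long I3...I6 chain
once I7/I8 are hoisted, with XX storing it to a stack slot and YY
reloading it:

  I1: R1 = *PTR1
  I7: R7 = *PTR2
  I2: R2 = op(R1)
  I8: R8 = op(R7)
  I3: R3 = op(R2)
  XX: *SLOT = R8
  I4: R4 = op(R3)
  I5: R5 = op(R3)
  I6: R6 = op(R4,R5)
  YY: R8 = *SLOT
  I9: R9 = op(R6,R8)
  IA: *PTR3 = R9

At no point are more than two values live, so this fits the two architected
vector registers, and the spill/reload occupies load/store slots that the
non-spilling schedule leaves idle.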

Thanks,
Richard
