On 10/30/24 10:25, Jeff Law wrote:
> On 10/30/24 9:31 AM, Richard Sandiford wrote:
>> That might need some finessing of the name.  But I think the concept
>> is right.  I'd rather base the hook (or param) on a general concept
>> like that rather than a specific "wide vs narrow" thing.
> Agreed.  Naming was really my only concern about the first patch.

We are leaning towards
  - TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE
  - targetm.sched.pressure_spill_aggressive

Targets could wire these up however they like.
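For concreteness, a target might wire the hook up roughly like this (a minimal sketch only; the hook name comes from this thread, but the signature and the tuning flag are my assumptions, since 1/4 defines the actual interface):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the target's real uarch/tuning information.  */
static bool tune_in_order = true;

/* Sketch of a target implementation: a target could, e.g., allow the
   register-pressure scheduler to spill more aggressively only when tuning
   for an in-order core, and keep the current behaviour otherwise.  */
static bool
sched_pressure_spill_aggressive (void)
{
  return tune_in_order;
}
```

In a real backend this would be registered in the usual way, i.e. `#define TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE <target>_sched_pressure_spill_aggressive` next to the other `targetm.sched` hooks.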

>>> I still see Vineet's data as compelling, even with GIGO concern.
>> Do you mean the reduction in dynamic instruction counts?  If so,
>> that isn't what the algorithm is aiming to reduce.  Like I mentioned
>> in the previous thread, trying to minimise dynamic instruction counts
>> was also harmful for the core & benchmarks I was looking at.
>> We just ended up with lots of pipeline bubbles that could be
>> alleviated by judicious spilling.
> Vineet showed significant cycle and icount improvements.  I'm much more 
> interested in the former :-)

The initial premise was indeed icounts, but with recent access to some credible 
hardware I'm all for cycle-based perf measurement now.

Please look at patch 2/4 [1] for actual perf data, both cycles and instructions.
I kept 1/4, which introduces the hook, separate from 2/4, which implements it 
for RISC-V.

    [1] https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665945.html

As Jeff mentioned, on an in-order RISC-V core we are seeing a 6% cycle 
improvement from the hook and another 6% cycle improvement from patch 3/4.

Wilco also gave this a spin on a high-end OoO Neoverse and seems to be seeing a 
20% improvement, which I gather is in cycles.


> I'm planning to run it on our internal design, but it's not the top of 
> the priority list and it's a scarce resource right now...  I fully 
> expect it'll show a cycle improvement there too, though probably much 
> smaller than the improvement seen on that spacemit k1 design.

That would be another out of order data point.

>> I'm not saying that the algorithm gets the decision right for cactu
>> when tuning for in-order CPU X and running on that same CPU X.
>> But it seems like that combination hasn't been tried, and that,
>> even on the combinations that the patch has been tried on, the cactu
>> justification is based on static properties of the binary rather than
>> a particular runtime improvement (Y% faster).

I've asked Wilco to try this on some in-order Arm cores as well.

I have a couple of RPis lying around. The RPI3 seems to have an A53, which per 
the specs is a partially dual-issue yet in-order core. I also presume gcc has a
fairly accurate model of the A53 pipeline.
I'm planning to bake a vanilla -mcpu="cortex-a53+neon" build and give that a 
spin - do I need to do anything extra for cpu/uarch?
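In other words, something like the following (my guess at the invocation; I believe -mcpu already selects both the ISA extensions and the matching pipeline/tuning model, so no extra -mtune should be needed):

```shell
# Hypothetical build invocation on the RPi3 itself; bench.c is a placeholder.
# -mcpu=cortex-a53+neon should pull in both the ISA and the A53 scheduling model.
gcc -O2 -mcpu=cortex-a53+neon -o bench bench.c
```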

>> To be clear, the two paragraphs above are trying to explain why I think
>> this should be behind a hook or param rather than unconditional.  The
>> changes themselves look fine, and incorporate the suggestions from the
>> previous thread (thanks!).
> Thanks for that clarifying statement.  I actually think we're broadly in 
> agreement here -- keep it as a hook/param rather than making it 
> unconditional.
>
> Assuming we keep it as a hook/param, opt-in & come up with better 
> name/docs, any objections from your side?

Yes, and please also take a look at 3/4, which is a different fix altogether and 
is not gated behind any hook.

Thx,
-Vineet
