On 10/30/24 10:25, Jeff Law wrote:
> On 10/30/24 9:31 AM, Richard Sandiford wrote:
>> That might need some finessing of the name.  But I think the concept
>> is right.  I'd rather base the hook (or param) on a general concept
>> like that rather than a specific "wide vs narrow" thing.
> Agreed.  Naming was my only real concern about the first patch.
We are leaning towards:

  - TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE
  - targetm.sched.pressure_spill_aggressive

Targets could wire them up however they like.

>>> I still see Vineet's data as compelling, even with the GIGO concern.
>> Do you mean the reduction in dynamic instruction counts?  If so,
>> that isn't what the algorithm is aiming to reduce.  Like I mentioned
>> in the previous thread, trying to minimise dynamic instruction counts
>> was also harmful for the core & benchmarks I was looking at.
>> We just ended up with lots of pipeline bubbles that could be
>> alleviated by judicious spilling.
> Vineet showed significant cycle and icount improvements.  I'm much more
> interested in the former :-)

The initial premise was indeed icounts, but with recent access to some
credible hardware I'm all for perf measurement now.

Please look at patch 2/4 [1] for actual perf data, both cycles and
instructions.  I kept 1/4, which introduces the hook, separate from 2/4,
which implements the hook for RISC-V.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665945.html

As Jeff mentioned, on an in-order RISC-V core we are seeing a 6% cycle
improvement from the hook and another 6% cycle improvement from patch 3/4.

Also, Wilco gave this a spin on a high-end OoO Neoverse and seems to be
seeing a 20% improvement, which I gather is cycles.

> I'm planning to run it on our internal design, but it's not the top of
> the priority list and it's a scarce resource right now...  I fully
> expect it'll show a cycle improvement there too, though probably much
> smaller than the improvement seen on that spacemit k1 design.

That would be another out-of-order data point.

>> I'm not saying that the algorithm gets the decision right for cactu
>> when tuning for in-order CPU X and running on that same CPU X.
>> But it seems like that combination hasn't been tried, and that,
>> even on the combinations that the patch has been tried on, the cactu
>> justification is based on static properties of the binary rather than
>> a particular runtime improvement (Y% faster).

I'd requested Wilco to possibly try this on some in-order Arm cores.

I have a couple of RPis lying around.  The RPi3 has an A53, which per
the specs is a partial dual-issue yet in-order core.  I also presume GCC
has a fairly accurate model of the A53 pipeline.  I'm planning to bake a
vanilla -mcpu=cortex-a53+neon build and give that a spin -- do I need to
do anything extra for the cpu/uarch?

>> To be clear, the two paragraphs above are trying to explain why I think
>> this should be behind a hook or param rather than unconditional.  The
>> changes themselves look fine, and incorporate the suggestions from the
>> previous thread (thanks!).
> Thanks for that clarifying statement.  I actually think we're broadly in
> agreement here -- keep it as a hook/param rather than making it
> unconditional.
>
> Assuming we keep it as a hook/param, opt-in & come up with better
> name/docs, any objections from your side?

Yes, and also please take a look at 3/4, which is a different fix
altogether and not gated behind any hook.

Thx,
-Vineet