Jeff Law <jeffreya...@gmail.com> writes:
> On 10/30/24 8:44 AM, Richard Sandiford wrote:
>
>>> But the data from the BPI (spacemit k1 chip) is an in-order core.
>>> Granted we don't have a good model of its pipeline, but it's definitely
>>> in-order.
>>
>> Damn :)  (I did try to clarify what was being tested earlier, but the
>> response wasn't clear.)
>>
>> So how representative is the DFA model being used for the BPI?
>> Is it more "pretty close, but maybe different in a few minor details"?
>> Or is it more "we're just using an existing DFA model for a different
>> core and hoping for the best"?  Is the issue width accurate?
>>
>> If we're scheduling for an in-order core without an accurate pipeline
>> model then that feels like the first thing to fix.  Otherwise we're
>> in danger of GIGO.
> GIGO is a risk here -- there really isn't good data on the pipeline for
> that chip, especially on the FP side.  I don't really have a good way to
> test this on an in-order RISC-V target where there is a reasonable DFA
> model.
OK (and yeah, I can sympathise).  But I think there's an argument that,
if you're scheduling for one in-order core using the pipeline of an
unrelated core, that's effectively scheduling for the core as though
it were out-of-order.

In other words, the property we care about isn't so much whether the
processor itself is in-order (a statement about the uarch), but whether
we're trying to schedule for a particular in-order pipeline (a statement
about what GCC is doing or knows about).  I'd argue that in the case
you describe, we're not trying to schedule for a particular in-order
pipeline.

That might need some finessing of the name.  But I think the concept is
right.  I'd rather base the hook (or param) on a general concept like
that rather than a specific "wide vs narrow" thing.

> I still see Vineet's data as compelling, even with GIGO concern.

Do you mean the reduction in dynamic instruction counts?  If so, that
isn't what the algorithm is aiming to reduce.  Like I mentioned in the
previous thread, trying to minimise dynamic instruction counts was also
harmful for the core & benchmarks I was looking at.  We just ended up
with lots of pipeline bubbles that could be alleviated by judicious
spilling.

I'm not saying that the algorithm gets the decision right for cactu
when tuning for in-order CPU X and running on that same CPU X.  But it
seems like that combination hasn't been tried, and that, even on the
combinations that the patch has been tried on, the cactu justification
is based on static properties of the binary rather than a particular
runtime improvement (Y% faster).

To be clear, the two paragraphs above are trying to explain why I think
this should be behind a hook or param rather than unconditional.  The
changes themselves look fine, and incorporate the suggestions from the
previous thread (thanks!).

Richard
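
P.S. To make the "general concept" suggestion above concrete, here is a
very rough sketch of the shape of predicate I have in mind.  All of the
names below are invented purely for illustration; none of this is taken
from the patch, and the real thing could equally well be a --param
rather than a hook.

  #include <stdbool.h>

  /* Sketch only: every name here is hypothetical.  The predicate is
     about what GCC knows, not about the uarch: are we scheduling for
     a particular in-order pipeline whose DFA description we actually
     trust?  */

  struct tune_info
  {
    /* Hypothetical tuning flag: true only when the scheduler (DFA)
       description was written for this specific in-order pipeline,
       rather than borrowed from an unrelated core.  */
    bool accurate_inorder_model_p;
  };

  /* Points at the tuning information for the core we are tuning for.  */
  static const struct tune_info *tune_info;

  static bool
  sched_known_inorder_pipeline_p (void)
  {
    return tune_info && tune_info->accurate_inorder_model_p;
  }

The new behaviour would then be keyed off something like this, so that
a core that reuses an unrelated core's DFA model is treated the same
way as an out-of-order core, even though the hardware itself is
in-order.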