Jeff Law <jeffreya...@gmail.com> writes:
> On 10/30/24 8:44 AM, Richard Sandiford wrote:
>
>>> But the data from the BPI (spacemit k1 chip) is an in-order core.
>>> Granted we don't have a good model of its pipeline, but it's definitely
>>> in-order.
>> 
>> Damn :)  (I did try to clarify what was being tested earlier, but the
>> response wasn't clear.)
>> 
>> So how representative is the DFA model being used for the BPI?
>> Is it more "pretty close, but maybe different in a few minor details"?
>> Or is it more "we're just using an existing DFA model for a different
>> core and hoping for the best"?  Is the issue width accurate?
>> 
>> If we're scheduling for an in-order core without an accurate pipeline
>> model then that feels like the first thing to fix.  Otherwise we're
>> in danger of GIGO.
> GIGO is a risk here -- there really isn't good data on the pipeline for 
> that chip, especially on the FP side.  I don't really have a good way to 
> test this on an in-order RISC-V target where there is a reasonable DFA 
> model.

OK (and yeah, I can sympathise).  But I think there's an argument that,
if you're scheduling for one in-order core using the pipeline of an
unrelated core, that's effectively scheduling for the core as though
it were out-of-order.  In other words, the property we care about
isn't so much whether the processor itself is in-order (a statement
about the uarch), but whether we trying to schedule for a particular
in-order pipeline (a statement about what GCC is doing or knows about).
I'd argue that in the case you describe, we're not trying to schedule
for a particular in-order pipeline.

That might need some finessing of the name.  But I think the concept
is right.  I'd rather base the hook (or param) on a general concept
like that rather than a specific "wide vs narrow" thing.
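
To make that concrete, the sort of thing I have in mind is a simple
target predicate along these lines (the names below are invented for
illustration and are not an existing hook or interface):

    /* Hypothetical sketch only: the property being queried is "are we
       scheduling for an in-order pipeline that GCC has an accurate DFA
       model for?", rather than "is the core itself in-order?".  */

    #include <stdbool.h>

    enum example_tune { EXAMPLE_TUNE_GENERIC_OOO, EXAMPLE_TUNE_KNOWN_INORDER };
    static enum example_tune example_selected_tune = EXAMPLE_TUNE_GENERIC_OOO;

    static bool
    example_sched_accurate_inorder_model_p (void)
    {
      /* Return true only when the selected tuning corresponds to an
         in-order core whose pipeline description we actually trust;
         otherwise treat the target as effectively out-of-order for
         the purposes of this heuristic.  */
      return example_selected_tune == EXAMPLE_TUNE_KNOWN_INORDER;
    }

A BPI-style setup (in-order hardware, borrowed DFA model) would then
answer "no", which I think matches the behaviour we'd want.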

> I still see Vineet's data as compelling, even with GIGO concern.

Do you mean the reduction in dynamic instruction counts?  If so,
that isn't what the algorithm is aiming to reduce.  Like I mentioned
in the previous thread, trying to minimise dynamic instruction counts
was also harmful for the core & benchmarks I was looking at.
We just ended up with lots of pipeline bubbles that could be
alleviated by judicious spilling.
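
Purely as a hypothetical illustration (the numbers below are made up,
not measurements): on a single-issue in-order core the cycle count is
roughly instructions plus stalls, so a slightly longer schedule can
still win.

    /* Hypothetical numbers only -- why minimising dynamic instruction
       count can lose on an in-order core once stalls are counted.  */
    #include <stdio.h>

    int
    main (void)
    {
      /* Schedule A: minimal instruction count, long dependency chains.  */
      int insns_a = 100, stalls_a = 20;
      /* Schedule B: a few extra spill/reload instructions that break
         the chains and let independent work hide latency.  */
      int insns_b = 106, stalls_b = 4;

      printf ("A: %d cycles\n", insns_a + stalls_a);  /* 120 cycles */
      printf ("B: %d cycles\n", insns_b + stalls_b);  /* 110 cycles */
      return 0;
    }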

I'm not saying that the algorithm gets the decision right for cactu
when tuning for in-order CPU X and running on that same CPU X.
But it seems that combination hasn't been tried, and that, even for
the combinations the patch has been tried on, the cactu justification
is based on static properties of the binary rather than a measured
runtime improvement (Y% faster).

To be clear, the two paragraphs above are trying to explain why I think
this should be behind a hook or param rather than unconditional.  The
changes themselves look fine, and incorporate the suggestions from the
previous thread (thanks!).

Richard
