On 6/11/24 7:52 AM, Philipp Tomsich wrote:
On Tue, 11 Jun 2024 at 15:37, Jeff Law <jeffreya...@gmail.com> wrote:
On 6/11/24 1:22 AM, Richard Biener wrote:
Absolutely. But forwarding from a smaller store to a wider load is painful
from a hardware standpoint, and if we can avoid it from a codegen standpoint,
we should.
Note there's also the possibility to increase the distance between the
store and the load - in fact the time a store takes to a) retire and
b) get from the store buffers to where the load-store unit would pick it
up (L1-D) is another target specific tuning knob. That said, if that
distance isn't too large (on x86 there might be only an upper bound
given by the OOO window size and the L1D store latency(?), possibly
also by the store buffer size), attacking the issue in
sched1 or sched2 might be another possibility. So I think pass placement
is another thing to look at - I'd definitely place it after sched1
but I guess without looking at the pass again it's way before that?
True, but I doubt there are enough instructions we could sink the load
past to make a measurable difference. This is especially true on the
class of uarchs where this is going to be most important.
In the case where the store/load can't be interchanged and thus this new
pass rejects any transformation, we could try to do something in the
scheduler to defer the load as long as possible. Essentially it's a
true dependency through a memory location with must-aliasing properties,
and in that case we'd want to crank up the "latency" of the store so
that the load gets pushed away.
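[The "push the load away" idea can be shown at the source level with a hand-scheduled sketch. This is only a model of what sched1/sched2 would do on RTL; the function names and the spill-slot framing are invented for the example.]

```c
#include <stdint.h>

/* Baseline: the load at (2) consumes the store at (1) immediately, so
   any store-forwarding delay is fully exposed on the critical path.  */
int64_t
spill_reload_close (int32_t *slot, int32_t v, int64_t a, int64_t b)
{
  *slot = v;                 /* (1) store                          */
  int32_t r = *slot;         /* (2) dependent load, right behind it */
  return r + a * b;          /* independent work happens afterwards */
}

/* "Scheduled" variant: the independent multiply is moved between the
   store and the load, increasing their distance so the store has more
   time to become forwardable or to reach L1D before the load issues.  */
int64_t
spill_reload_far (int32_t *slot, int32_t v, int64_t a, int64_t b)
{
  *slot = v;                 /* (1) store                          */
  int64_t t = a * b;         /* independent work fills the gap     */
  int32_t r = *slot;         /* (2) dependent load, further away   */
  return r + t;
}
```

Both variants compute the same value; the only difference is how much independent work separates the store from the must-alias load, which is exactly the knob a scheduler-based approach would turn.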
I think one of the difficulties here is we often model stores as not
having any latency (which is probably OK in most cases). Input data
dependencies and structural hazards dominate considerations for
stores.
I don't think that TARGET_SCHED_ADJUST_COST would even be called for a
data-dependence through a memory location.
Probably correct, but we could adjust that behavior or add another
mechanism to adjust costs based on memory dependencies.
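[As a toy model of what such a memory-dependence cost adjustment could look like: the sketch below mimics only the *shape* of an adjust-cost hook. The types, enum values, and penalty constant are simplified stand-ins invented for this example, not GCC's actual rtx_insn or target-hook machinery.]

```c
#include <stdbool.h>

/* Toy stand-ins for instructions and dependence kinds; real GCC works
   on rtx_insn and dependence types, this only models the policy.  */
enum insn_kind { INSN_STORE, INSN_LOAD, INSN_OTHER };
enum dep_kind  { DEP_REG, DEP_MEM };

struct toy_insn { enum insn_kind kind; };

/* Extra cycles charged so the scheduler pushes a must-alias load away
   from the store feeding it -- a tunable, target-specific knob.  */
#define STORE_FORWARD_PENALTY 4

/* Given a consumer, its producer, the dependence kind, and the default
   cost, return a possibly increased cost for store->load dependences
   carried through memory.  */
int
toy_adjust_cost (const struct toy_insn *consumer,
                 const struct toy_insn *producer,
                 enum dep_kind dep, int cost)
{
  if (dep == DEP_MEM
      && producer->kind == INSN_STORE
      && consumer->kind == INSN_LOAD)
    return cost + STORE_FORWARD_PENALTY;
  return cost;
}
```

In GCC terms this would mean either teaching the dependence-cost machinery to invoke the target for memory dependences, or adding a new hook; the sketch only demonstrates the intended cost policy, not where it would be wired in.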
Note that, strictly speaking, the store does not have an extended
latency; it will be the load that will have an increased latency
(almost as if we knew that the load will miss to one of the outer
points-of-coherence). The difference being that the load would not
hang around in a scheduling queue until being dispatched, but its
execution would start immediately and take more cycles (and
potentially block an execution pipeline for longer).
Absolutely true. I'm being imprecise in my language, increasing the
"latency" of the store is really a proxy for "do something to encourage
the load to move away from the store".
But overall rewriting the sequence is probably the better choice. In my
mind the scheduler approach would be a secondary attempt if we couldn't
interchange the store/load. And I'd make a small bet that its impact
would be on the margins if we're doing a reasonable job in the new pass.
Jeff