On 6/11/24 1:22 AM, Richard Biener wrote:
>> Absolutely. But forwarding from a smaller store to a wider load is painful
>> from a hardware standpoint and if we can avoid it from a codegen standpoint,
>> we should.
>
> Note there's also the possibility to increase the distance between the
> store and the load - in fact the time a store takes to a) retire and
> b) get from the store buffers to where the load-store unit would pick it
> up (L1-D) is another target specific tuning knob. That said, if that
> distance isn't too large (on x86 there might be only an upper bound
> given by the OOO window size and the L1D store latency(?), possibly
> also additionally by the store buffer size) attacking the issue in
> sched1 or sched2 might be another possibility. So I think pass placement
> is another thing to look at - I'd definitely place it after sched1
> but I guess without looking at the pass again it's way before that?

True, but I doubt there are enough instructions we could sink the load
past to make a measurable difference. This is especially true on the
class of uarchs where this is going to be most important.
In the case where the store/load can't be interchanged and thus this new
pass rejects any transformation, we could try to do something in the
scheduler to defer the load as long as possible. Essentially it's a
true dependency through a memory location using must-aliasing properties
and in that case we'd want to crank up the "latency" of the store so
that the load gets pushed away.
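
To make the "crank up the latency" idea concrete, here is a toy sketch. It is
not GCC's scheduler and uses none of its interfaces; the function name, the
instruction names and the latency values are all made up for illustration. It
is a single-issue greedy list scheduler where each dependency edge carries a
latency, and the only thing that changes between the two runs is the latency
on the store->load edge.

```python
def list_schedule(insns, edges):
    """Toy single-issue list scheduler (illustrative only, not GCC code).

    insns: instruction names in program order (also the tie-break priority).
    edges: dict mapping (pred, succ) -> latency in cycles; must be acyclic.
    Returns a dict mapping each instruction name to its issue cycle.
    """
    preds = {insn: [] for insn in insns}
    for (pred, succ), latency in edges.items():
        preds[succ].append((pred, latency))

    issued = {}
    cycle = 0
    while len(issued) < len(insns):
        for insn in insns:
            if insn in issued:
                continue
            # Ready once every predecessor has issued and its latency elapsed.
            if all(p in issued and issued[p] + lat <= cycle
                   for p, lat in preds[insn]):
                issued[insn] = cycle
                break  # single issue: at most one instruction per cycle
        cycle += 1
    return issued


# A store, a load that must-aliases it, three independent adds, and a use
# of the loaded value. All names and latencies are hypothetical.
insns = ["store", "load", "add1", "add2", "add3", "use"]

# Store modelled as having no latency: the load issues right behind it.
low = list_schedule(insns, {("store", "load"): 0, ("load", "use"): 1})

# Same code with the store->load edge cranked up to 4 cycles.
high = list_schedule(insns, {("store", "load"): 4, ("load", "use"): 1})

print("latency 0:", low)
print("latency 4:", high)
```

With the zero-latency edge the load issues in the cycle right after the
store. With the edge at 4 cycles the three independent adds get pulled in
between, and in this particular example the overall schedule length does not
change. That only works when there is independent work available to fill the
gap, which is the concern raised above.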
I think one of the difficulties here is we often model stores as not
having any latency (which is probably OK in most cases). Input data
dependencies and structural hazards dominate considerations for
stores.
jeff