On Tue, 11 Jun 2024 at 15:37, Jeff Law <jeffreya...@gmail.com> wrote:
>
> On 6/11/24 1:22 AM, Richard Biener wrote:
> >> Absolutely. But forwarding from a smaller store to a wider load is
> >> painful from a hardware standpoint and if we can avoid it from a
> >> codegen standpoint, we should.
> >
> > Note there's also the possibility to increase the distance between the
> > store and the load - in fact the time a store takes to a) retire and
> > b) get from the store buffers to where the load-store unit would pick it
> > up (L1-D) is another target-specific tuning knob. That said, if that
> > distance isn't too large (on x86 there might be only an upper bound
> > given by the OOO window size and the L1D store latency(?), possibly
> > also additionally by the store buffer size), attacking the issue in
> > sched1 or sched2 might be another possibility. So I think pass placement
> > is another thing to look at - I'd definitely place it after sched1,
> > but I guess without looking at the pass again it's way before that?
> True, but I doubt there are enough instructions we could sink the load
> past to make a measurable difference. This is especially true on the
> class of uarchs where this is going to be most important.
>
> In the case where the store/load can't be interchanged and thus this new
> pass rejects any transformation, we could try to do something in the
> scheduler to defer the load as long as possible. Essentially it's a
> true dependency through a memory location using must-aliasing properties,
> and in that case we'd want to crank up the "latency" of the store so
> that the load gets pushed away.
>
> I think one of the difficulties here is we often model stores as not
> having any latency (which is probably OK in most cases). Input data
> dependencies and structural hazards dominate considerations for stores.
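For concreteness, the "crank up the latency" idea would presumably look
something like the sketch below in a target's adjust_cost hook. Two
assumptions up front: that the hook is even reachable for memory
dependencies (see below for why I doubt that), and the GCC 7+
five-argument hook signature; mytarget_* and STLF_DISTANCE are made up.

static int
mytarget_sched_adjust_cost (rtx_insn *insn, int dep_type,
                            rtx_insn *dep_insn, int cost, unsigned int)
{
  rtx set = single_set (insn);
  rtx dep_set = single_set (dep_insn);

  /* DEP_INSN stores to memory; INSN is a plain (non-extending) load
     from memory.  With must-aliasing established by the new pass,
     inflate the edge cost so sched1/sched2 push the load away.  */
  if (dep_type == REG_DEP_TRUE
      && set && dep_set
      && MEM_P (SET_DEST (dep_set))
      && MEM_P (SET_SRC (set)))
    return MAX (cost, STLF_DISTANCE);

  return cost;
}

#undef TARGET_SCHED_ADJUST_COST
#define TARGET_SCHED_ADJUST_COST mytarget_sched_adjust_cost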
However, I don't think that TARGET_SCHED_ADJUST_COST would even be
called for a data-dependence through a memory location.

Note that, strictly speaking, the store does not have an extended
latency; it is the load that has the increased latency (almost as if we
knew that the load will miss to one of the outer points of coherence).
The difference is that the load would not hang around in a scheduling
queue until being dispatched: its execution would start immediately and
take more cycles (and potentially block an execution pipeline for
longer).

Philipp.
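P.S.: To make that distinction concrete, a toy model (assumed numbers,
not tied to any real uarch). "Model A" charges the penalty on the
store->load dependence edge, so the load is held back in the ready
queue; "model B" charges it as extra load latency, so the load issues
immediately but occupies an execution pipeline for longer:

#include <stdio.h>

enum { STALL = 6 };  /* assumed store-to-load-forwarding penalty, cycles */

int
main (void)
{
  /* Model A: inflated edge cost; the load waits in the ready queue.  */
  int a_load_issue = 0 + STALL;          /* store issues at cycle 0 */
  int a_load_done  = a_load_issue + 1;   /* nominal 1-cycle load    */

  /* Model B: inflated load latency; the load dispatches right away
     but executes for the whole penalty.  */
  int b_load_issue = 0 + 1;
  int b_load_done  = b_load_issue + STALL;

  printf ("model A: load issues @%d, result ready @%d\n",
          a_load_issue, a_load_done);
  printf ("model B: load issues @%d, result ready @%d\n",
          b_load_issue, b_load_done);
  return 0;
}

Both models make the result available at cycle 7; what differs is where
the load spends the intervening cycles, which is exactly what matters
for structural hazards.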