On Tue, 11 Jun 2024 at 15:37, Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 6/11/24 1:22 AM, Richard Biener wrote:
>
> >> Absolutely.  But forwarding from a smaller store to a wider load is
> >> painful from a hardware standpoint, and if we can avoid it from a
> >> codegen standpoint, we should.
> >
> > Note there's also the possibility to increase the distance between the
> > store and the load - in fact the time a store takes to a) retire and
> > b) get from the store buffers to where the load-store unit would pick it
> > up (L1-D) is another target-specific tuning knob.  That said, if that
> > distance isn't too large (on x86 there might be only an upper bound
> > given by the OOO window size and the L1D store latency(?), possibly
> > also by the store buffer size), attacking the issue in
> > sched1 or sched2 might be another possibility.  So I think pass placement
> > is another thing to look at - I'd definitely place it after sched1
> > but I guess without looking at the pass again it's way before that?
> True, but I doubt there are enough instructions we could sink the load
> past to make a measurable difference.  This is especially true on the
> class of uarchs where this is going to be most important.
>
> In the case where the store/load can't be interchanged and thus this new
> pass rejects any transformation, we could try to do something in the
> scheduler to defer the load as long as possible.  Essentially it's a
> true dependency through a memory location (established using
> must-aliasing properties), and in that case we'd want to crank up the
> "latency" of the store so that the load gets pushed away.
>
> I think one of the difficulties here is we often model stores as not
> having any latency (which is probably OK in most cases).  Input data
> dependencies and structural hazards dominate considerations for
> stores.

I don't think that TARGET_SCHED_ADJUST_COST would even be called for a
data dependence through a memory location.
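
If it were called for such a dependence, the target-side bump itself
would be straightforward.  Purely as a hedged sketch (the hook name,
the naive store/load predicate, and the +8 cycle figure are invented
for illustration; a real implementation would at least also have to
handle loads wrapped in extensions):

/* Hypothetical sketch: inflate the cost of a store->load true
   dependence so the scheduler pushes the dependent load away.  */

static int
cpu_sched_adjust_cost (rtx_insn *insn, int dep_type, rtx_insn *dep_insn,
                       int cost, unsigned int dw ATTRIBUTE_UNUSED)
{
  if (dep_type == REG_DEP_TRUE)
    {
      rtx prod = single_set (dep_insn);  /* producer: possibly a store */
      rtx cons = single_set (insn);      /* consumer: possibly a load  */

      if (prod && cons
          && MEM_P (SET_DEST (prod))
          && MEM_P (SET_SRC (cons)))
        return cost + 8;                 /* illustrative, not tuned    */
    }
  return cost;
}

#undef TARGET_SCHED_ADJUST_COST
#define TARGET_SCHED_ADJUST_COST cpu_sched_adjust_cost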

Note that, strictly speaking, the store does not have an extended
latency; it is the load that has an increased latency (almost as if
we knew that the load will miss all the way to one of the outer
points of coherence).  The difference is that the load would not
hang around in a scheduling queue until it is dispatched; its
execution would start immediately and take more cycles (potentially
blocking an execution pipeline for longer).
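
To make that difference concrete, here is a toy cycle-count model
(standalone C, nothing GCC-specific, all numbers made up; it assumes a
single non-pipelined load/store unit for simplicity):

#include <stdio.h>

int
main (void)
{
  const int penalty = 8;          /* assumed forwarding penalty */

  /* Model A: penalty on the store->load dependence edge.  The
     dependent load waits in the ready queue, but once issued it
     occupies the unit for only one cycle.  */
  int a_issue = penalty;          /* load waits in the ready queue */
  int a_done = a_issue + 1;
  int a_next = 1;                 /* independent load issues early */

  /* Model B: penalty in the load's own execution, as described
     above.  The load issues immediately but executes for longer,
     so an independent load behind it sees the unit busy.  */
  int b_issue = 1;                /* load issues right away */
  int b_done = b_issue + penalty;
  int b_next = b_done;            /* unit blocked for the next load */

  printf ("A: dep load cycles %d..%d, next load issues at %d\n",
          a_issue, a_done, a_next);
  printf ("B: dep load cycles %d..%d, next load issues at %d\n",
          b_issue, b_done, b_next);
  return 0;
}

Under both models the dependent result becomes available at about the
same time; the schedulable difference shows up in when the independent
load can go.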

Philipp.
