On Tue, 11 Jun 2024, Jeff Law wrote:

> 
> 
> On 6/11/24 7:52 AM, Philipp Tomsich wrote:
> > On Tue, 11 Jun 2024 at 15:37, Jeff Law <jeffreya...@gmail.com> wrote:
> >>
> >>
> >>
> >> On 6/11/24 1:22 AM, Richard Biener wrote:
> >>
> >>>> Absolutely.   But forwarding from a smaller store to a wider load is
> >>>> painful
> >>>> from a hardware standpoint and if we can avoid it from a codegen
> >>>> standpoint,
> >>>> we should.
> >>>
> >>> Note there's also the possibility to increase the distance between the
> >>> store and the load - in fact the time a store takes to a) retire and
> >>> b) get from the store buffers to where the load-store unit would pick it
> >>> up (L1-D) is another target specific tuning knob.  That said, if that
> >>> distance isn't too large (on x86 there might be only an upper bound
> >>> given by the OOO window size and the L1D store latency(?), possibly
> >>> also additionally by the store buffer size) attacking the issue in
> >>> sched1 or sched2 might be another possibility.  So I think pass placement
> >>> is another thing to look at - I'd definitely place it after sched1
> >>> but I guess without looking at the pass again it's way before that?
> >> True, but I doubt there are enough instructions we could sink the load
> >> past to make a measurable difference.  This is especially true on the
> >> class of uarchs where this is going to be most important.
> >>
> >> In the case where the store/load can't be interchanged and thus this new
> >> pass rejects any transformation, we could try to do something in the
> >> scheduler to defer the load as long as possible.  Essentially it's a
> >> true dependency through a memory location using must-aliasing properties
> >> and in that case we'd want to crank up the "latency" of the store so
> >> that the load gets pushed away.
> >>
> >> I think one of the difficulties here is we often model stores as not
> >> having any latency (which is probably OK in most cases).  Input data
> >> dependencies and structural hazards dominate considerations for
> >> stores.
> > 
> > I don't think that TARGET_SCHED_ADJUST_COST would even be called for a
> > data-dependence through a memory location.
> Probably correct, but we could adjust that behavior or add another mechanism
> to adjust costs based on memory dependencies.
> 
> > 
> > Note that, strictly speaking, the store does not have an extended
> > latency; it will be the load that will have an increased latency
> > (almost as if we knew that the load will miss to one of the outer
> > points-of-coherence).  The difference being that the load would not
> > hang around in a scheduling queue until being dispatched, but its
> > execution would start immediately and take more cycles (and
> > potentially block an execution pipeline for longer).
> Absolutely true.  I'm being imprecise in my language; increasing the "latency"
> of the store is really a proxy for "do something to encourage the load to move
> away from the store".
> 
> But overall rewriting the sequence is probably the better choice.  In my mind
> the scheduler approach would be a secondary attempt if we couldn't interchange
> the store/load.  And I'd make a small bet that its impact would be on the
> margins if we're doing a reasonable job in the new pass.
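
For concreteness, a minimal C sketch of the kind of sequence being discussed
(function names are mine, and the equivalence of the two forms assumes a
little-endian target): two narrow stores followed by a wider load of the same
slot are exactly the shape that defeats store-to-load forwarding, while
assembling the wide value in registers avoids the memory round-trip entirely:

```c
#include <stdint.h>
#include <string.h>

/* Pattern the new pass would like to rewrite: two narrow stores
   followed by a wider load from the same slot.  The 64-bit load
   cannot be serviced by store-to-load forwarding (most uarchs
   cannot merge two in-flight stores into one load), so it waits
   for both stores to drain.  */
static uint64_t combine_via_memory(uint32_t lo, uint32_t hi)
{
    uint32_t slot[2];
    slot[0] = lo;              /* narrow store                  */
    slot[1] = hi;              /* narrow store                  */
    uint64_t wide;
    memcpy(&wide, slot, 8);    /* wider load: forwarding fails  */
    return wide;
}

/* Rewritten form: build the wide value in registers, so there is
   no store/load pair to forward at all.  */
static uint64_t combine_in_regs(uint32_t lo, uint32_t hi)
{
    return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

Whether the rewrite pays off of course depends on register pressure at
the transformation point, which is presumably what the pass has to judge.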

One of the points I wanted to make is that sched1 can make quite a
difference to the relative distance of the store and load, and there's
the instruction window the pass considers when scanning (possibly
driven by target uarch details).  So doing the rewriting before sched1
might not be ideal (but I don't know how much cleanup work the pass
leaves behind - there's nothing between sched1 and RA).

On the hardware side I always wondered whether a failed store-to-load
forward results in the load uop stalling (because the hardware actually
_did_ see the conflict with an in-flight store) or whether this gets
caught later as the hardware speculates a load from L1 (with the
wrong value) but has to roll back because of the conflict.  I would
imagine the latter is cheaper to implement but worse in case of a
conflict.
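
As a purely illustrative example of the access shape in question (the
function name is hypothetical, and the expected value assumes a
little-endian target): a narrow store immediately followed by a wider,
overlapping load is architecturally well-defined, but it is exactly the
case where forwarding cannot supply the full load:

```c
#include <stdint.h>
#include <string.h>

/* Narrow store immediately followed by a wider, overlapping load.
   The result is well-defined C, but on most cores the 8-byte load
   cannot be forwarded from the 1-byte store and must stall (or
   replay) until the store drains to L1-D.  */
static uint64_t byte_store_then_wide_load(uint8_t b)
{
    uint8_t buf[8] = {0};
    buf[3] = b;                /* 1-byte store, still in flight */
    uint64_t wide;
    memcpy(&wide, buf, 8);     /* 8-byte load overlapping it    */
    return wide;
}
```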

Richard.
