On Mon, 10 Jun 2024, Jeff Law wrote:

> 
> 
> On 6/10/24 1:55 AM, Manolis Tsamis wrote:
> 
> >>
> > There was an older submission of a load-pair-specific pass, but this is
> > a complete reimplementation and indeed significantly more general.
> > Apart from being target independent, it addresses a number of
> > important restrictions and can handle multiple forwarded stores per
> > load.
> > It should be noted that it cannot handle the load-pair cases, as these
> > need special handling, but that's something we're planning to do in
> > the future by reusing this infrastructure.
> ACK.  Thanks for the additional background.
> 
> 
> > 
> >>
> >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> >>> index 4e8967fd8ab..c769744d178 100644
> >>> --- a/gcc/doc/invoke.texi
> >>> +++ b/gcc/doc/invoke.texi
> >>> @@ -12657,6 +12657,15 @@ loop unrolling.
> >>>    This option is enabled by default at optimization levels @option{-O1},
> >>>    @option{-O2}, @option{-O3}, @option{-Os}.
> >>>
> >>> +@opindex favoid-store-forwarding
> >>> +@item -favoid-store-forwarding
> >>> +@itemx -fno-avoid-store-forwarding
> >>> +Many CPUs will stall for many cycles when a load partially depends on previous
> >>> +smaller stores.  This pass tries to detect such cases and avoid the penalty by
> >>> +changing the order of the load and store and then fixing up the loaded value.
> >>> +
> >>> +Disabled by default.
> >> Is there any particular reason why this would be off by default at -O1
> >> or higher?  It would seem to me that on modern cores this
> >> transformation should easily be a win.  Even on an old in-order core,
> >> avoiding the load with the bit insert is likely profitable, just not as
> >> much so.
> >>
> > I don't have a strong opinion on that, but I believe Richard's
> > suggestion to decide this on a per-target basis also makes a lot of
> > sense.
> > Deciding whether the transformation is profitable is tightly tied to
> > the architecture in question (e.g. how large the stall is and what
> > sort of bit-insert instructions are available).
> > In order to make this more widely applicable, I think we'll need a
> > target hook that decides in which cases the forwarded stores incur a
> > penalty and thus the transformation makes sense.
> You and Richi are probably right.   I'm not a big fan of passes being
> enabled/disabled on particular targets, but it may make sense here.
> 
> 
> 
> > AFAIK, for each CPU there may be cases where store forwarding is
> > handled efficiently.
> Absolutely.   But forwarding from a smaller store to a wider load is painful
> from a hardware standpoint, and if we can avoid it from a codegen standpoint,
> we should.

Note there's also the possibility to increase the distance between the
store and the load - in fact the time a store takes to a) retire and
b) get from the store buffers to where the load-store unit would pick it
up (L1-D) is another target-specific tuning knob.  That said, if that
distance isn't too large (on x86 there might be only an upper bound
given by the OOO window size and the L1D store latency(?), possibly
also by the store buffer size), attacking the issue in
sched1 or sched2 might be another possibility.  So I think pass placement
is another thing to look at - I'd definitely place it after sched1,
but I guess without looking at the pass again it's way before that?

Richard.

> Did y'all look at spec2017 at all for this patch?  I've got our hardware guys
> to expose a signal for this case so that we can (in a month or so) get some
> hard data on how often it's happening in spec2017 and evaluate how this patch
> helps the most affected workloads.  But if y'all already have some data we can
> use it as a starting point.
> 
> jeff
