On Mon, 10 Jun 2024 at 20:03, Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 6/10/24 1:55 AM, Manolis Tsamis wrote:
>
> >>
> > There was an older submission of a load-pair-specific pass, but this is
> > a complete reimplementation and indeed significantly more general.
> > Apart from being target independent, it addresses a number of
> > important restrictions and can handle multiple store forwardings per
> > load.
> > It should be noted that it cannot handle the load-pair cases as these
> > need special handling, but that's something we're planning to do in
> > the future by reusing this infrastructure.
> ACK.  Thanks for the additional background.
>
>
> >
> >>
> >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> >>> index 4e8967fd8ab..c769744d178 100644
> >>> --- a/gcc/doc/invoke.texi
> >>> +++ b/gcc/doc/invoke.texi
> >>> @@ -12657,6 +12657,15 @@ loop unrolling.
> >>>    This option is enabled by default at optimization levels @option{-O1},
> >>>    @option{-O2}, @option{-O3}, @option{-Os}.
> >>>
> >>> +@opindex favoid-store-forwarding
> >>> +@item -favoid-store-forwarding
> >>> +@itemx -fno-avoid-store-forwarding
> >>> +Many CPUs will stall for many cycles when a load partially depends on previous
> >>> +smaller stores.  This pass tries to detect such cases and avoid the penalty by
> >>> +changing the order of the load and store and then fixing up the loaded value.
> >>> +
> >>> +Disabled by default.
> >> Is there any particular reason why this would be off by default at -O1
> >> or higher?  It would seem to me that on modern cores that this
> >> transformation should easily be a win.  Even on an old in-order core,
> >> avoiding the load with the bit insert is likely profitable, just not as
> >> much so.
> >>
> > I don't have a strong opinion on that, but I believe Richard's
> > suggestion to decide this on a per-target basis also makes a lot of
> > sense.
> > Deciding whether the transformation is profitable is tightly tied to
> > the architecture in question (i.e., how large the stall is and what
> > sort of bit-insert instructions are available).
> > In order to make this more widely applicable, I think we'll need a
> > target hook that decides in which cases the forwarded stores incur a
> > penalty and thus the transformation makes sense.
> You and Richi are probably right.   I'm not a big fan of passes being
> enabled/disabled on particular targets, but it may make sense here.
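
For what it's worth, the kind of per-target query Manolis describes above
might look something like the sketch below.  This is entirely hypothetical:
the name, parameters and placement are invented for illustration and are
not part of the patch or of trunk.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch only (no such hook exists today): would forwarding
   from this pending store to this load take the slow path on the cores we
   schedule for?  Offsets and sizes are in bytes within the same object.  */
static bool
store_forwarding_penalty_p (size_t store_off, size_t store_size,
                            size_t load_off, size_t load_size)
{
  size_t store_end = store_off + store_size;
  size_t load_end = load_off + load_size;

  /* Disjoint accesses: nothing is forwarded, so no penalty.  */
  if (store_end <= load_off || load_end <= store_off)
    return false;

  /* The case discussed above: the load wants bytes the narrower store only
     partially provides.  Targets that handle this efficiently would simply
     return false here and the transformation would not fire.  */
  return store_off > load_off || store_end < load_end;
}

Whether the answer should also reflect the cost of the available bit-insert
instructions, as Manolis notes, is something a backend could fold into the
same query.
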
>
>
>
> > AFAIK, for each CPU there may be cases where store forwarding is
> > handled efficiently.
> Absolutely.   But forwarding from a smaller store to a wider load is
> painful from a hardware standpoint and if we can avoid it from a codegen
> standpoint, we should.

This change is what I briefly hinted at as "the complete solution" we
had on the drawing board when we talked last November in Santa Clara.
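
To make the pattern concrete, here is a minimal C illustration (mine, not
taken from the patch or its testsuite).  The second function has roughly
the shape the pass is after: the wide load is hoisted above the narrow
store and the stored byte is patched back in with a bit insert.

#include <string.h>

/* Hypothetical illustration only: a narrow store immediately followed by a
   wider load of the same bytes.  This is the shape that makes many cores
   stall on store forwarding.  */
int
read_word (unsigned char *p, unsigned char t)
{
  int v;
  p[0] = t;                   /* 1-byte store                        */
  memcpy (&v, p, sizeof v);   /* 4-byte load overlapping that store  */
  return v;
}

/* Roughly the shape the pass rewrites this into: do the wide load first,
   keep the store, and fix up the loaded value with a bit insert
   (little-endian layout assumed here).  */
int
read_word_rewritten (unsigned char *p, unsigned char t)
{
  int v;
  memcpy (&v, p, sizeof v);   /* load no longer depends on the store */
  p[0] = t;
  return (v & ~0xff) | t;     /* patch the stored byte back in       */
}

The first form is the one the option documentation describes as stalling;
the second trades the stall for a couple of cheap ALU operations.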

> Did y'all look at spec2017 at all for this patch?  I've got our hardware
> guys to expose a signal for this case so that we can (in a month or so)
> get some hard data on how often it's happening in spec2017 and evaluate
> how this patch helps the most affected workloads.  But if y'all already
> have some data we can use it as a starting point.

We have looked at all of SPEC2017, especially for coverage (i.e.,
making sure we see a significant number of uses of the transformation)
and correctness.  The gcc_r and parest_r components triggered in a
number of "interesting" ways (e.g., motivating the case of
load-elimination).  If it helps, we could share statistics on how often
the pass triggers when compiling each of the SPEC2017 components.

Philipp.
