On Sat, Jun 8, 2024 at 1:31 AM Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 6/6/24 4:10 AM, Manolis Tsamis wrote:
> > This pass detects cases of expensive store forwarding and tries to avoid
> > them by reordering the stores and using suitable bit insertion sequences.
> > For example it can transform this:
> >
> >       strb    w2, [x1, 1]
> >       ldr     x0, [x1]      # Expensive store forwarding to larger load.
> >
> > To:
> >
> >       ldr     x0, [x1]
> >       strb    w2, [x1, 1]
> >       bfi     x0, x2, 8, 8
> >
> > Assembly like this can appear with bitfields or type punning / unions.
> > On stress-ng, when running the cpu-union microbenchmark, the following
> > speedups have been observed:
> >
> >    Neoverse-N1:      +29.4%
> >    Intel Coffeelake: +13.1%
> >    AMD 5950X:        +17.5%
> >
> > gcc/ChangeLog:
> >
> >       * Makefile.in: Add avoid-store-forwarding.o.
> >       * common.opt: New option -favoid-store-forwarding.
> >       * params.opt: New param store-forwarding-max-distance.
> >       * doc/invoke.texi: Document new pass.
> >       * doc/passes.texi: Document new pass.
> >       * passes.def: Schedule a new pass.
> >       * tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
> >       * avoid-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
> >       * gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
> >       * gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
> >       * gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
> So this is getting a lot more interesting.  I think the first time I
> looked at this it was more concerned with stores feeding something like
> a load-pair and avoiding the store forwarding penalty for that case.  Am
> I mis-remembering, or did it get significantly more general?
>
There was an older submission of a load-pair-specific pass, but this is
a complete reimplementation and indeed significantly more general.
Apart from being target independent, it lifts a number of important
restrictions and can handle multiple store forwardings per load.
It should be noted that it cannot handle the load-pair cases, as these
need special handling, but that's something we're planning to do in
the future by reusing this infrastructure.
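To give a concrete idea, here's a minimal sketch of the kind of code
the pass now targets (in the spirit of the new tests; the names are
made up):

  #include <stdint.h>

  typedef union {
    uint32_t whole;
    struct {
      uint16_t lo;
      uint16_t hi;
    } parts;
  } u32_parts;

  /* Two half-word stores to the stack slot immediately followed by a
     full-word load of it: both stores forward into the load, which
     stalls on many cores.  */
  uint32_t
  combine (uint16_t lo, uint16_t hi)
  {
    u32_parts u;
    u.parts.lo = lo;
    u.parts.hi = hi;
    return u.whole;
  }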

>
>
>
>
> > +
> > +static unsigned int stats_sf_detected = 0;
> > +static unsigned int stats_sf_avoided = 0;
> > +
> > +static rtx
> > +get_load_mem (rtx expr)
> Needs a function comment.  You should probably mention that EXPR must be
> a single_set in that comment.
>
Ok, will do.

>
>
> > +
> > +  rtx dest;
> > +  if (eliminate_load)
> > +    dest = gen_reg_rtx (load_inner_mode);
> > +  else
> > +    dest = SET_DEST (load);
> > +
> > +  int move_to_front = -1;
> > +  int total_cost = 0;
> > +
> > +      /* Check if we can emit bit insert instructions for all forwarded
> > +         stores.  */
> > +  FOR_EACH_VEC_ELT (stores, i, it)
> > +    {
> > +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
> > +      rtx_insn *insns = NULL;
> > +
> > +      /* If we're eliminating the load then find the store with zero offset
> > +      and use it as the base register to avoid a bit insert.  */
> > +      if (eliminate_load && it->offset == 0)
> So how often is this triggering?  We have various code in the gimple
> optimizers to detect a store followed by a load from the same address
> and do the forwarding.  If these cases are happening with any frequency,
> that would be a good indication the code in DOM and elsewhere isn't
> working well.
>
> The way these passes detect this case is to take the store, flip the
> operands around (i.e., it looks like a load) and enter that into the
> expression hash tables.  After that, standard redundancy elimination
> approaches will work.
>
This condition doesn't handle that case (a single store forwarding to a
same-sized load) but rather any number of smaller, non-overlapping
stores that together cover the bytes of the load, in which case the
load can be eliminated. As the comment at the top mentions: if the
stores cover all the bytes of the load without overlap, then we can
eliminate the load entirely and use the computed value instead.

The specific condition (eliminate_load && it->offset == 0) is there
just to find the store that doesn't need an offset relative to the load
and use it as a "base" for the rest of the bit-insert sequences.
Originally we didn't special-case this and used an uninitialized base
register followed by N bit-insert sequences, but the result was usually
suboptimal (e.g. R = zero, R = bit-insert X1, R = bit-insert X2).
If we know that the load will be eliminated, we can instead choose one
of the stored values as the base and then do N - 1 subsequent
bit-inserts (i.e. R = X1, R = bit-insert X2).
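In C-level pseudocode, for a 32-bit load fully covered by four byte
stores b0..b3, the reconstructed value would conceptually look like
this (little-endian; illustrative only, not the emitted RTL):

  #include <stdint.h>

  /* The store at offset 0 becomes the base, so only three inserts are
     needed instead of four inserts into an uninitialized register.
     The ORs stand in for the bit-insert instructions; they are
     equivalent here because the stores don't overlap.  */
  static inline uint32_t
  reconstruct (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
  {
    uint32_t r = b0;              /* base value, no bit-insert needed */
    r |= (uint32_t) b1 << 8;      /* insert byte at offset 1 */
    r |= (uint32_t) b2 << 16;     /* insert byte at offset 2 */
    r |= (uint32_t) b3 << 24;     /* insert byte at offset 3 */
    return r;
  }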

>
> > +     {
> > +       start_sequence ();
> > +
> > +       /* We can use a paradoxical subreg to force this to a wider mode, as
> > +          the only use will be inserting the bits (i.e., we don't care about
> > +          the value of the higher bits).  */
> Which may be a good hint about the cases you're capturing -- if the
> modes/sizes differ that would make more sense since I don't think we're
> as likely to be capturing those cases.
>
Right, that's the case.
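For reference, the widening could be done with something along these
lines (a sketch using lowpart_subreg; the patch's exact code may
differ):

  /* View the narrow stored value in the wider mode of the load via a
     paradoxical subreg; the high bits are undefined, which is fine
     because only the inserted low bits are ever used.  */
  rtx wide_value = lowpart_subreg (load_inner_mode, it->mov_reg,
                                   GET_MODE (it->mov_reg));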

>
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 4e8967fd8ab..c769744d178 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -12657,6 +12657,15 @@ loop unrolling.
> >   This option is enabled by default at optimization levels @option{-O1},
> >   @option{-O2}, @option{-O3}, @option{-Os}.
> >
> > +@opindex favoid-store-forwarding
> > +@item -favoid-store-forwarding
> > +@itemx -fno-avoid-store-forwarding
> > +Many CPUs will stall for many cycles when a load partially depends on
> > +previous smaller stores.  This pass tries to detect such cases and avoid
> > +the penalty by changing the order of the load and store and then fixing
> > +up the loaded value.
> > +
> > +Disabled by default.
> Is there any particular reason why this would be off by default at -O1
> or higher?  It would seem to me that on modern cores that this
> transformation should easily be a win.  Even on an old in-order core,
> avoiding the load with the bit insert is likely profitable, just not as
> much so.
>
I don't have a strong opinion on that, but I believe Richard's
suggestion to decide this on a per-target basis also makes a lot of
sense. Whether the transformation is profitable is tightly tied to
the architecture in question (i.e. how large the stall is and what
sort of bit-insert instructions are available).
To make this more widely applicable, I think we'll need a target hook
that decides in which cases the forwarded stores incur a penalty and
thus the transformation makes sense. As far as I know, every CPU has
cases where store forwarding is handled efficiently.
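For example, something along these lines; the hook name and signature
are hypothetical, nothing like this exists in the current patch:

  /* Hypothetical target hook: return true if forwarding a store of
     STORE_MODE to a load of LOAD_MODE that is DISTANCE instructions
     later is expected to stall on this core.  A conservative default
     might return true whenever the load is wider than the store.  */
  bool
  TARGET_STORE_FORWARDING_STALLS_P (machine_mode store_mode,
                                    machine_mode load_mode,
                                    int distance);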

>
>
>
> > diff --git a/gcc/params.opt b/gcc/params.opt
> > index d34ef545bf0..b8115f5c27a 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
> >   Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
> >   Maximum size of a single store merging region in bytes.
> >
> > +-param=store-forwarding-max-distance=
> > +Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
> > +Maximum distance (in instructions) at which a small store forwarded to a larger load may cause a stall.
> I think you may need to run the update-urls script since you've added a
> new option.
>
Ok, thanks for the hint.

>
> In general it seems pretty reasonable.
>
> I've actually added it to my tester just to see if there's any fallout.
> It'll take a week to churn through the long running targets that
> bootstrap in QEMU, but the crosses should have data Monday.
>
Great, thanks!

Manolis

> jeff
>
