On Sat, Jun 8, 2024 at 1:31 AM Jeff Law <jeffreya...@gmail.com> wrote:
>
>
>
> On 6/6/24 4:10 AM, Manolis Tsamis wrote:
> > This pass detects cases of expensive store forwarding and tries to avoid them
> > by reordering the stores and using suitable bit insertion sequences.
> > For example it can transform this:
> >
> >      strb    w2, [x1, 1]
> >      ldr     x0, [x1]    # Expensive store forwarding to larger load.
> >
> > To:
> >
> >      ldr     x0, [x1]
> >      strb    w2, [x1, 1]
> >      bfi     x0, x2, 8, 8
> >
> > Assembly like this can appear with bitfields or type punning / unions.
> > On stress-ng when running the cpu-union microbenchmark the following speedups
> > have been observed.
> >
> >      Neoverse-N1:      +29.4%
> >      Intel Coffeelake: +13.1%
> >      AMD 5950X:        +17.5%
> >
> > gcc/ChangeLog:
> >
> >      * Makefile.in: Add avoid-store-forwarding.o.
> >      * common.opt: New option -favoid-store-forwarding.
> >      * params.opt: New param store-forwarding-max-distance.
> >      * doc/invoke.texi: Document new pass.
> >      * doc/passes.texi: Document new pass.
> >      * passes.def: Schedule a new pass.
> >      * tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
> >      * avoid-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> >      * gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
> >      * gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
> >      * gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
> >      * gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
> So this is getting a lot more interesting.  I think the first time I
> looked at this it was more concerned with stores feeding something like
> a load-pair and avoiding the store forwarding penalty for that case.  Am
> I mis-remembering, or did it get significantly more general?
>
There was an older submission of a load-pair specific pass, but this is
a complete reimplementation and indeed significantly more general. Apart
from being target independent, it addresses a number of important
restrictions and can handle multiple store forwardings per load.

It should be noted that it cannot handle the load-pair cases as these
need special handling, but that's something we're planning to do in the
future by reusing this infrastructure.
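
To illustrate the union / type punning case the cover letter mentions,
here is a minimal, hypothetical C sketch (not one of the new aarch64
tests; whether a given core actually stalls on it depends on its
store-forwarding implementation):

    #include <stdint.h>

    union u
    {
      uint64_t value;
      uint8_t bytes[8];
    };

    uint64_t
    replace_second_byte (union u *v, uint8_t b)
    {
      /* Narrow store (e.g. strb) ...  */
      v->bytes[1] = b;
      /* ... followed by a wider load (e.g. ldr) of the same bytes, so the
         load partially depends on the smaller store and many cores stall
         while forwarding it.  */
      return v->value;
    }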
>
>
> >
> > +
> > +static unsigned int stats_sf_detected = 0;
> > +static unsigned int stats_sf_avoided = 0;
> > +
> > +static rtx
> > +get_load_mem (rtx expr)
> Needs a function comment.  You should probably mention that EXPR must be
> a single_set in that comment.
>
Ok, will do.

> >
> > +
> > +  rtx dest;
> > +  if (eliminate_load)
> > +    dest = gen_reg_rtx (load_inner_mode);
> > +  else
> > +    dest = SET_DEST (load);
> > +
> > +  int move_to_front = -1;
> > +  int total_cost = 0;
> > +
> > +  /* Check if we can emit bit insert instructions for all forwarded stores.  */
> > +  FOR_EACH_VEC_ELT (stores, i, it)
> > +    {
> > +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
> > +      rtx_insn *insns = NULL;
> > +
> > +      /* If we're eliminating the load then find the store with zero offset
> > +         and use it as the base register to avoid a bit insert.  */
> > +      if (eliminate_load && it->offset == 0)
> So how often is this triggering?  We have various code in the gimple
> optimizers to detect a store followed by a load from the same address and
> do the forwarding.  If these are happening with any frequency, that would
> be a good sign the code in DOM and elsewhere isn't working well.
>
> The way these passes detect this case is to take the store, flip the
> operands around (i.e., it looks like a load) and enter that into the
> expression hash tables.  After that standard redundancy elimination
> approaches will work.
>
This condition doesn't handle that case (a single store forwarding to a
same-sized load) but rather any number of smaller (non-overlapping)
stores that cover all the bytes of the load, in which case the load can
be eliminated. As the comment at the top mentions:

  If the stores cover all the bytes of the load without overlap then we
  can eliminate the load entirely and use the computed value instead.

The specific condition (eliminate_load && it->offset == 0) is there just
to find the store that doesn't need an offset relative to the load and
use it as a "base" for the rest of the bit-insert sequences. Originally
we didn't special-case this and used an uninitialized base register and
then N bit-insert sequences, but the result was usually suboptimal
(e.g. R = zero, R = bit-insert X1, R = bit-insert X2). So if we know
that the load will be eliminated we can choose one of the stored values
as the base instead and then do N - 1 subsequent bit-inserts
(i.e. R = X1, R = bit-insert X2). A small C sketch of this case is
included further below.

> >
> > +     {
> > +       start_sequence ();
> > +
> > +       /* We can use a paradoxical subreg to force this to a wider mode, as
> > +          the only use will be inserting the bits (i.e., we don't care about
> > +          the value of the higher bits).  */
> Which may be a good hint about the cases you're capturing -- if the
> modes/sizes differ that would make more sense since I don't think we're
> as likely to be capturing those cases.
>
Right, that's the case.
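
Here is the promised sketch of the load-elimination case, again a
hypothetical example rather than one of the new tests: two
non-overlapping 32-bit stores cover all eight bytes of the following
64-bit load, so the load can be dropped and the value rebuilt with one
bit-insert. Whether earlier GIMPLE passes (e.g. store merging) leave
exactly this pattern for the RTL pass depends on the rest of the
pipeline, so treat it only as an illustration of the shape of code
involved:

    #include <stdint.h>

    union u
    {
      uint64_t value;
      uint32_t halves[2];
    };

    uint64_t
    pack (union u *v, uint32_t lo, uint32_t hi)
    {
      /* Store with zero offset relative to the load: can serve as the
         "base" value (R = X1).  */
      v->halves[0] = lo;
      /* Covers the remaining bytes: becomes the single bit-insert
         (R = bit-insert X2).  */
      v->halves[1] = hi;
      /* All bytes of this load are covered without overlap, so the load
         itself can be eliminated.  */
      return v->value;
    }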
>
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 4e8967fd8ab..c769744d178 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -12657,6 +12657,15 @@ loop unrolling.
> >   This option is enabled by default at optimization levels @option{-O1},
> >   @option{-O2}, @option{-O3}, @option{-Os}.
> >
> > +@opindex favoid-store-forwarding
> > +@item -favoid-store-forwarding
> > +@itemx -fno-avoid-store-forwarding
> > +Many CPUs will stall for many cycles when a load partially depends on previous
> > +smaller stores.  This pass tries to detect such cases and avoid the penalty by
> > +changing the order of the load and store and then fixing up the loaded value.
> > +
> > +Disabled by default.
> Is there any particular reason why this would be off by default at -O1
> or higher?  It would seem to me that on modern cores this transformation
> should easily be a win.  Even on an old in-order core, avoiding the load
> with the bit insert is likely profitable, just not as much so.
>
I don't have a strong opinion on that, but I believe Richard's
suggestion to decide this on a per-target basis also makes a lot of
sense. Whether the transformation is profitable is tightly tied to the
architecture in question (i.e. how large the stall is and what sort of
bit-insert instructions are available).

In order to make this more widely applicable, I think we'll need a
target hook that decides in which cases the forwarded stores incur a
penalty and thus the transformation makes sense. AFAIK, for each CPU
there may be cases where store forwarding is handled efficiently.

> >
> > diff --git a/gcc/params.opt b/gcc/params.opt
> > index d34ef545bf0..b8115f5c27a 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
> >   Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
> >   Maximum size of a single store merging region in bytes.
> >
> > +-param=store-forwarding-max-distance=
> > +Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
> > +Maximum number of instruction distance that a small store forwarded to a larger load may stall.
> I think you may need to run the update-urls script since you've added a
> new option.
>
Ok, thanks for the hint.

> In general it seems pretty reasonable.
>
> I've actually added it to my tester just to see if there's any fallout.
> It'll take a week to churn through the long running targets that
> bootstrap in QEMU, but the crosses should have data Monday.
>
Great, thanks!

Manolis

> jeff
>
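
P.S. For anyone who wants to try this revision locally: the pass is off
by default, so it has to be enabled explicitly. Using the option and
param names as they appear in this version of the patch (and a
hypothetical input file test.c), something along the lines of

    gcc -O2 -favoid-store-forwarding \
        --param=store-forwarding-max-distance=10 test.c

should exercise it; the --param is only needed to experiment with
distances other than the default of 10.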