On Fri, 7 Jun 2024, Jeff Law wrote:

> On 6/6/24 4:10 AM, Manolis Tsamis wrote:
> > This pass detects cases of expensive store forwarding and tries to avoid
> > them by reordering the stores and using suitable bit insertion sequences.
> > For example it can transform this:
> > 
> >      strb    w2, [x1, 1]
> >      ldr     x0, [x1]    # Expensive store forwarding to larger load.
> > 
> > To:
> > 
> >      ldr     x0, [x1]
> >      strb    w2, [x1, 1]
> >      bfi     x0, x2, 8, 8
> > 
> > Assembly like this can appear with bitfields or type punning / unions.
> > On stress-ng, when running the cpu-union microbenchmark, the following
> > speedups have been observed:
> > 
> >      Neoverse-N1:      +29.4%
> >      Intel Coffeelake: +13.1%
> >      AMD 5950X:        +17.5%
> > 
> > gcc/ChangeLog:
> > 
> > 	* Makefile.in: Add avoid-store-forwarding.o.
> > 	* common.opt: New option -favoid-store-forwarding.
> > 	* params.opt: New param store-forwarding-max-distance.
> > 	* doc/invoke.texi: Document new pass.
> > 	* doc/passes.texi: Document new pass.
> > 	* passes.def: Schedule a new pass.
> > 	* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
> > 	* avoid-store-forwarding.cc: New file.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > 	* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
> > 	* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
> > 	* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
> > 	* gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
> So this is getting a lot more interesting.  I think the first time I looked
> at this it was more concerned with stores feeding something like a load-pair
> and avoiding the store forwarding penalty for that case.  Am I
> mis-remembering, or did it get significantly more general?
> 
> > 
> > +
> > +static unsigned int stats_sf_detected = 0;
> > +static unsigned int stats_sf_avoided = 0;
> > +
> > +static rtx
> > +get_load_mem (rtx expr)
> Needs a function comment.  You should probably mention that EXPR must be a
> single_set in that comment.
> 
> > 
> > +
> > +  rtx dest;
> > +  if (eliminate_load)
> > +    dest = gen_reg_rtx (load_inner_mode);
> > +  else
> > +    dest = SET_DEST (load);
> > +
> > +  int move_to_front = -1;
> > +  int total_cost = 0;
> > +
> > +  /* Check if we can emit bit insert instructions for all forwarded
> > +     stores.  */
> > +  FOR_EACH_VEC_ELT (stores, i, it)
> > +    {
> > +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
> > +      rtx_insn *insns = NULL;
> > +
> > +      /* If we're eliminating the load then find the store with zero
> > +         offset and use it as the base register to avoid a bit insert.  */
> > +      if (eliminate_load && it->offset == 0)
> How often is this triggering?  We have various bits of code in the gimple
> optimizers to detect a store followed by a load from the same address and do
> the forwarding.  If these cases are happening with any frequency, that would
> be a good sign the code in DOM and elsewhere isn't working well.
> 
> The way these passes detect this case is to take the store, flip the
> operands around (i.e., so it looks like a load) and enter that into the
> expression hash tables.  After that, standard redundancy elimination
> approaches will work.
> 
> > +	{
> > +	  start_sequence ();
> > +
> > +	  /* We can use a paradoxical subreg to force this to a wider mode,
> > +	     as the only use will be inserting the bits (i.e., we don't care
> > +	     about the value of the higher bits).  */
> Which may be a good hint about the cases you're capturing -- if the
> modes/sizes differ that would make more sense, since I don't think we're as
> likely to be capturing those cases.

Yeah, we handle stores from constants quite well, and FRE can forward stores
from SSA names to smaller loads by inserting BIT_FIELD_REFs, but it does have
some additional restrictions.
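To make the two directions concrete, here is a minimal sketch (my own, not
taken from the patch or from stress-ng; the function names are invented and a
little-endian LP64 target such as aarch64 is assumed):

    typedef union { unsigned long u64; unsigned char u8[8]; } pun;

    unsigned char
    narrow_load (pun *p, unsigned long v)
    {
      p->u64 = v;        /* Wide store from an SSA name...  */
      return p->u8[1];   /* ...narrow load: FRE can forward this by
                            extracting the byte with a BIT_FIELD_REF.  */
    }

    unsigned long
    wide_load (pun *p, unsigned char c)
    {
      p->u8[1] = c;      /* Narrow store...  */
      return p->u64;     /* ...wide load: it only partially overlaps the
                            store, so GIMPLE has nothing to forward; this
                            is exactly the strb/ldr case the new pass
                            rewrites with a bit insert (bfi).  */
    }

Only the second function should be a candidate for the new RTL pass; if the
first shape still shows up at the RTL level with any frequency, that would
point at the GIMPLE-level handling not doing its job, as Jeff notes.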
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 4e8967fd8ab..c769744d178 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -12657,6 +12657,15 @@ loop unrolling.
> >  This option is enabled by default at optimization levels @option{-O1},
> >  @option{-O2}, @option{-O3}, @option{-Os}.
> >  
> > +@opindex favoid-store-forwarding
> > +@item -favoid-store-forwarding
> > +@itemx -fno-avoid-store-forwarding
> > +Many CPUs will stall for many cycles when a load partially depends on
> > +previous smaller stores.  This pass tries to detect such cases and
> > +avoid the penalty by changing the order of the load and store and then
> > +fixing up the loaded value.
> > +
> > +Disabled by default.
> Is there any particular reason why this would be off by default at -O1 or
> higher?  It would seem to me that on modern cores this transformation
> should easily be a win.  Even on an old in-order core, avoiding the load
> with the bit insert is likely profitable, just not as much so.

I would think it's up to the targets to decide on a default.

> > diff --git a/gcc/params.opt b/gcc/params.opt
> > index d34ef545bf0..b8115f5c27a 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
> >  Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
> >  Maximum size of a single store merging region in bytes.
> >  
> > +-param=store-forwarding-max-distance=
> > +Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
> > +Maximum instruction distance over which a small store forwarded to a larger load may cause a stall.
> I think you may need to run the update-urls script since you've added a new
> option.
> 
> In general it seems pretty reasonable.
> 
> I've actually added it to my tester just to see if there's any fallout.
> It'll take a week to churn through the long-running targets that bootstrap
> in QEMU, but the crosses should have data Monday.
> 
> jeff
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)