On Mon, 10 Jun 2024 at 20:03, Jeff Law <jeffreya...@gmail.com> wrote: > > > > On 6/10/24 1:55 AM, Manolis Tsamis wrote: > > >> > > There was an older submission of a load-pair specific pass but this is > > a complete reimplementation and indeed significantly more general. > > Apart from being target independant, it addresses a number of > > important restrictions and can handle multiple store forwardings per > > load. > > It should be noted that it cannot handle the load-pair cases as these > > need special handling, but that's something we're planning to do in > > the future by reusing this infrastructure. > ACK. Thanks for the additional background. > > > > > >> > >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > >>> index 4e8967fd8ab..c769744d178 100644 > >>> --- a/gcc/doc/invoke.texi > >>> +++ b/gcc/doc/invoke.texi > >>> @@ -12657,6 +12657,15 @@ loop unrolling. > >>> This option is enabled by default at optimization levels @option{-O1}, > >>> @option{-O2}, @option{-O3}, @option{-Os}. > >>> > >>> +@opindex favoid-store-forwarding > >>> +@item -favoid-store-forwarding > >>> +@itemx -fno-avoid-store-forwarding > >>> +Many CPUs will stall for many cycles when a load partially depends on > >>> previous > >>> +smaller stores. This pass tries to detect such cases and avoid the > >>> penalty by > >>> +changing the order of the load and store and then fixing up the loaded > >>> value. > >>> + > >>> +Disabled by default. > >> Is there any particular reason why this would be off by default at -O1 > >> or higher? It would seem to me that on modern cores that this > >> transformation should easily be a win. Even on an old in-order core, > >> avoiding the load with the bit insert is likely profitable, just not as > >> much so. > >> > > I don't have a strong opinion for that but I believe Richard's > > suggestion to decide this on a per-target basis also makes a lot of > > sense. > > Deciding whether the transformation is profitable is tightly tied to > > the architecture in question (i.e. how large the stall is and what > > sort of bit-insert instructions are available). > > In order to make this more widely applicable, I think we'll need a > > target hook that decides in which case the forwarded stores incur a > > penalty and thus the transformation makes sense. > You and Richi are probably right. I'm not a big fan of passes being > enabled/disabled on particular targets, but it may make sense here. > > > > > Afaik, for each CPU there may be cases that store forwarding is > > handled efficiently. > Absolutely. But forwarding from a smaller store to a wider load is > painful from a hardware standpoint and if we can avoid it from a codegen > standpoint, we should.
This change is what I briefly hinted as "the complete solution" that we had on the drawing board when we briefly talked last November in Santa Clara. > Did y'all look at spec2017 at all for this patch? I've got our hardware > guys to expose a signal for this case so that we can (in a month or so) > get some hard data on how often it's happening in spec2017 and evaluate > how this patch helps the most affected workloads. But if y'all already > have some data we can use it as a starting point. We have looked at all of SPEC2017, especially for coverage (i.e., making sure we see a significant number of uses of the transformation) and correctness. The gcc_r and parest_r components triggered in a number of "interesting" ways (e.g., motivating the case of load-elimination). If it helps, we could share the statistics for how often the pass triggers on compiling each of the SPEC2017 components? Philipp.