On Mon, Jun 10, 2024 at 9:27 PM Philipp Tomsich <philipp.toms...@vrull.eu> wrote:
>
> On Mon, 10 Jun 2024 at 20:03, Jeff Law <jeffreya...@gmail.com> wrote:
> >
> >
> > On 6/10/24 1:55 AM, Manolis Tsamis wrote:
> >
> > >>
> > > There was an older submission of a load-pair specific pass but this is
> > > a complete reimplementation and indeed significantly more general.
> > > Apart from being target independent, it addresses a number of
> > > important restrictions and can handle multiple store forwardings per
> > > load.
> > > It should be noted that it cannot handle the load-pair cases as these
> > > need special handling, but that's something we're planning to do in
> > > the future by reusing this infrastructure.
> > ACK.  Thanks for the additional background.
> >
> >
> > >>
> > >>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > >>> index 4e8967fd8ab..c769744d178 100644
> > >>> --- a/gcc/doc/invoke.texi
> > >>> +++ b/gcc/doc/invoke.texi
> > >>> @@ -12657,6 +12657,15 @@ loop unrolling.
> > >>>  This option is enabled by default at optimization levels @option{-O1},
> > >>>  @option{-O2}, @option{-O3}, @option{-Os}.
> > >>>
> > >>> +@opindex favoid-store-forwarding
> > >>> +@item -favoid-store-forwarding
> > >>> +@itemx -fno-avoid-store-forwarding
> > >>> +Many CPUs will stall for many cycles when a load partially depends on previous
> > >>> +smaller stores.  This pass tries to detect such cases and avoid the penalty by
> > >>> +changing the order of the load and store and then fixing up the loaded value.
> > >>> +
> > >>> +Disabled by default.
> > >> Is there any particular reason why this would be off by default at -O1
> > >> or higher?  It would seem to me that on modern cores that this
> > >> transformation should easily be a win.  Even on an old in-order core,
> > >> avoiding the load with the bit insert is likely profitable, just not as
> > >> much so.
> > >>
> > > I don't have a strong opinion for that but I believe Richard's
> > > suggestion to decide this on a per-target basis also makes a lot of
> > > sense.
> > > Deciding whether the transformation is profitable is tightly tied to
> > > the architecture in question (i.e. how large the stall is and what
> > > sort of bit-insert instructions are available).
> > > In order to make this more widely applicable, I think we'll need a
> > > target hook that decides in which case the forwarded stores incur a
> > > penalty and thus the transformation makes sense.
> > You and Richi are probably right.  I'm not a big fan of passes being
> > enabled/disabled on particular targets, but it may make sense here.
> >
> >
> > > Afaik, for each CPU there may be cases that store forwarding is
> > > handled efficiently.
> > Absolutely.  But forwarding from a smaller store to a wider load is
> > painful from a hardware standpoint and if we can avoid it from a codegen
> > standpoint, we should.
>
> This change is what I briefly hinted as "the complete solution" that
> we had on the drawing board when we briefly talked last November in
> Santa Clara.
>
> > Did y'all look at spec2017 at all for this patch?  I've got our hardware
> > guys to expose a signal for this case so that we can (in a month or so)
> > get some hard data on how often it's happening in spec2017 and evaluate
> > how this patch helps the most affected workloads.  But if y'all already
> > have some data we can use it as a starting point.
>
> We have looked at all of SPEC2017, especially for coverage (i.e.,
> making sure we see a significant number of uses of the transformation)
> and correctness.  The gcc_r and parest_r components triggered in a
> number of "interesting" ways (e.g., motivating the case of
> load-elimination).  If it helps, we could share the statistics for how
> often the pass triggers on compiling each of the SPEC2017 components?
>

Below is a table with 1) the number of successful store-forwarding-avoidance
transformations and 2) the number of these where the load was also
eliminated.  This is SPEC2017 intrate and fprate; the omitted benchmarks in
each case have zero transformations.
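As an aside, to clarify what "load eliminated" means in the second column:
when the smaller stores cover every byte of the subsequent load, the loaded
value can be rebuilt from the stored operands and the load dropped entirely.
A rough C-level sketch of the idea (hypothetical example, assumes
little-endian LP64 and ignores strict aliasing; the pass itself works on
RTL, not source):

struct pair { int lo; int hi; };

long before (struct pair *p, int a, int b)
{
  p->lo = a;           /* 4-byte store                        */
  p->hi = b;           /* 4-byte store                        */
  return *(long *) p;  /* 8-byte load, fully covered by the
                          two stores: store-forwarding stall  */
}

long after (struct pair *p, int a, int b)
{
  p->lo = a;
  p->hi = b;
  /* Load eliminated: the value is rebuilt from a and b directly.  */
  return (long) (unsigned int) a | ((long) b << 32);
}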
Following Richard's comment I did two runs: one with the pass placed just
after cse1 (this patch) and one with it placed after sched1 (Richard's
suggestion).  I see an increased number of transformations after sched1,
and in some testcases we also get improved code generation, so it looks
promising.  The original motivation for placing the pass early was to let
subsequent passes optimize the transformed code, but that does not look
like a major issue and the load/store placement is more important.  I plan
to follow up and provide some rough performance metrics and example
assembly code of the transformation for these benchmarks.

I also tried increasing the maximum distance from 10 to 15 and saw only a
small increase in the transformation count.  From what I've seen, the
cases that we care about are usually close together.

The data (benchmark, # transformed, # load eliminated):

After cse1:
  500:  32   1
  502:  89   9
  510: 517 409
  511:  12   1
  520:   1   0
  521:   4   1
  525:  51   1
  526:  46  11
  557:   1   1

After sched1:
  500:  45  13
  502: 172  33
  507:  23  23
  510: 544 427
  511:  41  29
  520: 119  26
  521:  19  11
  523:   4   4
  525: 129  44
  526: 120  70
  531:   1   1
  538:  30  18
  541:   2   2
  557:   4   4

Note: ran on AArch64 with -O3 -mcpu=neoverse-n1 -flto=32.

Thanks,
Manolis

> Philipp.
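P.S. Until the assembly examples are ready, here is a rough C-level sketch
of what the reorder-plus-fixup described in the invoke.texi hunk above does
in the partial-overlap case (hypothetical example, assumes little-endian
layout and ignores strict aliasing; the pass itself works on RTL, and the
fixup would typically be a single bit-insert such as AArch64's BFI):

long before (long *p, short v)
{
  *(short *) p = v;  /* 2-byte store                          */
  return *p;         /* 8-byte load partially depends on the
                        store above: store-forwarding stall   */
}

long after (long *p, short v)
{
  long x = *p;       /* load moved before the store           */
  *(short *) p = v;  /* store now happens after the load      */
  /* Fix up the loaded value by inserting the stored bits
     (one bit-insert instruction on many targets).            */
  x = (x & ~(long) 0xffff) | (unsigned short) v;
  return x;
}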