Tamar Christina <tamar.christ...@arm.com> wrote on Sat, Sep 12, 2020 at 1:39 AM:
> Hi Martin,
>
> > can you please confirm that the difference between these two is all due to
> > the last option -fno-inline-functions-called-once ? Is LTO necessary? I.e., can
> > you run the benchmark also built with the branch compiler and -mcpu=native
> > -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
>
> Done, see below.
>
> > > +--------+------------------------------------------------+------+
> > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
> > > +--------+------------------------------------------------+------+
> > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |
> > > +--------+------------------------------------------------+------+
> > >
> > > (Hopefully the table shows up correct)
> >
> > it does show OK for me, thanks.
> >
> > > It looks like your patch definitely does improve the basic cases. So
> > > there's not much difference between lto and non-lto anymore and it's
> > > much better than GCC 10. However it still contains the regression
> > > introduced by Honza's changes.
> >
> > I assume these are rates, not times, so negative means bad. But do I
> > understand it correctly that you're comparing against GCC 10 with the two
> > parameters set to rather special values? Because your table seems to
> > indicate that even for you, the branch is faster than GCC 10 with just
> > -mcpu=native -Ofast -fomit-frame-pointer.
>
> Yes these are indeed rates, and indeed I am comparing against the same options
> we used to get the fastest rates on before which is the two parameters and
> the inline flag.
>
> > So is the problem that the best obtainable run-time, even with obscure
> > options, from the branch is slower than the best obtainable run-time from
> > GCC 10?
>
> Yeah that's the problem, when we compare the two we're still behind.
>
> I've done the additional two runs
>
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | Compiler | Flags                                                                                                                                          | diff GCC 10 |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once |             |
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                       | -44%        |
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                 | -36%        |
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%        |
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80                                   | -22%        |
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%        |
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                 | -24%        |
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                       | -26%        |
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once                                                               | -12%        |
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once                                                                     | -11%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
>
> And this confirms that indeed LTO isn't needed and that the branch
> without any options is indeed much better than it was on GCC 10 without
> any options.
>
> It also confirms that the only remaining difference is in the
> -fno-inline-functions-called-once
>
> > >> > And I tried 3 runs
> > >> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
> > >> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> > >> > -fno-inline-functions-called-once
> > >>
> > >> This is the first time I saw -fno-inline-functions-called-once used
> > >> in this context. This seems to indicate we are looking at another
> > >> problem that at least I have not known about yet. Can you please
> > >> upload somewhere the inlining WPA dumps with and without the option?
> > >
> > > We used it to cover up for the register allocation issue where
> > > inlining some large functions would cause massive spilling. Looks like
> > > it still has an effect now but even with it we're still seeing the 12%
> > > regression.
> > >
> > > Which option is this? -fdump-ipa-cgraph?
> >
> > -fdump-ipa-inline-details and -fdump-ipa-cp-details.
>
> I've kicked off the CI runs and will get you the dumps on Monday.
>
> Cheers,
> Tamar
>
> > It would be nice if the slowdown was all due to the inliner. But the
> > predictors changes might of course have quite an impact also on other
> > optimizations.
> >
> > Martin

Hi Martin,

Thanks for your work.
In case you are interested, here is the exchange2 result for your branch on our Cascadelake server (based on Tamar's test and our regular configuration):

| Compiler | Flags | single-core diff GCC10 | multi-core diff GCC10 |
|----------|-------|------------------------|-----------------------|
| GCC10.1 | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | - | - |
| GCC10.1 | -march=native -Ofast -funroll-loops | -32% | -37% |
| GCC10.1 | -march=native -Ofast -funroll-loops -flto | -32% | -37% |
| GCC11 | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20% | -13% |
| Branch | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -39% | -28% |
| Branch | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20% | -13% |
| Branch | -march=native -Ofast -funroll-loops -flto | -39% | -28% |
| Branch | -march=native -Ofast -funroll-loops | -41% | -29% |
| Branch | -march=native -Ofast -funroll-loops -flto -fno-inline-functions-called-once | -19% | -13% |
| Branch | -march=native -Ofast -funroll-loops -fno-inline-functions-called-once | -20% | -13% |

For multi-core tests, the branch can provide better performance without the extra ipa options, but there is still a 12% regression compared with GCC10's best score. Also, for single-core, there is about a 7% gap between the branch and GCC10.1.

Regards,
Hongyu Wang