On 2020/9/12 01:36, Tamar Christina wrote:
> Hi Martin,
>
>>
>> can you please confirm that the difference between these two is all due to
>> the last option -fno-inline-functions-called-once ?  Is LTO necessary?
>> I.e., can you run the benchmark also built with the branch compiler and
>> -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
>>
>
> Done, see below.
>
>>> +----------+------------------------------------------------+------+
>>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
>>> +----------+------------------------------------------------+------+
>>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |
>>> +----------+------------------------------------------------+------+
>>>
>>> (Hopefully the table shows up correct)
>>
>> it does show OK for me, thanks.
>>
>>>
>>> It looks like your patch definitely does improve the basic cases. So
>>> there's not much difference between lto and non-lto anymore and it's
>>> much better than GCC 10. However it still contains the regression
>>> introduced by Honza's changes.
>>
>> I assume these are rates, not times, so negative means bad.  But do I
>> understand it correctly that you're comparing against GCC 10 with the two
>> parameters set to rather special values?  Because your table seems to
>> indicate that even for you, the branch is faster than GCC 10 with just
>> -mcpu=native -Ofast -fomit-frame-pointer.
>
> Yes these are indeed rates, and indeed I am comparing against the same options
> we used to get the fastest rates on before which is the two parameters and
> the inline flag.
>
>>
>> So is the problem that the best obtainable run-time, even with obscure
>> options, from the branch is slower than the best obtainable run-time from
>> GCC 10?
>>
>
> Yeah that's the problem, when we compare the two we're still behind.
>
> I've done the additional two runs
>
> +----------+---------------------------------------------------------------+-------------+
> | Compiler | Flags                                                         | diff GCC 10 |
> +----------+---------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                |             |
> |          | --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 |             |
> |          | -fno-inline-functions-called-once                             |             |
> +----------+---------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                      | -44%        |
> +----------+---------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -36%        |
> +----------+---------------------------------------------------------------+-------------+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -12%        |
> |          | --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 |             |
> |          | -fno-inline-functions-called-once                             |             |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -22%        |
> |          | --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 |             |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -12%        |
> |          | --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 |             |
> |          | -fno-inline-functions-called-once                             |             |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -24%        |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                      | -26%        |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                | -12%        |
> |          | -fno-inline-functions-called-once                             |             |
> +----------+---------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                      | -11%        |
> |          | -fno-inline-functions-called-once                             |             |
> +----------+---------------------------------------------------------------+-------------+
>
> And this confirms that indeed LTO isn't needed and that the branch
> without any options is indeed much better than it was on GCC 10 without any
> options.
>
> It also confirms that the only remaining difference is in the
> -fno-inline-functions-called-once

If -fno-inline-functions-called-once is added, the recursive function
digits_2 won't be inlined, because each digits_2 is specialized into clone
nodes that are each called only once. So getting the performance back is
expected; I guess it is somewhat similar to -fno-inline for this case.

@Jambor @Honza
Is there any progress on this (a --param controlling the maximal recursion
depth) and on the other regression, the one about LOOP_GUARD_WITH_PREDICTION
in PR96825 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96825), please? :)
I tested the current FSF master, and the regression still exists...

Thanks,
Xionghu
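
To picture the cloning effect described above, here is a minimal sketch in C. It is only an illustration under assumptions: `walk`, `MAX_DEPTH` and the `walk_depthN` functions are made-up names, not the benchmark's digits_2, and the clones are written by hand to show roughly what specializing a recursive function on a constant depth could leave behind, not dumped from GCC. The point is the call graph: after specialization each clone has exactly one caller, which is the case the called-once inlining heuristic looks at.

```c
#include <assert.h>

/* Hypothetical stand-in for a recursive routine: the depth is an explicit
   argument and the recursion stops once it exceeds MAX_DEPTH. */
enum { MAX_DEPTH = 3 };

static int walk(int depth, int acc)
{
    if (depth > MAX_DEPTH)
        return acc;
    /* Append the current depth as a decimal digit and recurse. */
    return walk(depth + 1, acc * 10 + depth);
}

/* Hand-written picture of per-depth specializations (an assumption about
   the shape of the result, not actual compiler output).  Each clone has
   the depth folded in as a constant and is called from exactly one place:
   the clone for the previous depth. */
static int walk_depth3(int acc) { return acc * 10 + 3; }
static int walk_depth2(int acc) { return walk_depth3(acc * 10 + 2); }
static int walk_depth1(int acc) { return walk_depth2(acc * 10 + 1); }

/* The chain of clones computes the same value as the generic recursion. */
void check_clones(void)
{
    assert(walk(1, 0) == walk_depth1(0));
}
```

With called-once inlining enabled, a chain like walk_depth1 -> walk_depth2 -> walk_depth3 is a natural candidate to be flattened into one large function; with -fno-inline-functions-called-once the clones would stay as separate calls, which is the behaviour the numbers above seem to favour for this benchmark.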