> On Jan 8, 2019, at 11:53 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> 
>>> 
>>> In general this parameter affects primarily -O3 builds, because -O2
>>> hardly hits the limit. At -O3 only programs with very large units are
>>> affected (-O2 units hit the limit only if you have a lot of inline
>>> hints in the code).
>> I don’t quite understand here: what’s the major difference for inlining
>> between -O3 and -O2?
>> (I see -finline-functions is enabled for both -O3 and -O2.)
> 
> -O2 has -finline-small-functions, where we inline only when a function is
> declared inline or code size is expected to shrink after the inlining.
> -O3 has -finline-functions, where we auto-inline a lot more.

It looks like our current documentation has a bug below:

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html 

-finline-functions
Consider all functions for inlining, even if they are not declared inline. The 
compiler heuristically decides which functions are worth integrating in this 
way.
If all calls to a given function are integrated, and the function is declared 
static, then the function is normally not output as assembler code in its own 
right.
Enabled at levels -O2, -O3, -Os. Also enabled by -fprofile-use and 
-fauto-profile.

It clearly states that -finline-functions is enabled at -O2, -O3, and -Os.

I checked the GCC 9 source code, opts.c:

    /* -O3 and -Os optimizations.  */
    /* Inlining of functions reducing size is a good idea with -Os
       regardless of them being declared inline.  */
    { OPT_LEVELS_3_PLUS_AND_SIZE, OPT_finline_functions, NULL, 1 },

It looks like -finline-functions is ONLY enabled at -O3 and -Os, not at -O2.
(However, I am confused about why -finline-functions should be enabled for -Os.)

>> 
>>> 
>>> In my test bed this included Firefox with or without LTO, because they do
>>> "poor man's" LTO by #including multiple .cpp files into a single unified
>>> source, which makes average units large.  Also tramp3d and DLV from our C++
>>> benchmark are affected. 
>>> 
>>> I have some data on Firefox and I will build the remaining ones:
>> in the following, are the data for code size? are the optimization level O3?
>> what’s PGO mean?  
> 
> Those are sizes of libxul, which is the largest library of Firefox.
> PGO is profile guided optimization.

Okay.  I see. 

It looks like for LTO, the code size increase with profiling is much smaller
than that without profiling when growth is increased from 20% to 40%.

For non-LTO, the code size increase is minimal when growth is increased from
20% to 40%.

However, I don't quite understand the last column (-finline-functions); could
you please explain it a little?

>>> 
>>> growth         LTO+PGO    PGO        LTO        none       -finline-functions
>>> 20 (default)   83752215   94390023   93085455   103437191  94351191
>>> 40             85299111   97220935   101600151  108910311  115311719
>>> clang          111520431             114863807  108437807
>>> 
>>> Build times are within noise of my setup, but they are less pronounced
>>> than the code size difference. I think at most 1 minute out of 100.
>>> Note that Firefox consists of 6% Rust code that is not built by GCC,
>>> and building that consumes over half of the build time.
>>> 
>>> The problem I am trying to solve here is to get consistent LTO
>>> performance improvements compared to non-LTO. Currently there are
>>> some regressions:
>>> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6ba1ebfe913d152989495d8cb450bce02f27d44&newProject=try&newRevision=c7bd18804e328ed490eab707072b3cf59da91042&framework=1&showOnlyComparable=1&showOnlyImportant=1
>>> All those regressions go away with the limit increase.
>> 
>> 
>>> I tracked them down to the fact that we do not inline some very small
>>> functions already (such as IsHTMLWhitespace).  In the GCC 5 timeframe I
>>> tuned this parameter to 20% based on Firefox LTO benchmarks, but I was
>>> not that serious about performance since my setup was not giving very
>>> reproducible results for sub-5% differences on tp5o. Since we plan to
>>> enable LTO by default for Tumbleweed, I need to find something that does
>>> not cause too many regressions while keeping the code size advantage
>>> over non-LTO.
>> 
>> From my understanding, the performance regression of LTO relative to
>> non-LTO is caused by some small but important functions that can no
>> longer be inlined with LTO, because more functions are eligible for
>> inlining under LTO and the original value of inline-unit-growth
>> therefore becomes relatively smaller.
> 
> Yes, with whole-program optimization most function calls are inlinable,
> while in a normal non-LTO build most function calls are external.
> Since there are a lot of small functions called cross-module in modern
> C++ programs, the inliner simply runs out of its limits before
> making some of the useful inline decisions.
> 
> I was poking at this for a while, but did not really have very good
> testcases available, making it difficult to judge code size/performance
> tradeoffs here.  With Firefox I can measure things better now, and
> it is clear that 20% growth is just too small. It is small even with
> profile feedback, where the compiler knows quite well which calls to
> inline, and more so without.

Yes, for C++, 20% might be too small, especially for cross-module inlining,
and C++ applications usually benefit more from inlining. 

>> 
>> While increasing the value of inline-unit-growth for LTO is one approach
>> to resolving this issue, might adjusting the sorting heuristic to give
>> those small but important routines higher priority for inlining be
>> another, better approach?
> 
> Yes, I have also reworked the inline metrics somewhat and spent quite
> some time looking into dumps to see that it behaves reasonably.  There
> were two ages-old bugs I fixed in the last two weeks, and I also added
> some extra tricks like penalizing cross-module inlines some time ago.
> Given the fact that even with profile feedback I am not able to sort the
> priority queue well, and neither can Clang do the job, I think it is good
> motivation to adjust the parameter, which I set somewhat arbitrarily
> at a time when I was not able to test it well.

Where is the code for your current heuristic for sorting the inlinable
candidates?

Thanks.

Qing
> 
> Honza
