> On Jan 8, 2019, at 11:53 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>
>>> In general this parameter affects primarily -O3 builds, because -O2
>>> hardly hits the limit. From -O3 only programs with very large units are
>>> affected (-O2 units hit the limit only if you have a lot of inline
>>> hints in the code).
>>
>> I don’t quite understand here: what’s the major difference for inlining
>> between -O3 and -O2?
>> (I see -finline-functions is enabled for both -O3 and -O2.)
>
> -O2 has -finline-small-functions, where we inline only when the function is
> declared inline or code size is expected to shrink after the inlining.
> -O3 has -finline-functions, where we auto-inline a lot more.
Looks like our current documentation has a bug here:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

  -finline-functions
    Consider all functions for inlining, even if they are not declared
    inline. The compiler heuristically decides which functions are worth
    integrating in this way. If all calls to a given function are
    integrated, and the function is declared static, then the function is
    normally not output as assembler code in its own right.
    Enabled at levels -O2, -O3, -Os. Also enabled by -fprofile-use and
    -fauto-profile.

It clearly says that -finline-functions is enabled at -O2, -O3 and -Os.
However, checking the GCC 9 source code, opts.c has:

  /* -O3 and -Os optimizations.  */
  /* Inlining of functions reducing size is a good idea with -Os
     regardless of them being declared inline.  */
  { OPT_LEVELS_3_PLUS_AND_SIZE, OPT_finline_functions, NULL, 1 },

so it looks like -finline-functions is ONLY enabled at -O3 and -Os, not at
-O2. (However, I am confused about why -finline-functions should be enabled
for -Os?)

>>> In my test bed this included Firefox with or without LTO, because they do
>>> "poor man's" LTO by #including multiple .cpp files into a single unified
>>> source, which makes average units large. Also tramp3d and DLV from our
>>> C++ benchmark are affected.
>>>
>>> I have some data on Firefox and I will build the remaining ones:
>>
>> In the following, are the data for code size? Is the optimization level
>> -O3? What does PGO mean?
>
> Those are sizes of libxul, which is the largest library of Firefox.
> PGO is profile guided optimization.

Okay, I see. Looks like for LTO, the code size increase with profiling is
much smaller than without profiling when growth is increased from 20% to
40%. For non-LTO, the code size increase is minimal when growth is
increased from 20% to 40%. However, I don't quite understand the last
column; could you please explain a little bit about the last column
(-finline-functions)?
>>> growth        LTO+PGO    PGO        LTO        none       -finline-functions
>>> 20 (default)  83752215   94390023   93085455   103437191  94351191
>>> 40            85299111   97220935   101600151  108910311  115311719
>>> clang         111520431  114863807  108437807
>>>
>>> Build times are within the noise of my setup, but they are less
>>> pronounced than the code size difference. I think at most 1 minute out
>>> of 100. Note that Firefox consists of 6% Rust code that is not built by
>>> GCC, and building that consumes over half of the build time.
>>>
>>> The problem I am trying to solve here is to get consistent LTO
>>> performance improvements compared to non-LTO. Currently there are
>>> some regressions:
>>> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6ba1ebfe913d152989495d8cb450bce02f27d44&newProject=try&newRevision=c7bd18804e328ed490eab707072b3cf59da91042&framework=1&showOnlyComparable=1&showOnlyImportant=1
>>> All those regressions go away with the limit increase.
>>>
>>> I tracked them down to the fact that we do not inline some very small
>>> functions already (such as IsHTMLWhitespace). In the GCC 5 timeframe I
>>> tuned this parameter to 20% based on Firefox LTO benchmarks, but I was
>>> not that serious about performance since my setup was not giving very
>>> reproducible results for sub-5% differences on tp5o. Since we plan to
>>> enable LTO by default for Tumbleweed, I need to find something that does
>>> not cause too many regressions while keeping the code size advantage
>>> over non-LTO.
>>
>> From my understanding, the performance regression of LTO relative to
>> non-LTO is caused by some small but important functions no longer being
>> inlined: because more functions are eligible to be inlined under LTO,
>> the original value of inline-unit-growth becomes relatively too small.
>
> Yes, with whole program optimization most function calls are inlinable,
> while in a normal non-LTO build most function calls are external.
> Since there are a lot of small functions called cross-module in modern
> C++ programs, it simply makes the inliner run out of the limit before
> getting to some of the useful inline decisions.
>
> I was poking at this for a while, but did not really have very good
> testcases available, making it difficult to judge the code
> size/performance tradeoffs here. With Firefox I can measure things better
> now, and it is clear that 20% growth is just too small. It is small even
> with profile feedback, where the compiler knows quite well which calls to
> inline, and more so without.

Yes, for C++, 20% might be too small, especially for cross-file inlining,
and C++ applications usually benefit more from inlining.

>> While increasing the value of inline-unit-growth for LTO is one approach
>> to resolve this issue, might adjusting the sorting heuristic to give
>> those important and smaller routines higher priority for inlining be
>> another, better approach?
>
> Yes, I have also reworked the inline metrics somewhat and spent quite
> some time looking into dumps to see that it behaves reasonably. There
> were two ages-old bugs I fixed in the last two weeks, and I also added
> some extra tricks like penalizing cross-module inlines some time ago.
> Given the fact that even with profile feedback I am not able to sort the
> priority queue well, and neither can Clang do the job, I think it is good
> motivation to adjust the parameter, which I had set somewhat arbitrarily
> at a time when I was not able to test it well.

Where is the code for your current heuristic for sorting the inlinable
candidates?

Thanks.

Qing

>
> Honza