> It looks like our current documentation has a bug in the following:
>
> https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
>
>   -finline-functions
>   Consider all functions for inlining, even if they are not declared
>   inline.  The compiler heuristically decides which functions are worth
>   integrating in this way.
>   If all calls to a given function are integrated, and the function is
>   declared static, then the function is normally not output as assembler
>   code in its own right.
>   Enabled at levels -O2, -O3, -Os.  Also enabled by -fprofile-use and
>   -fauto-profile.
>
> It clearly says that -finline-functions is enabled at -O2, -O3 and -Os.
>
> However, when I checked the GCC 9 source code, opts.c has:
>
>   /* -O3 and -Os optimizations.  */
>   /* Inlining of functions reducing size is a good idea with -Os
>      regardless of them being declared inline.  */
>   { OPT_LEVELS_3_PLUS_AND_SIZE, OPT_finline_functions, NULL, 1 },
>
> so it looks like -finline-functions is ONLY enabled at -O3 and -Os, not
> at -O2.  (However, I am confused about why -finline-functions should be
> enabled for -Os?)
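For reference, the setting that is actually in effect at a given optimization
level can be queried directly, and the behaviour the manual describes is easy
to observe on a small test case.  A minimal sketch follows; the file and
function names are invented for illustration, and whether anything is really
inlined is up to the heuristics:

  /* inline-check.c -- sketch for observing -finline-functions.
     File and function names here are invented for illustration.

     Query the effective setting at a given level:
       gcc -O2 -Q --help=optimizers | grep inline-functions
       gcc -O3 -Q --help=optimizers | grep inline-functions

     Compare the generated assembler with and without the flag:
       gcc -O2 -S inline-check.c
       gcc -O2 -finline-functions -S inline-check.c

     If all calls to the static helper are inlined, it need not appear in
     the .s output at all.  Note that very small functions may already be
     inlined at -O2 by -finline-small-functions, so the helper has to be
     big enough to fall outside that heuristic.  */

  static int helper (int a, int b)
  {
    int r = 0;
    for (int i = 0; i < b; i++)
      r += (a ^ i) * (i + 3);
    return r;
  }

  int caller (int x)
  {
    return helper (x, 10) + helper (x + 1, 20);
  }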
Yes, it is a documentation bug.  It seems Eric beat me to fixing it.

>
> >>
> >>> In my test bed this included Firefox with or without LTO, because they
> >>> do "poor man's" LTO by #including multiple .cpp files into a single
> >>> unified source, which makes the average units large.  Also tramp3d and
> >>> DLV from our C++ benchmark suite are affected.
> >>>
> >>> I have some data on Firefox and I will build the remaining ones:
> >> In the following, are the data code sizes?  Is the optimization level
> >> -O3?  What does PGO mean?
> >
> > Those are sizes of libxul, which is the largest library of Firefox.
> > PGO is profile-guided optimization.
>
> Okay, I see.
>
> It looks like for LTO, the code size increase with profiling is much
> smaller than without profiling when the growth is increased from 20% to
> 40%.

With LTO the growth is about 9%, while for non-LTO it is about 4% and with
PGO it is about 3%.  This is expected.  For non-LTO, most translation units
do not hit the limit because most of the calls are external.  Firefox is a
bit special here by using the #include-based unified build, which gets it
closer to LTO, but not quite.  With LTO there is only one translation unit,
and it hits the 20% code size growth limit, which after optimization
translates to that 9%.

With profile feedback the code is partitioned into cold and hot sections,
and only the hot section grows by the given percentage.  For Firefox about
15% of the binary is trained and the rest is cold.

>
> For non-LTO, the code size increase is minimal when the growth is
> increased from 20% to 40%.
>
> However, I do not quite understand the last column; could you please
> explain the last column (-finline-functions) a little bit?

It is a non-LTO build, but with additional -finline-functions.  The GCC
build machinery uses -O2 by default and -O3 for some files; adding
-finline-functions enables aggressive inlining everywhere.  But
double-checking the numbers, I must have cut&pasted the wrong data here.
For growth 20 with -finline-functions, non-LTO, non-PGO, I get 107272791
(so the table is wrong), and increasing the growth to 40 gets me 115311719
(which is correct in the table).

>
> >>>
> >>> growth        LTO+PGO    PGO        LTO         none        -finline-functions
> >>> 20 (default)  83752215   94390023   93085455    103437191   94351191
> >>> 40            85299111   97220935   101600151   108910311   115311719
> >>> clang         111520431  114863807  108437807

It should be:

  growth        LTO+PGO    PGO        LTO         none        -finline-functions
  20 (default)  83752215   94390023   93085455    103437191   107272791
  40            85299111   97220935   101600151   108910311   115311719
  clang         111520431  114863807  108437807

So 7.5% growth.

> > I was poking at this for a while, but did not really have very good
> > testcases available, which made it difficult to judge the code
> > size/performance tradeoffs here.  With Firefox I can measure things
> > better now, and it is clear that 20% growth is just too small.  It is
> > small even with profile feedback, where the compiler knows quite well
> > which calls to inline, and more so without.
>
> Yes, for C++, 20% might be too small, especially for cross-file inlining,
> and C++ applications usually benefit more from inlining.

Yep, that is my conclusion too.
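The growth limit under discussion is the inline-unit-growth parameter, which
can be overridden per invocation for this kind of experiment.  A rough
sketch; the file name and code below are invented, and the libxul numbers
above of course come from full Firefox builds, not from anything this small:

  /* growth-param.c -- rough sketch of varying the unit-growth limit on a
     single non-LTO translation unit; everything here is invented for
     illustration.

       gcc -O2 -finline-functions --param inline-unit-growth=20 -c growth-param.c
       gcc -O2 -finline-functions --param inline-unit-growth=40 -c growth-param.c
       size growth-param.o   # compare the text size of the two objects

     A real unit has to be large enough to actually hit the growth limit
     before the two objects differ.  */

  #define DEFINE_WORKER(name)                  \
    static long name (long x)                  \
    {                                          \
      long r = 0;                              \
      for (long i = 0; i < x; i++)             \
        r += (x ^ i) * (i + 3);                \
      return r;                                \
    }

  DEFINE_WORKER (worker_a)
  DEFINE_WORKER (worker_b)
  DEFINE_WORKER (worker_c)

  long driver (long x)
  {
    return worker_a (x) + worker_b (x + 1) + worker_c (x + 2)
           + worker_a (x + 3) + worker_b (x + 4) + worker_c (x + 5);
  }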
>
> >>
> >> While increasing the value of inline-unit-growth for LTO is one approach
> >> to resolving this issue, might adjusting the sorting heuristic, so that
> >> the important and smaller routines get higher priority for inlining, be
> >> another and better approach?
> >
> > Yes, I have also reworked the inline metrics somewhat and spent quite
> > some time looking into dumps to see that it behaves reasonably.  There
> > were two age-old bugs I fixed in the last two weeks, and I also added
> > some extra tricks like penalizing cross-module inlines some time ago.
> > Given that even with profile feedback I am not able to sort the priority
> > queue well, and neither can Clang do the job, I think it is good
> > motivation to adjust the parameter, which I had set somewhat arbitrarily
> > at a time when I was not able to test it well.
>
> Where is the code for your current heuristic for sorting the inlinable
> candidates?

It is in ipa-inline.c:edge_badness.

If you use -fdump-ipa-inline-details you can search for "Considering" in the
dump file to find a record of every inline decision.  It dumps the badness
value and also the individual values used to compute it.

Honza

>
> Thanks.
>
> Qing
> >
> > Honza
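A minimal sketch of producing the dump described above on a toy unit; the
source file and function names are made up, and the exact dump file name
depends on the pass numbering of the GCC release used:

  /* badness-dump.c -- minimal sketch for producing the inliner dump
     mentioned above; file and function names are made up.

       gcc -O3 -fdump-ipa-inline-details -c badness-dump.c

     This writes a dump file next to the object (its name contains
     "inline"; the exact pass number varies between GCC versions).
     Searching it for "Considering" shows one record per inline decision,
     including the badness value computed in ipa-inline.c:edge_badness and
     the quantities that went into it.  */

  static int leaf (int x)
  {
    return x * x + 7;
  }

  static int middle (int x)
  {
    return leaf (x) + leaf (x + 1);
  }

  int top (int x)
  {
    return middle (x) + middle (2 * x);
  }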