> > In general this parameter affects primarily -O3 builds, because -O2
> > hardly hits the limit.  From -O3 only programs with very large units are
> > affected (-O2 units hit the limit only if you do have a lot of inline
> > hints in the code).
>
> I don't quite understand here: what's the major difference for inlining
> between -O3 and -O2?
> (I see -finline-functions is enabled for both O3 and O2.)
-O2 has -finline-small-functions, where we inline only when the function is
declared inline or code size is expected to shrink after the inlining.
-O3 has -finline-functions, where we auto-inline a lot more.

> > In my test bed this included Firefox with or without LTO, because they do
> > "poor man's" LTO by #including multiple .cpp files into a single unified
> > source, which makes average units large.  Also tramp3d and DLV from our
> > C++ benchmarks are affected.
> >
> > I have some data on Firefox and I will build the remaining ones:
>
> In the following, are the data for code size?  Is the optimization level O3?
> What does PGO mean?

Those are sizes of libxul, which is the largest library of Firefox.  PGO is
profile guided optimization.

> > growth        LTO+PGO    PGO        LTO        none
> >               -finline-functions
> > 20 (default)  83752215   94390023   93085455   103437191  94351191
> > 40            85299111   97220935   101600151  108910311  115311719
> > clang         111520431  114863807  108437807
> >
> > Build times are within the noise of my setup, but the differences are
> > less pronounced than the code size difference.  I think at most 1 minute
> > out of 100.  Note that Firefox consists of 6% Rust code that is not built
> > by GCC, and building that consumes over half of the build time.
> >
> > The problem I am trying to solve here is to get consistent LTO
> > performance improvements compared to non-LTO.  Currently there are
> > some regressions:
> > https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6ba1ebfe913d152989495d8cb450bce02f27d44&newProject=try&newRevision=c7bd18804e328ed490eab707072b3cf59da91042&framework=1&showOnlyComparable=1&showOnlyImportant=1
> > All those regressions go away with the limit increase.
> >
> > I tracked them down to the fact that we already fail to inline some very
> > small functions (such as IsHTMLWhitespace).  In the GCC 5 timeframe I
> > tuned this parameter to 20% based on Firefox LTO benchmarks, but I was
> > not that serious about performance since my setup was not giving very
> > reproducible results for sub-5% differences on tp5o.  Since we plan to
> > enable LTO by default for Tumbleweed, I need to find something that does
> > not cause too many regressions while keeping the code size advantage
> > over non-LTO.
>
> From my understanding, the performance regression of LTO relative to
> non-LTO is caused by some small but important functions that can no longer
> be inlined with LTO: more functions are eligible for inlining with LTO, so
> the original value of inline-unit-growth becomes relatively smaller.

Yes, with whole program optimization most function calls are inlinable, while
in a normal non-LTO build most function calls are external.  Since there are
a lot of small functions called cross-module in modern C++ programs, the
inliner simply runs out of its limits before getting to some of the useful
inline decisions.

I was poking at this for a while, but did not really have very good testcases
available, which made it difficult to judge the code size/performance
tradeoffs here.  With Firefox I can measure things better now, and it is
clear that 20% growth is just too small.  It is small even with profile
feedback, where the compiler knows quite well which calls to inline, and more
so without.

> While increasing the value of inline-unit-growth for LTO is one approach to
> resolving this issue, might adjusting the sorting heuristic to give those
> important, smaller routines higher priority for inlining be another and
> better approach?
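To make the cross-module point concrete, here is a minimal two-file sketch
with hypothetical names (IsHTMLWhitespaceLike is only modeled on the
IsHTMLWhitespace case mentioned above, not taken from Firefox): without -flto
the call in caller.cc targets an external symbol the inliner never sees,
while with -flto both bodies are visible at link time and the call becomes an
ordinary inline candidate charged against inline-unit-growth.

/* helper.cc -- hypothetical small function living in another translation
   unit (the name is only modeled on the IsHTMLWhitespace example above).  */
bool IsHTMLWhitespaceLike(char c)
{
  return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/* caller.cc -- only the declaration is visible in this unit.  */
bool IsHTMLWhitespaceLike(char c);

int CountLeadingWhitespace(const char *s)
{
  int n = 0;
  while (IsHTMLWhitespaceLike(s[n]))  /* external call without LTO */
    n++;
  return n;
}

/* Without LTO the inliner never sees the callee body:
     g++ -O3 -c helper.cc caller.cc
   With LTO the call above becomes an ordinary inline candidate and is
   charged against --param inline-unit-growth:
     g++ -O3 -flto -fPIC -shared helper.cc caller.cc -o libdemo.so  */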
Yes, I have also reworked the inline metrics somewhat and spent quite some
time looking into dumps to see that it behaves reasonably.  There were two
ages-old bugs I fixed in the last two weeks, and some time ago I also added
some extra tricks like penalizing cross-module inlines.  Given that even with
profile feedback I am not able to sort the priority queue well, and neither
can Clang do the job, I think that is good motivation to adjust the
parameter, which I had set somewhat arbitrarily at a time when I was not able
to test it well.

Honza
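For reference, a self-contained sketch of the -O2/-O3 split and the growth
parameter discussed above; the file and function names are hypothetical, and
the value 40 simply matches the measured row in the table.

/* inline_growth_demo.cc -- hypothetical example, not from the mail.

   g++ -O2 -c inline_growth_demo.cc
     -finline-small-functions: inline only when the function is declared
     inline or code size is expected to shrink.
   g++ -O3 --param inline-unit-growth=40 -c inline_growth_demo.cc
     -finline-functions: auto-inline larger bodies too, but stop once the
     unit has grown by more than 40%.  */

#include <cstddef>

/* Tiny predicate: inlining it is expected to shrink code, so -O2's
   -finline-small-functions can already inline it at call sites.  */
bool IsSpace(char c)
{
  return c == ' ' || c == '\t';
}

/* Larger helper, not declared inline: -O2 normally leaves the call below
   out of line, while -O3 may auto-inline it as long as the unit stays
   within the growth budget.  */
std::size_t CountWords(const char *s, std::size_t n)
{
  std::size_t words = 0;
  bool in_word = false;
  for (std::size_t i = 0; i < n; i++)
    {
      const bool space = IsSpace(s[i]);
      if (!space && !in_word)
        words++;
      in_word = !space;
    }
  return words;
}

std::size_t CountWordsInBuffer(const char *s, std::size_t n)
{
  return CountWords(s, n);  /* candidate call site for -finline-functions */
}

Inline decisions for such a unit can be inspected with -fdump-ipa-inline, the
kind of dump referred to above.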