> > 
> > In general this parameter affects primarily -O3 builds, because -O2
> > hardly hits the limit. At -O3 only programs with very large units are
> > affected (-O2 units hit the limit only if you have a lot of inline
> > hints in the code).
> I don’t quite understand here; what’s the major difference in inlining between
> -O3 and -O2?
> (I see -finline-functions is enabled for both -O3 and -O2.)

-O2 has -finline-small-functions, where we inline only when a function is
declared inline or code size is expected to shrink after the inlining.
-O3 has -finline-functions, where we auto-inline a lot more.
> 
> > 
> > In my test bed this included Firefox with or without LTO, because they do
> > "poor man's" LTO by #including multiple .cpp files into a single unified
> > source, which makes the average unit large.  Also tramp3d and DLV from our C++
> > benchmark suite are affected. 
> > 
> > I have some data on Firefox and I will build the remaining ones:
> In the following, are the data for code size? Is the optimization level -O3?
> What does PGO mean?

Those are sizes of libxul, which is the largest library of Firefox.
PGO is profile-guided optimization.
> > 
> > growth         LTO+PGO    PGO        LTO        none       -finline-functions
> > 20 (default)   83752215   94390023   93085455   103437191  94351191
> > 40             85299111   97220935   101600151  108910311  115311719
> > clang          111520431             114863807  108437807
> > 
> > Build time differences are within the noise of my setup, but they are less
> > pronounced than the code size differences. I think at most 1 minute out of 100.
> > Note that Firefox consists of 6% Rust code that is not built by GCC,
> > and building that consumes over half of the build time.
> > 
> > The problem I am trying to solve here is to get consistent LTO
> > performance improvements compared to non-LTO. Currently there are
> > some regressions:
> > https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b6ba1ebfe913d152989495d8cb450bce02f27d44&newProject=try&newRevision=c7bd18804e328ed490eab707072b3cf59da91042&framework=1&showOnlyComparable=1&showOnlyImportant=1
> > All those regressions go away with the limit increase.
> 
> 
> > I tracked them down to the fact that we no longer inline some very small
> > functions (such as IsHTMLWhitespace).  In the GCC 5 timeframe I
> > tuned this parameter to 20% based on Firefox LTO benchmarks, but I was
> > not that serious about performance, since my setup was not giving very
> > reproducible results for sub-5% differences on tp5o. Since we plan to
> > enable LTO by default for Tumbleweed, I need to find something that does
> > not cause too many regressions while keeping the code size advantage
> > over non-LTO.
> 
> From my understanding, the performance regression of LTO relative to non-LTO
> is caused by some small but important functions that can no longer be inlined
> with LTO, because more functions are eligible for inlining under LTO and the
> original value of inline-unit-growth therefore becomes relatively smaller.

Yes, with whole-program optimization most function calls are inlinable,
while in a normal non-LTO build most function calls are external.
Since modern C++ programs have a lot of small functions called
cross-module, the inliner simply runs out of its limits before making
some of the useful inline decisions.

I was poking at this for a while, but did not really have very good
testcases available, which made it difficult to judge the code
size/performance tradeoffs here.  With Firefox I can measure things better
now, and it is clear that 20% growth is just too small. It is small even
with profile feedback, where the compiler knows quite well which calls to
inline, and more so without.
> 
> While increasing the value of inline-unit-growth for LTO is one approach to
> resolving this issue, might adjusting the sorting heuristic to give those
> important and smaller routines higher priority for inlining be another,
> better approach?

Yes, I have also reworked the inline metrics somewhat and spent quite
some time looking into dumps to see that it behaves reasonably.  There
were two ages-old bugs I fixed in the last two weeks, and some time ago I
also added some extra tricks, like penalizing cross-module inlines. Given
the fact that even with profile feedback I am not able to sort the
priority queue well, and neither can Clang do the job, I think it is good
motivation to adjust the parameter, which I had set somewhat arbitrarily
at a time when I was not able to test it well.

Honza
