Hi, so after reading the whole discussion, here are some of my thoughts for a sanity check, combined into one mail to reduce inbox pollution ;)
Concerning Richard's cost tweaks: there is a longer story behind why I like it ;) I originally considered further tweaking of the cost function a mostly lost cause, since the representation of the program at that point is way too close to the source level to estimate its cost properly. This got even a bit worse in 4.0 times with the gimplifier introducing not very predictable noise. Instead I went with the plan of optimizing functions early, which ought to give better estimates. It seems to me that we need to know both the code size and the expected time consumed by a function to have a chance of predicting the benefits in some way. On tree-profiling, together with some local patches I hope to sort out soonish, I am mostly there and I did some limited benchmarking. Overall the early optimization seems to do a good job for SPEC (over 1% speedup in whole-program mode is more than I expected), but it does almost nothing for the C++ testcases (about 10% speedup on POOMA and about 0 on Gerald's application). I believe the reason is that the C++ testcases consist of little functions that are unoptimizable by themselves, so the context is not big enough.

In parallel with Richard's efforts, I thought that the problem there is indeed the "abstraction functions", i.e. functions that just accept arguments and call another function or return some field (see the small example below). There is an extremely high number of those: from early profiling one can see that for every operation executed in the resulting program, there are hundreds of function calls eliminated by inlining. Clearly, with any inlining limits, if the cost function assigns a non-zero cost to such forwarders, we are going to have a difficult time finding thresholds. I planned to write pattern matching for these functions to bump them down to 0 cost, but Richard's patch looks like a pretty interesting idea. His results with the limits set to 50 show that he indeed managed to make those forwarders very cheap, so I believe the idea might work well with some additional tweaking.

The only thing I am afraid of is that the number of inlines will no longer be a linear function of the code size estimate increase, which is limited to a linear fraction of the whole unit. However, only "forwarders" having at most one call come out free, so this is still dominated by the longest path in the callgraph consisting of such functions. Unfortunately that path can be long and we can produce _a lot_ of garbage inlining them. One trick I came across is to do two-stage inlining: first inline just the functions whose growth estimates are <= 0 in a bottom-up walk, do early optimizations to reduce the garbage, and only then do the "real inlining job" (sketched below). This way we might throttle the amount of garbage produced by the inliner and get more realistic estimates of the function bodies, but I am not at all sure about this. It would definitely also help profiling performance on tramp3d.
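To make the "abstraction functions" above a bit more concrete, here is the kind of C++ forwarder chain I have in mind (a made-up miniature, not actual code from POOMA or Gerald's application):

    // Typical abstraction functions: no real work, they only pass arguments
    // through or return a field.  After inlining the whole chain collapses
    // into a single load.
    template <class T>
    struct Wrapper
    {
      T value_;
      const T &get () const { return value_; }          // returns a field
    };

    template <class T>
    const T &deref (const Wrapper<T> &w) { return w.get (); }  // forwards

    template <class T>
    T read (const Wrapper<T> &w) { return deref (w); }         // forwards again

If each of these levels is charged a non-zero cost, three "calls" count against the limits even though read() should boil down to a plain load.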
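And a very rough sketch of the two-stage idea, just to be explicit about the intended ordering; the types and helpers are illustrative stand-ins, not the real cgraph data structures or API:

    #include <vector>

    struct call_site;
    struct callgraph_node { std::vector<call_site *> callees; };
    struct call_site { callgraph_node *caller, *callee; };

    // Empty stubs standing in for the real machinery.
    int  estimate_growth (call_site *) { return 0; }   // size delta if inlined
    void inline_call (call_site *) {}                  // perform the inline
    void early_optimize (callgraph_node *) {}          // cheap cleanups
    void inline_by_priority (std::vector<callgraph_node *> &) {}  // usual heuristics

    void
    two_stage_inline (std::vector<callgraph_node *> &bottom_up_order)
    {
      // Stage 1: bottom-up, only inlines whose growth estimate is <= 0,
      // so the forwarders collapse before their callers are measured.
      for (callgraph_node *node : bottom_up_order)
        {
          for (call_site *cs : node->callees)
            if (estimate_growth (cs) <= 0)
              inline_call (cs);
          early_optimize (node);   // reduce garbage, refresh size estimates
        }

      // Stage 2: the "real inlining job" on the cleaned-up bodies.
      inline_by_priority (bottom_up_order);
    }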
Concerning -fobey-inline: I really doubt this is going to help C++ programmers. I think it might be useful to the kernel, and I can make a slightly cleaner implementation (without changes in the frontends) if there is a really good use for it. Can someone point me to an existing codebase where -fobey-inline brings considerable improvements over the default inlining heuristics? I've seen a lot of arguing in this direction, but never an actual bigger application that needs it. It might also be possible to strengthen the weight the "inline" keyword has in the heuristics - either multiply the priority by 2 for functions declared inline so those candidates get in first, or do two-stage inlining, first for inline functions and then for the auto-inline candidates. But this is probably not going to help the folks complaining mostly about -O2 ignoring inline, right?

Concerning multiple heuristics: I really don't like this idea ;) I still think we can make one set of heuristics adapt to the programming style it is fed, precisely because programs often consist of a mix of such styles.

Concerning compilation time/speed tradeoffs: since the whole task of the inliner is to slow down the compiler in order to improve the resulting code, it is difficult to blame it for doing its job. With the original heuristics I was in an easy position - the pre-cgraph code produced just about the right amount of inlining, so it was easy to speed up both the compiler and the generated code - but now we obviously do too little inlining, so we need to expect some slowdowns. I would call the heuristics a success if they result in faster and smaller code; compilation time is somewhat secondary. However, on code bases (like SPEC) where the extra inlining doesn't help, we definitely should not slow down seriously (over 1%, I guess).

Concerning growth limits: if you take a look at when the -finline-unit-growth limit hits, it is clear that it hits very often on small units (several times in the kernel, the testsuite and such) just because there is tiny space to maneuver there. It hits almost never on medium units (in a GCC bootstrap it almost never triggers) and almost always on big units. My intuition has always been that for larger units the limits should be much smaller, and POOMA was the major counterexample. If we succeed in solving this, I would guess we can introduce something like a small-unit-insns limit and only apply the growth limit to units that exceed it (rough sketch in the P.S. below). Does this sound sane?

Concerning 4.0 timing: I agree that we should have started a month or two ago, but unfortunately I wasn't able to do any useful work at that time. Tuning much earlier in the 4.0 cycle was unprofitable anyway, since the compiler was just moving too fast. I experimented with this around tree-ssa merge time, but I basically ended up with a slowdown that would have shot it off its release criteria, so I didn't want to interfere with it. We have solved a number of problems since then, so tuning now is more pleasant ;) In 3.4 we also tuned late, and it actually seems a sane step to me. It is not difficult to compute the code size estimates before we go into GIMPLE to get a more apples-to-apples comparison. While tree-SSA scored pretty badly in this test, when I tried it in mid-December things were quite comparable to -O2 compilation time. The problem with the inliner is that it depends a lot on the rest of the compiler....

Overall I would like to continue with the pre-inline work, attempt to tune Richard's idea of the inlining cost function, and sort out the quadratic issues in cgraph if they really show up now; let's hope we end up with something generally usable soon. If simple changes to the cost function would help 4.0, I would still like to consider them, and similarly Richard's doubly-linked stuff, as his benchmarks look pretty convincing ;)

Honza
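P.S. To put a number on why the unit growth limit hits small units so hard, and to show one possible reading of the small-unit idea (the parameter name small-unit-insns and the max() form are just my placeholder, not a worked-out proposal):

    #include <algorithm>

    // With a 50% growth limit, a 200-insn unit may only grow to 300 insns,
    // so a couple of medium inlines exhaust it, while a 100000-insn unit
    // gets 50000 insns of headroom.  A fixed floor for small units would
    // avoid the first problem without touching the big units:
    int
    max_unit_size (int unit_insns, int growth_percent, int small_unit_insns)
    {
      int grown = unit_insns + unit_insns * growth_percent / 100;
      return std::max (grown, small_unit_insns);
    }
    // e.g. max_unit_size (200, 50, 10000) == 10000, while
    //      max_unit_size (100000, 50, 10000) == 150000.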