Re: [PATCH] Add capability to run several iterations of early optimizations

Matt Thu, 27 Oct 2011 14:53:50 -0700

Then you'd have to analyze the compile-time impact of the IPA
splitting on its own when not iterating. Then you should look
at what actually was the optimizations that were performed
that lead to the improvement (I can see some indirect inlining
happening, but everything else would be a bug in present
optimizers in the early pipeline - they are all designed to be
roughly independent on each other and _not_ expose new
opportunities by iteration). Thus - testcases?

The initial motivation for the patch was to enable more indirect
inlining and devirtualization opportunities.

Hm.

It is the proprietary codebase of my employer that these optimizations were developed for. Multiple iterations specifically help propagate the concrete type information from functions that implement the Abstract Factory design pattern, allowing for cleaner runtime dynamic dispatch. I can verify that in said codebase (and in the reduced, non-proprietary examples Maxim provided earlier in the year) it works quite effectively.


Many of the devirt examples focus on a pure top-down approach like this:
class I { public: virtual void f() = 0; };
class K : public I { virtual void f() {} };
class L: public I { virtual void f() {} };
void g(I& i) { i.f(); }
int main(void) { L l; g(l); return 0; }

While that strategy isn't unheard of, it implies a link-time substitution to inject new/different sub-classes of the parameterized interface. Besides limiting extensibility by requiring a rebuild/relink, it also presupposes that two different implementations would be mutually exclusive for that module. That is often not the case, hence the factory pattern expressed in the other examples Maxim provided.

Since then I found the patch to be helpful in searching for optimization opportunities and bugs. E.g., SPEC2006's 471.omnetpp drops 20% with 2 additional iterations of early optimizations [*]. Given that applying more optimizations should, theoretically, not decrease performance, there is likely a very real bug or deficiency behind that.

It is likely early SRA that messes up, or maybe convert switch.  Early
passes should be really restricted to always profitable cleanups.

Your experiment looks useful to track down these bugs, but in general
I don't think we want to expose iterating early passes.

In these other more top-down examples of devirt I mention above, I agree with you. Once the CFG is ordered and the analyses happen, things should be propagated forward without issue. In the case of factory functions, my understanding and experience on this real-world codebase is that multiple passes are required. First, to "bubble up" the concrete type info coming out of the factory function. Depending on how many layers, it may require a couple. Second, to then forward propagate that concrete type information for the pointer.
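
A minimal sketch of the layered factory shape described above. All names here (Shape, Triangle, Square, make_shape, make_default_shape, count_default_sides) are hypothetical illustrations, not taken from the proprietary codebase or from Maxim's examples. The concrete type is known only inside the innermost factory; it would first have to "bubble up" through the wrapper, and only then could the virtual call in the caller be resolved. Whether a given compiler actually devirtualizes this, and in how many iterations, is not something the sketch demonstrates.

```cpp
#include <memory>

// Hypothetical interface; callers only ever see this type.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

// Hypothetical concrete implementations that coexist in one module.
struct Triangle : Shape { int sides() const override { return 3; } };
struct Square   : Shape { int sides() const override { return 4; } };

enum class Kind { Triangle, Square };

// Innermost factory: the only place where the concrete type is visible.
static std::unique_ptr<Shape> make_shape(Kind k) {
    if (k == Kind::Triangle)
        return std::unique_ptr<Shape>(new Triangle);
    return std::unique_ptr<Shape>(new Square);
}

// One wrapper layer. After make_shape is inlined here with a constant
// argument, the result is known to be a Square (the "bubble up" step).
static std::unique_ptr<Shape> make_default_shape() {
    return make_shape(Kind::Square);
}

// Caller: the call below is virtual in the source. It can only become a
// direct call once the concrete type has reached this function and is
// then propagated forward to the call site.
int count_default_sides() {
    std::unique_ptr<Shape> s = make_default_shape();
    return s->sides();
}
```

Calling count_default_sides() returns 4 regardless of optimization level; the interesting difference would only show up in the generated code, e.g. whether an indirect call through the vtable remains.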

There was a surprising side-effect when I started experimenting with this ipa-passes feature. In a module that contains ~100KLOC, I implemented mega-compilation (a poor-man's LTO). At two passes, the module got larger, which I expected. This minor growth continued with each additional pass until, at about 7 passes, it decreased by over 10%. I set up a script to run overnight to incrementally try passes and record the module size, and the "sweet spot" ended up being 54 passes as far as size. I took the three smallest binaries and did a full performance regression at the system level, and the smallest binary's inclusion resulted in an ~6% performance improvement (measured as overall network I/O throughput) while using less CPU on a Transmeta Crusoe-based appliance. (This is a web proxy, with about 500KLOC of other code that was not compiled in this new way.)

The idea of multiple passes resulting in a smaller binary and higher performance was like a dream. I reproduced a similar pattern on open source projects, namely scummvm (on which I was able to use proper LTO)*. That is, smaller binaries resulted as well as decreased CPU usage. On some projects, this could possibly be correlated with micro-level benchmarks such as reduced branch prediction and L1 cache misses as reported by callgrind.

While it's possible/probable that some of the performance improvements I saw by increasing ipa-passes were ultimately missed-optimization bugs that should be fixed, I'd be very surprised if *all* of those improvements were the case. As such, I would still like to see this exposed. I would be happy to file bugs and help test any instances where it looks like an optimization should have been gotten within a single ipa-pass.

Thanks for helping to get this feature (and the other devirt-related pieces) into 4.7 -- it's been a huge boon to improving our C++ designs without sacrificing performance.

* Note that scummvm's "sweet spot" number of iterations was different. That being said, the default of three iterations to make the typical use of Factory pattern devirtualize correctly still resulted in improved performance over a single pass -- just not necessarily a smaller binary.




--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt
