Then you'd have to analyze the compile-time impact of the IPA
splitting on its own when not iterating.  Then you should look
at which optimizations were actually performed
that led to the improvement (I can see some indirect inlining
happening, but everything else would be a bug in present
optimizers in the early pipeline - they are all designed to be
roughly independent of each other and _not_ expose new
opportunities by iteration).  Thus - testcases?

The initial motivation for the patch was to enable more indirect
inlining and devirtualization opportunities.

Hm.

It is the proprietary codebase of my employer that these optimizations were developed for. Multiple iterations specifically help propagate the concrete type information from functions that implement the Abstract Factory design pattern, allowing for cleaner runtime dynamic dispatch. I can verify that in said codebase (and in the reduced, non-proprietary examples Maxim provided earlier in the year) it works quite effectively.

Many of the devirt examples focus on a pure top-down approach like this:
class I { public: virtual void f() = 0; };
class K : public I { public: virtual void f() {} };
class L : public I { public: virtual void f() {} };
void g(I& i) { i.f(); }  // f() must be public in I for this call to compile
int main(void) { L l; g(l); return 0; }

While that strategy isn't unheard of, it implies a link-time substitution to inject new/different sub-classes of the parameterized interface. Besides limiting extensibility by requiring a rebuild/relink, it also presupposes that two different implementations would be mutually exclusive for that module. That is often not the case, hence the factory pattern expressed in the other examples Maxim provided.

Since then I found the patch to be helpful in searching for
optimization opportunities and bugs.  E.g., SPEC2006's 471.omnetpp drops 20% with 2 additional iterations of early optimizations [*].  Given that applying more optimizations should, theoretically, not decrease performance, there is likely a very real bug or deficiency behind that.

It is likely early SRA that messes up, or maybe convert switch.  Early
passes should be really restricted to always profitable cleanups.

Your experiment looks useful to track down these bugs, but in general
I don't think we want to expose iterating early passes.

In the other, more top-down devirt examples I mention above, I agree with you. Once the CFG is ordered and the analyses happen, things should be propagated forward without issue. In the case of factory functions, my understanding and experience on this real-world codebase is that multiple passes are required: first, to "bubble up" the concrete type info coming out of the factory function (depending on how many layers there are, this may require a couple of passes); second, to forward-propagate that concrete type information for the pointer.
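To make the factory case concrete, here is a minimal sketch of the shape I mean. It is not taken from the proprietary codebase or from Maxim's examples; the names make_l, make_default, use and run are made up for illustration. The concrete type L is only visible inside the innermost factory, so it has to bubble up through each wrapper (via inlining) before the virtual call in use() can be resolved to L::f.

```cpp
#include <cassert>
#include <memory>

struct I { virtual ~I() = default; virtual int f() const = 0; };
struct K : I { int f() const override { return 1; } };
struct L : I { int f() const override { return 2; } };

// Hypothetical factory: the concrete type L is known only inside this body.
std::unique_ptr<I> make_l() { return std::make_unique<L>(); }

// One extra layer that the concrete type has to "bubble up" through.
std::unique_ptr<I> make_default() { return make_l(); }

// The call site only sees the interface, so i.f() is a virtual call
// unless the optimizer can prove the dynamic type of i.
int use(const I& i) { return i.f(); }

// Illustrative driver: dispatches to L::f at run time regardless of
// whether the compiler manages to devirtualize the call.
int run() {
    std::unique_ptr<I> p = make_default();
    assert(p != nullptr);
    return use(*p);
}
```

Whether a given compiler version devirtualizes the call in run() depends on how many rounds of inlining and propagation it gets; the sketch only shows why more than one round can be needed, not what any particular GCC release does with it.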

There was a surprising side-effect when I started experimenting with this ipa-passes feature. In a module that contains ~100KLOC, I implemented mega-compilation (a poor-man's LTO). At two passes, the module got larger, which I expected. This minor growth continued with each additional pass until about 7 passes, when the size decreased by over 10%. I set up a script to run overnight to incrementally try passes and record the module size, and the "sweet spot" for size ended up being 54 passes. I took the three smallest binaries and did a full performance regression at the system level, and the smallest binary's inclusion resulted in an ~6% performance improvement (measured as overall network I/O throughput) while using less CPU on a Transmeta Crusoe-based appliance. (This is a web proxy, with about 500KLOC of other code that was not compiled in this new way.)

The idea of multiple passes resulting in a smaller binary and higher performance was like a dream. I reproduced a similar pattern on open source projects, namely scummvm (on which I was able to use proper LTO)*. That is, smaller binaries resulted as well as decreased CPU usage. On some projects, this could possibly be correlated with micro-level benchmarks such as reduced branch mispredictions and L1 cache misses as reported by callgrind.

While it's possible/probable that some of the performance improvements I saw by increasing ipa-passes were ultimately missed-optimization bugs that should be fixed, I'd be very surprised if *all* of those improvements were the case. As such, I would still like to see this exposed. I would be happy to file bugs and help test any instances where it looks like an optimization should have been gotten within a single ipa-pass.


Thanks for helping to get this feature (and the other devirt-related pieces) into 4.7 -- it's been a huge boon to improving our C++ designs without sacrificing performance.


* Note that scummvm's "sweet spot" number of iterations was different. That being said, the default of three iterations to make the typical use of the Factory pattern devirtualize correctly still resulted in improved performance over a single pass -- just not necessarily a smaller binary.



--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt
