Then you'd have to analyze the compile-time impact of the IPA
splitting on its own when not iterating. Then you should look
at which optimizations were actually performed
that led to the improvement (I can see some indirect inlining
happening, but everything else would be a bug in present
optimizers in the early pipeline - they are all designed to be
roughly independent of each other and _not_ expose new
opportunities by iteration). Thus - testcases?
The initial motivation for the patch was to enable more indirect
inlining and devirtualization opportunities.
Hm.
It is the proprietary codebase of my employer that these optimizations
were developed for. Multiple iterations specifically help propagate the
concrete type information from functions that implement the
Abstract Factory design pattern, allowing for cleaner runtime dynamic
dispatch. I can verify that in said codebase (and in the reduced,
non-proprietary examples Maxim provided earlier in the year) it works
quite effectively.
Many of the devirt examples focus on a pure top-down approach like this:
class I { public: virtual void f() = 0; };
class K : public I { public: virtual void f() {} };
class L : public I { public: virtual void f() {} };
void g(I& i) { i.f(); }
int main(void) { L l; g(l); return 0; }
While that strategy isn't unheard of, it implies a link-time substitution
to inject new/different sub-classes of the parameterized interface.
Besides limiting extensibility by requiring a rebuild/relink, it also
presupposes that two different implementations would be mutually exclusive
for that module. That is often not the case, hence the factory pattern
expressed in the other examples Maxim provided.
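For contrast, here is a minimal sketch of the factory shape being described. All names (Codec, FastCodec, SmallCodec, Kind, make_codec, total_cost) are hypothetical and are not taken from Maxim's examples or from the proprietary codebase. It only illustrates that both implementations coexist in one module and that callers see nothing but the interface.

```cpp
#include <memory>

// Hypothetical interface and implementations, for illustration only.
struct Codec {
    virtual ~Codec() = default;
    virtual int cost() const = 0;
};

struct FastCodec : Codec { int cost() const override { return 1; } };
struct SmallCodec : Codec { int cost() const override { return 2; } };

enum class Kind { Fast, Small };

// Factory: the concrete type is chosen here and hidden behind Codec.
std::unique_ptr<Codec> make_codec(Kind k) {
    if (k == Kind::Fast)
        return std::unique_ptr<Codec>(new FastCodec);
    return std::unique_ptr<Codec>(new SmallCodec);
}

// Both implementations are used side by side in the same module.
// Each cost() call is virtual unless the optimizer can recover the
// concrete type that make_codec produced for that call site.
int total_cost() {
    std::unique_ptr<Codec> a = make_codec(Kind::Fast);
    std::unique_ptr<Codec> b = make_codec(Kind::Small);
    return a->cost() + b->cost();
}
```

Unlike the top-down example, nothing needs to be swapped at link time: adding a third implementation only touches the factory.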
Since then I found the patch to be helpful in searching for optimization
opportunities and bugs. E.g., SPEC2006's 471.omnetpp drops 20% with 2
additional iterations of early optimizations [*]. Given that applying
more optimizations should, theoretically, not decrease performance, there
is likely a very real bug or deficiency behind that.
It is likely early SRA that messes up, or maybe convert switch. Early
passes should be really restricted to always profitable cleanups.
Your experiment looks useful to track down these bugs, but in general
I don't think we want to expose iterating early passes.
In these other more top-down examples of devirt I mention above, I agree
with you. Once the CFG is ordered and the analyses happen, things should
be propagated forward without issue. In the case of factory functions, my
understanding and experience on this real-world codebase is that multiple
passes are required. First, to "bubble up" the concrete type info coming
out of the factory function. Depending on how many layers, it may require
a couple. Second, to then forward propagate that concrete type information
for the pointer.
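A minimal sketch of the layered case, assuming two levels of factory wrappers. The names (Shape, Triangle, make_triangle, make_default_shape, count_sides) are hypothetical and only stand in for the kind of code described above. The comments mark where the type information would have to travel; they make no claim about what any particular GCC version does with this input.

```cpp
// Hypothetical interface with a single concrete implementation.
struct Shape {
    virtual ~Shape() {}
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    int sides() const { return 3; }
};

// Layer 1: the only place where the concrete type is visible.
Shape* make_triangle() { return new Triangle; }

// Layer 2: a wrapper that hides the concrete type behind one more call.
Shape* make_default_shape() { return make_triangle(); }

int count_sides() {
    // The fact "this is a Triangle" has to bubble up through both
    // factory layers before it is known at this point.
    Shape* s = make_default_shape();
    // Only then can it be propagated forward to resolve this
    // virtual call to Triangle::sides.
    int n = s->sides();
    delete s;
    return n;
}
```

Each extra wrapper layer adds one more level the type information has to cross, which is the "depending on how many layers" part of the argument.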
There was a surprising side-effect when I started experimenting with this
ipa-passes feature. In a module that contains ~100KLOC, I implemented
mega-compilation (a poor-man's LTO). At two passes, the module got larger,
which I expected. This minor growth continued with each additional pass,
until at about 7 passes it decreased by over 10%. I set up a script
to run overnight to incrementally try passes and record the module size,
and the "sweet spot" ended up being 54 passes as far as size. I took the
three smallest binaries and did a full performance regression at the
system level, and the smallest binary's inclusion resulted in an ~6%
performance improvement (measured as overall network I/O throughput) while
using less CPU on a Transmeta Crusoe-based appliance. (This is a web
proxy, with about 500KLOC of other code that was not compiled in this new
way.)
The idea of multiple passes resulting in a smaller binary and higher
performance was like a dream. I reproduced a similar pattern on open
source projects, namely scummvm (on which I was able to use proper LTO)*.
That is, smaller binaries resulted as well as decreased CPU usage. On some
projects, this could possibly be correlated with micro-level benchmarks
such as reduced branch mispredictions and L1 cache misses as reported by
callgrind.
While it's possible/probable that some of the performance improvements I
saw by increasing ipa-passes were ultimately missed-optimization bugs that
should be fixed, I'd be very surprised if *all* of those improvements were
the case. As such, I would still like to see this exposed. I would be
happy to file bugs and help test any instances where it looks like an
optimization should have been achieved within a single ipa-pass.
Thanks for helping to get this feature (and the other devirt-related
pieces) into 4.7 -- it's been a huge boon to improving our C++ designs
without sacrificing performance.
* Note that scummvm's "sweet spot" number of iterations was
different. That being said, the default of three iterations to make the
typical use of Factory pattern devirtualize correctly still resulted in
improved performance over a single pass -- just not necessarily a smaller
binary.
--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt