On Fri, Apr 30, 2010 at 11:12 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> >
>> > Interesting. My plan for profiling with LTO is to ultimately make it
>> > a link-time transform. This will be more difficult with WHOPR (i.e.
>> > instrumenting needs function bodies that are not available at WPA
>> > time), but I believe it is solvable: just assign uids to the edges
>> > and do instrumentation at ltrans. Then we will save the cgraph
>> > profile in some easier way so WHOPR can read it in and read the rest
>> > of the stuff in ltrans. This would involve shipping the correct
>> > profiles for a given function etc., so it will be a bit of an
>> > implementation challenge.
>>
>> This can be tricky -- to maximize FDO benefit, the
>> profile-use/annotation needs to happen early, which means
>> instrumentation also needs to happen early (to avoid cfg mismatches).
>
> I don't see much problem in this particular area.
>
> The GCC optimization queue is organized so that we first do early
> optimizations that are all intended to be simple cleanups without
> size/speed tradeoffs. Then we do IPA and late optimizations that are
> both driven by profile (estimated or read).
> Profile reading happens early because we use the same infrastructure
> for gcov and profile feedback. This does not give profile feedback any
> better benefit, quite the converse, since early passes may not be able
> to update the profile precisely and we also get higher profile overhead.
>
> So I think decoupling gcov and profile feedback and pushing profile
> feedback back in the queue is going to be a win.
>

There are two parts of profile feedback:

1) cfg edge count annotation. Yes, most of the early phases (other than
possibly einline-2) do not need or depend on it, so it can probably be
pushed back (in fact the static/guessed profile pass is later).

2) value profile transformations. This part may benefit more from being
done early -- not only because of more cleanups, but also due to the
requirement for getting a more precise inline summary.

> Yes, optimization must match, but with LTO this is not a problem and in
> general the early optimization should be stable wrt memory layout
> (nothing else changes). This used to be exercised before profiling was
> updated to tree level in 4.x.

You mean CFG layout is stable? But ccp, copy_prop, dce, tail recursion
etc. can all change the cfg.

>
> I would be very interested in the low overhead support - there is a lot
> to gain, especially because the profiling results are less dependent on
> setup and can be better reused. I know part of the code was contributed
> (the support for reading not 100% valid profiles). Is there any extra
> info available on this?
>

For profile smoothing, Neil may point to more information.

> Main problem IMO is how to get profile into WHOPR without having
> function bodies.
> I guess we will end up with summarizing the info in a WHOPR-friendly
> way and letting it stream the other counters to LTRANS, which will
> annotate the function body once read in from the file.
>> I am a little lost here :)
>>
>> >
>> >> 2) comdat function resolution -- since LIPO uses aux module functions
>> >> for inlining purpose only, it has the freedom to choose which copy to
>> >> use. The current scheme chooses copy in current module with priority
>> >> for better profile data context sensitivity (see below)
>> >
>> > This is interesting. How do you solve the problem when a given comdat
>> > function "loses"? I.e. it is replaced at linktime by another function
>> > that may or may not be profiled from another unit?
>>
>> Whatever function is selected will have profile data (assuming it
>> is called at runtime) -- but the profile data are merged from different
>> contexts, including from calls in different modules. For instance, if
>> both a.C and b.C define foo, b.C:foo is selected at runtime, and
>> a.C:foo is not inlined (after instrumentation) anywhere in a.C, then
>> a.C:foo won't have any profile data, and b.C:foo has merged profile
>> data resulting from calls in both a.C and b.C.
>
> Yes, but this is what I am concerned about. Without LTO, at least when
> compiling a.C with profile feedback, we will have foo with 0 counts.
> We might however work out that calls of foo are frequent and decide to
> inline foo. We will take the counts and rescale, resulting in inlining
> foo optimized for size.

Not always ideal though -- scaling does not expose whether foo is hot or
not (the call edge may be cold, but is still worth inlining).

>
> When comdats are resolved within LTO, this will not be a big deal, but
> LTO still produces comdats that are later resolved with library etc.,
> so we don't solve the problem this way.
> At the very least we should be able to figure out that we have a
> function that has no profile and do something more sane.

You mean LTO does not discard duplicate bodies? Why?

>
> Do you have any idea how common these scenarios are?

I don't have direct data, but I think it can be common.

Thanks,

David

>
> Honza
>
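
To make the a.C/b.C case above concrete, here is a minimal sketch of what rescaling does to an inlined copy whose comdat body lost at link time. This is not GCC code: the function names, the integer arithmetic, and the "called but zero entry count" check are all assumptions made for illustration only.

```python
def scale_inlined_counts(callee_bb_counts, callee_entry_count, call_edge_count):
    """Rescale a callee's basic-block counts for one inlined copy.

    Each block count is multiplied by call_edge_count / callee_entry_count,
    i.e. the share of the callee's executions that came through this edge.
    (Hypothetical helper, not a GCC function.)
    """
    if callee_entry_count == 0:
        # No counts were recorded for this body, so there is nothing to
        # scale from: every block of the inlined copy ends up with 0.
        return [0 for _ in callee_bb_counts]
    return [c * call_edge_count // callee_entry_count for c in callee_bb_counts]


def has_usable_profile(callee_entry_count, call_edge_count):
    """Hypothetical check: a callee that is called through a counted edge
    but has a zero entry count has a missing profile, not a cold one."""
    return not (callee_entry_count == 0 and call_edge_count > 0)


# b.C:foo was the copy selected at run time, so it carries the merged counts.
b_foo = scale_inlined_counts([1000, 900, 100], 1000, 250)

# a.C:foo never ran, so its counters are all zero even though the call
# edge in a.C executed 750 times.
a_foo = scale_inlined_counts([0, 0, 0], 0, 750)
```

With these made-up numbers the inlined copy of a.C:foo gets all-zero counts and would be treated as cold, while has_usable_profile(0, 750) flags it as a body with no profile rather than a cold one, which is the case where something more sane than plain scaling would be wanted.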