> Teresa has done some tunings for the unroller so far. The inliner
> tuning is the next step.
> 
> >
> > What concerns me is that it is greatly inaccurate - you have no idea how
> > many instructions a given counter is guarding, and it can differ quite a
> > lot. Also inlining/optimization makes working sets significantly different
> > (by a factor of 100 for tramp3d).
> 
> The pre-ipa-inline working set is the one needed for ipa inliner
> tuning. For post-ipa-inline code-increasing transformations, some
> update is probably needed.
> 
> But on the other hand, any solution at this level will be greatly
> inaccurate. So I am curious how reliable the data you can get from this
> is. How do you take this into account in the heuristics?
> 
> This effort is just the first step to allow good heuristics to develop.
> 
> >
> > It seems to me that for this use, perhaps the simple logic in histogram
> > merging of maximizing the number of BBs for a given bucket will work
> > well?  It is inaccurate, but we are working with greatly inaccurate data
> > anyway.  Except for degenerate cases, the small and unimportant runs
> > will have small BB counts, while large runs will have larger counts, and
> > those are the ones we optimize for anyway.
> 
> The working set curve for each type of application contains lots of
> information that can be mined. The inaccuracy can also be mitigated by
> more data 'calibration'.

Sure, I think I am leaning towards trying solution 2) with maximizing
counter count merging (it would probably make sense to rename it from BB
count, since it is not really a BB count and the name is misleading), and we
will see how well it works in practice.
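
For concreteness, here is a minimal sketch of the merge rule I have in mind
(the bucket layout, bucket count and function name are made up for
illustration and are not the existing libgcov interface): for each histogram
bucket, keep whichever run contributed more counters, so the merged histogram
roughly tracks the largest train run instead of the sum of all runs.

/* Hypothetical histogram bucket; not the real libgcov data structure.  */
#define HIST_BUCKETS 252        /* assumed bucket count, illustration only */

typedef struct
{
  unsigned num_counters;        /* counters whose value falls in this range */
  long long min_value;          /* smallest counter value seen in the bucket */
  long long cum_value;          /* sum of counter values in the bucket */
} hist_bucket;

/* "Maximizing counter count" merge: per bucket, take the run with the
   larger number of counters rather than summing the two runs.  */
static void
merge_histogram_max (hist_bucket *dst, const hist_bucket *src)
{
  for (unsigned i = 0; i < HIST_BUCKETS; i++)
    if (src[i].num_counters > dst[i].num_counters)
      dst[i] = src[i];
}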

We get the benefit of far fewer issues with profile locking/unlocking, and we
lose a bit of precision on BB counts. I tend to believe that the error will
not be that important in practice. Another cost is more histogram streaming
into each gcda file, but with skipping zero entries it should not be a major
overhead problem, I hope.
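
On the streaming side, I imagine something along these lines (again just a
sketch reusing the hypothetical hist_bucket above; write_word/write_counter
stand in for whatever low-level emitters the gcov-io layer provides, they are
not actual gcov-io calls): emit the number of populated buckets first, then
only the nonzero entries, so gcda files from small runs stay small.

extern void write_word (unsigned);      /* hypothetical low-level emitters */
extern void write_counter (long long);

/* Stream only the populated histogram buckets into the gcda file.  */
static void
stream_histogram (const hist_bucket *hist)
{
  unsigned nonzero = 0;
  for (unsigned i = 0; i < HIST_BUCKETS; i++)
    if (hist[i].num_counters)
      nonzero++;

  write_word (nonzero);                  /* number of records to follow */
  for (unsigned i = 0; i < HIST_BUCKETS; i++)
    if (hist[i].num_counters)
      {
        write_word (i);                      /* bucket index */
        write_word (hist[i].num_counters);   /* counters in this bucket */
        write_counter (hist[i].cum_value);   /* cumulated counter value */
      }
}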

What do you think?
> 
> >>
> >>
> >> >  2) Do we plan to add some features in the near future that will
> >> > anyway require global locking?
> >> >     I guess LIPO itself does not count, since it streams its data into
> >> > an independent file as you mentioned earlier, and locking the LIPO
> >> > file is not that hard.
> >> >     Does LIPO stream everything into that common file, or does it use
> >> > a combination of gcda files and a common summary?
> >>
> >> Actually, LIPO module grouping information is stored in gcda files.
> >> It is also stored in a separate .imports file (one per object) ---
> >> this is primarily used by our build system for dependence information.
> >
> > I see; getting LIPO safe WRT parallel updates will be fun. How does LIPO
> > behave on GCC bootstrap?
> 
> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the
> main problem for application builds -- the link time (for debug builds)
> is.

I was primarily curious how LIPO's runtime analysis fares in a situation where
you do very many small train runs on a rather large app (sure, GCC is small
compared to Google's use case ;).
> 
> > (i.e. it does a lot more work in the libgcov module on each
> > invocation, so I am curious if it is practically useful at all).
> >
> > With an LTO-based solution, a lot can probably be pushed to link time?
> > Before the actual GCC starts from the linker plugin, the LIPO module can
> > read gcov CFGs from gcda files and do all the merging/updating/CFG
> > construction that is currently performed at runtime, right?
> 
> The dynamic cgraph build and analysis is still done at runtime.
> However, with the new implementation, the FE is no longer involved. The
> gcc driver is modified to understand module grouping, and lto is used to
> merge the streamed output from aux modules.

I see. Are there any fundamental reasons why it cannot be done at link time,
when all gcda files are available? Why is the grouping not done inside the
linker plugin?

Honza
> 
> 
> David
