> > I can go ahead with the histogram approach. There is some roundoff
> > error from the working set scaling approach that can affect different
> > merging orders as you note, although I think this only really affects the
> > small counter values. The other place where saving/merging the histogram
Do you have any intuition on why simple maximalization merging (which is
safe wrt ordering) would be a bad idea? We care only about the working set
size around the top of the histogram, and I would say that we should sort
of optimize for the largest (in the number of blocks in the hot area) of
the train runs. One way things will get messed up is when the working set
is about the same but the runs are of different sizes, so all the blocks
get accounted into two different buckets. But in general I do not think
there is a reasonably accurate way to merge the profiles without actually
streaming out all counter IDs in every bucket, so perhaps this will work
well enough. If not, we can in the future introduce a per-program global
summary file that will contain the counters to be merged accurately.

> > would help is when the distribution of counter values in the histogram
> > varies between runs, say for example, the hottest loop is much hotter in a
> > subsequent run, but the rest of the counter values stay largely consistent.
> > Note, however, that if the hotspots are different in different runs, then
> > merging either the histogram or the working set will have issues. The only
> > way to detect this is to recompute the histogram/working set from scratch
> > from the merged counters.
> >
> > I wonder in practice, even when there are a lot of simultaneous runs going
> > like in a gcc bootstrap, if we could get reasonably accurate summary
> > recomputation without global locking. The reason is that as long as the
> > actual counter updates are safe as they are now due to the individual file
> > locking, the inaccuracy in the recomputed summary information will not grow
> > over time, and there is a reasonable chance that the last run to complete
> > and merge will recompute the summary from the final merged counter values
> > and get it right (or only be off by a little bit if there are a couple of
> > runs completing simultaneously at the end). But this can be investigated as
> > a follow on to this patch.
>
> The main concern is probably the build reproducibility in gcc bootstrap
> with FDO.

Hmm, you mean: in the first pass update every file with new counters, and
in the second pass just update the summaries?

OK, so I guess we went through:

1) two pass updating with a race in between the passes.
2) two pass updating with the first pass updating the counters and the
   second having a race only for the summary update (i.e. no races for the
   counters).
3) two pass updating with flocking (and some way to handle detected
   deadlocks).
4) one pass updating with histogram merging + maximalization of the working
   set (we do not really need to scale the buckets; we can simply merge the
   histograms and then multiply them by nruns before comparing to the
   actual counters. This assumes that the working sets of all runs are
   about the same, but it should work reasonably well in practice, I
   think).
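To make 4) concrete, the merge itself would just take the per-bucket
maximum, and the scaling by nruns would only happen when the summary is
consumed. A completely untested sketch, with a simplified bucket layout and
made-up names rather than the real gcov histogram type:

#include <stdint.h>

#define SKETCH_HIST_SIZE 252   /* assumed bucket count, for illustration */

struct sketch_bucket
{
  uint32_t num_counters;  /* number of counters whose value falls here */
  uint64_t min_value;     /* smallest counter value seen in this bucket */
  uint64_t cum_value;     /* sum of the counter values in this bucket */
};

/* Merge SRC into DST bucket by bucket, keeping whichever run had the
   bigger hot area there (more counters, or the same number of counters
   but more execution weight).  */

static void
sketch_histogram_merge_max (struct sketch_bucket *dst,
                            const struct sketch_bucket *src)
{
  unsigned i;

  for (i = 0; i < SKETCH_HIST_SIZE; i++)
    if (src[i].num_counters > dst[i].num_counters
        || (src[i].num_counters == dst[i].num_counters
            && src[i].cum_value > dst[i].cum_value))
      dst[i] = src[i];
}

/* The counters in the merged gcda file are sums over all runs, while the
   maximalized histogram stays at single-run scale, so a cutoff taken from
   the histogram (e.g. a bucket's min_value) gets multiplied by the number
   of runs before being compared against an actual merged counter.  */

static int
sketch_counter_is_hot (uint64_t merged_counter, uint64_t hist_cutoff,
                       uint32_t nruns)
{
  return merged_counter >= hist_cutoff * (uint64_t) nruns;
}

The important property is that the merge is commutative, so the result does
not depend on the order in which the simultaneous runs happen to finish.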
I guess 3/4 are acceptable WRT bootstrap reproducibility. I have no
experience with flocking a large number of files, or with the portability
of that solution, e.g. to Windows. If you think that 2) would be too
inaccurate in practice and 3) has a chance to be portable, we could go for
that. It will solve the precision problems and will also work for the LIPO
summaries.

I would be curious about the effect on profiledbootstrap time if you
implement it.

Honza

> > David
> >
> > Thanks,
> > Teresa
> >
> >> > >> > 2) Do we plan to add some features in the near future that will
> >> > >> > anyway require global locking? I guess LIPO itself does not
> >> > >> > count, since it streams its data into an independent file as you
> >> > >> > mentioned earlier, and locking the LIPO file is not that hard.
> >> > >> > Does LIPO stream everything into that common file, or does it
> >> > >> > use a combination of gcda files and a common summary?
> >> > >>
> >> > >> Actually, LIPO module grouping information is stored in the gcda
> >> > >> files. It is also stored in a separate .imports file (one per
> >> > >> object) --- this is primarily used by our build system for
> >> > >> dependence information.
> >> > >
> >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does
> >> > > LIPO behave on GCC bootstrap?
> >> >
> >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not
> >> > the main problem for application build -- the link time (for debug
> >> > build) is.
> >>
> >> I was primarily curious how LIPO's runtime analysis fares in a
> >> situation where you do very many small train runs on a rather large app
> >> (sure, GCC is small compared to Google's use case ;).
> >> >
> >> > > (i.e. it does a lot more work in the libgcov module per each
> >> > > invocation, so I am curious if it is practically useful at all).
> >> > >
> >> > > With an LTO based solution a lot can probably be pushed to link
> >> > > time? Before actual GCC starts from the linker plugin, the LIPO
> >> > > module can read the gcov CFGs from the gcda files and do all the
> >> > > merging/updating/CFG construction that is currently performed at
> >> > > runtime, right?
> >> >
> >> > The dynamic cgraph build and analysis is still done at runtime.
> >> > However, with the new implementation, the FE is no longer involved.
> >> > The gcc driver is modified to understand module grouping, and lto is
> >> > used to merge the streamed output from aux modules.
> >>
> >> I see. Are there any fundamental reasons why it cannot be done at link
> >> time, when all gcda files are available? Why is the grouping not done
> >> inside the linker plugin?
> >>
> >> Honza
> >> >
> >> > David
> >
> > --
> > Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413