> I'm concerned about inaccuracies arising when multiple runs have
> different sizes and therefore counters are in different buckets in the
> corresponding histograms. We would end up in a situation where the
> merged histogram has more "counters" in it than there are actual
> counters in the program.
Yes, it is very inaccurate.

> > The strategy of scaling is more sensible, but it will also be sensitive to
> > the order of train runs, i.e. it will give unstable results with parallel
> > make.

> Regarding the approach of merging histograms by matching up counters
> in the same order and apportioning out the histogram cumulative values
> evenly between counters in the same bucket before merging, I don't
> think the ordering will have a huge effect. The reason is that
> counters we end up matching together and accumulating will stay in the
> same relative order, and the min counter values are simply summed to
> compute the new bucket index and new min counter value to record. For
> the cumulative value, since the incoming cumulative values will be
> apportioned evenly between the counters coming from the same bucket
> when computing the new cumulative value (which is then just another
> add), there could be some slight differences from different round-off
> errors when histograms are merged in different orders. But I would
> expect that to result in only small differences in the cumulative
> values in the histogram. Is that acceptable?

Yes, I think it will be a small difference and it will give reasonable
results in practice. I am only concerned about reproducibility of the
bootstraps that...

> Ok, I think that makes sense.
>
> If it is ok with you, I'd prefer to go with a 3-step approach:
> 1) Implement some form of histogram merging (one of the options we
>    discussed above). (Your approach #4.)
> 2) Implement approach #2 (restructure into 2 passes with counters
>    being updated accurately and a possible race on the histogram
>    computation).
> 3) Investigate and implement either approach #3 (some kind of global
>    lock) or the race detection you describe above.

Yes, this sounds like a good plan.

> > We will still have problems with bootstrap: the summary of libbackend.a
> > will depend on which of the compiler binaries was executed last, since cc1
> > will have a different summary from cc1plus because the frontend is
> > different. I would give extra points to a volunteer who changes libgcov to
> > do multiple summaries for different programs as intended originally (if
> > you search the archives in 2001-2003, somewhere there are patches that
> > change libgcov from not merging at all to merging always; I guess changing
> > it to merge just when the program checksums match is not that hard).

> In this case, I assume that currently the merge will abort because the
> # counters, etc. don't match?

We only abort when the number of counters in the current file does not match
(i.e. the gcda file is from a different compilation). We accept when the
overall program is different, i.e. when libbackend.a is linked into multiple
binaries. Since the histogram summing is global across all profiled
libraries, we will end up with different histograms depending on which
program using the given library was executed last.

To get this stable, I think it is only possible to go with #3 (i.e.
program-wide global locking) and get different records for different programs
(i.e. do a checksum of all objects libgcov sees and merge only with matching
checksums). In coverage.c we can then either choose the proper record when
doing LTO, knowing what program we are just compiling, or do histogram merging
in your way but after sorting the entries into some canonical order (i.e.
increasing checksum). So we can track this incrementally. I guess if you go
with step 1) as you describe, we can move ahead.
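Just to be sure we mean the same thing by the counter matching in step 1), I
imagine the merge roughly like the sketch below. It is only an illustration --
the struct layout, the bucketing function and the fixed limits are invented
here and are not the actual gcov-io.h/libgcov interfaces:

/* Simplified sketch of the counter-matching merge: the bucket layout,
   the bucketing function and the fixed limits are invented here and are
   not the real gcov-io.h/libgcov interfaces.  */

#include <string.h>

#define NUM_BUCKETS  64         /* assumed: one bucket per log2 of value  */
#define MAX_COUNTERS 4096       /* assumed fixed bound, for the sketch    */

typedef long long gcov_type;

struct bucket
{
  unsigned num_counters;        /* counters that fell into this bucket    */
  gcov_type min_value;          /* smallest counter value in the bucket   */
  gcov_type cum_value;          /* sum of the counter values in bucket    */
};

/* Illustrative bucketing: index roughly by floor(log2(value)).  */
static unsigned
bucket_index (gcov_type value)
{
  unsigned i = 0;
  while (value > 1 && i < NUM_BUCKETS - 1)
    {
      value >>= 1;
      i++;
    }
  return i;
}

/* Expand HIST into one entry per counter, largest buckets first, giving
   each counter the bucket's min_value and an even 1/num_counters share
   of the bucket's cum_value.  Returns the number of counters written.  */
static unsigned
expand_histogram (const struct bucket *hist, gcov_type *value, double *cum)
{
  unsigned n = 0;
  int i;
  unsigned j;

  for (i = NUM_BUCKETS - 1; i >= 0; i--)
    for (j = 0; j < hist[i].num_counters && n < MAX_COUNTERS; j++)
      {
        value[n] = hist[i].min_value;
        cum[n] = (double) hist[i].cum_value / hist[i].num_counters;
        n++;
      }
  return n;
}

/* Merge SRC into DST: counters are matched up in the same relative order
   (largest first), matched min values are summed to pick the new bucket
   and the new min value, and the apportioned cumulative values are simply
   added.  */
static void
merge_histograms (struct bucket *dst, const struct bucket *src)
{
  static gcov_type val1[MAX_COUNTERS], val2[MAX_COUNTERS];
  static double cum1[MAX_COUNTERS], cum2[MAX_COUNTERS];
  unsigned n1 = expand_histogram (dst, val1, cum1);
  unsigned n2 = expand_histogram (src, val2, cum2);
  unsigned n = n1 > n2 ? n1 : n2;
  unsigned i;

  memset (dst, 0, NUM_BUCKETS * sizeof (struct bucket));
  for (i = 0; i < n; i++)
    {
      gcov_type v = (i < n1 ? val1[i] : 0) + (i < n2 ? val2[i] : 0);
      double c = (i < n1 ? cum1[i] : 0.0) + (i < n2 ? cum2[i] : 0.0);
      struct bucket *b = &dst[bucket_index (v)];

      if (b->num_counters == 0 || v < b->min_value)
        b->min_value = v;
      b->num_counters++;
      b->cum_value += (gcov_type) c;
    }
}

The merge order can only show up through the round-off in the apportioning
division, which is why I would also expect the differences between parallel
train runs to stay small.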
The other thing I have a patch for is to make LTO ipa-profile produce more
accurate histograms based on real instruction counts (here we can read in all
gcda files, turn them into CFG profiles, and do our own summary). I suppose
that once you merge in your patches I can update this and we will have two
mechanisms, so we will also have data on how your implementation diverges from
the precise solution.
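The idea of the precise summary is basically to weight every basic block count
by the number of instructions in the block, instead of entering each counter
into the histogram once. As a sketch (the data structures and the helper below
are invented only to illustrate it; the real patch works on the CFG profiles
read back from the gcda files):

/* Sketch only: compute one working-set cutoff from per-BB execution
   counts weighted by instruction counts.  The types and the helper
   below are invented for illustration, not the real ipa-profile code.  */

#include <stdlib.h>

typedef long long gcov_type;

struct bb_profile
{
  gcov_type count;              /* execution count from the CFG profile  */
  unsigned insns;               /* number of instructions in the block   */
};

/* Sort blocks by execution count, hottest first.  */
static int
cmp_count (const void *a, const void *b)
{
  const struct bb_profile *x = a, *y = b;
  return (x->count < y->count) - (x->count > y->count);
}

/* Return the minimal execution count such that blocks at least that hot
   cover CUTOFF (e.g. 0.999) of all dynamically executed instructions.  */
static gcov_type
working_set_min_count (struct bb_profile *bbs, size_t n, double cutoff)
{
  double total = 0, sum = 0;
  size_t i;

  for (i = 0; i < n; i++)
    total += (double) bbs[i].count * bbs[i].insns;

  qsort (bbs, n, sizeof *bbs, cmp_count);
  for (i = 0; i < n; i++)
    {
      sum += (double) bbs[i].count * bbs[i].insns;
      if (sum >= cutoff * total)
        return bbs[i].count;
    }
  return 0;
}

This gives the hotness cutoffs directly from the real instruction counts, so
we will have something to compare the histogram-based summary against.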
Thanks!
Honza

> >> >> >> > >> > 2) Do we plan to add some features in the near future that
> >> >> >> > >> >    will anyway require global locking?
> >> >> >> > >> >    I guess LIPO itself does not count since it streams its
> >> >> >> > >> >    data into an independent file as you mentioned earlier, and
> >> >> >> > >> >    locking the LIPO file is not that hard.
> >> >> >> > >> >    Does LIPO stream everything into that common file, or does
> >> >> >> > >> >    it use a combination of gcda files and a common summary?
> >> >> >> > >>
> >> >> >> > >> Actually, the LIPO module grouping information is stored in gcda
> >> >> >> > >> files. It is also stored in a separate .imports file (one per
> >> >> >> > >> object) --- this is primarily used by our build system for
> >> >> >> > >> dependence information.
> >> >> >> > >
> >> >> >> > > I see, getting LIPO safe WRT parallel updates will be fun. How
> >> >> >> > > does LIPO behave on GCC bootstrap?
> >> >> >> >
> >> >> >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not
> >> >> >> > the main problem for application build -- the link time (for debug
> >> >> >> > build) is.
> >> >> >>
> >> >> >> I was primarily curious how the LIPO runtime analysis fares in the
> >> >> >> situation where you do very many small train runs on a rather large
> >> >> >> app (sure, GCC is small compared to Google's use case ;).
> >> >> >> > >
> >> >> >> > > (i.e. it does a lot more work in the libgcov module per
> >> >> >> > > invocation, so I am curious if it is practically useful at all).
> >> >> >> > >
> >> >> >> > > With an LTO-based solution a lot can probably be pushed to link
> >> >> >> > > time? Before actual GCC starts from the linker plugin, the LIPO
> >> >> >> > > module can read gcov CFGs from gcda files and do all the
> >> >> >> > > merging/updating/CFG construction that is currently performed at
> >> >> >> > > runtime, right?
> >> >> >> >
> >> >> >> > The dynamic cgraph build and analysis is still done at runtime.
> >> >> >> > However, with the new implementation, the FE is no longer involved.
> >> >> >> > The gcc driver is modified to understand module grouping, and lto
> >> >> >> > is used to merge the streamed output from aux modules.
> >> >> >>
> >> >> >> I see. Are there any fundamental reasons why it cannot be done at
> >> >> >> link time when all gcda files are available? Why is the grouping not
> >> >> >> done inside the linker plugin?