> I have added FDO runs to the daily tramp3d tester and am observing
> "interesting" things there.  First of all, compile time with
> -fprofile-generate (w/o leafify) skyrocketed from ~120s to 440s.
> For reference, here's the hot spots in -ftime-report:
>
>  life analysis          :  24.66 ( 6%) usr   0.00 ( 0%) sys  24.52 ( 5%) wall   16086 kB ( 0%) ggc
>  integration            :  13.67 ( 3%) usr   0.05 ( 0%) sys  13.67 ( 3%) wall  806431 kB (23%) ggc
>  tree PTA               :  10.17 ( 2%) usr   0.10 ( 1%) sys  10.24 ( 2%) wall   20425 kB ( 1%) ggc
>  tree SSA incremental   :  19.58 ( 5%) usr   0.21 ( 2%) sys  20.28 ( 5%) wall   27383 kB ( 1%) ggc
>  tree operand scan      :  11.87 ( 3%) usr   4.51 (35%) sys  16.62 ( 4%) wall   94887 kB ( 3%) ggc
>  dominator optimization :  16.60 ( 4%) usr   0.06 ( 0%) sys  16.24 ( 4%) wall  210301 kB ( 6%) ggc
>  expand                 :  23.51 ( 5%) usr   0.10 ( 1%) sys  23.15 ( 5%) wall  310872 kB ( 9%) ggc
>  CSE                    :  52.40 (12%) usr   0.05 ( 0%) sys  52.44 (12%) wall   24796 kB ( 1%) ggc
>  loop analysis          :  20.06 ( 5%) usr   0.12 ( 1%) sys  20.23 ( 5%) wall   26703 kB ( 1%) ggc
>  CSE 2                  :  25.68 ( 6%) usr   0.01 ( 0%) sys  25.88 ( 6%) wall    1360 kB ( 0%) ggc
>  global alloc           :  14.93 ( 3%) usr   0.08 ( 1%) sys  14.86 ( 3%) wall   65979 kB ( 2%) ggc
>  reload CSE regs        :  16.20 ( 4%) usr   0.04 ( 0%) sys  16.56 ( 4%) wall   49571 kB ( 1%) ggc
>  rename registers       :  10.76 ( 2%) usr   0.03 ( 0%) sys  10.67 ( 2%) wall    6109 kB ( 0%) ggc
>  TOTAL                  : 434.71        12.95        448.78           3461889 kB
>
> Look at those CSE numbers!  (This is all with release checking only.)
>
> Second, runtime of the profile-generating binary rose by a factor of 50
> (this is just an -O2 compile, basically).
>
> Now, the interesting thing is that with -fprofile-use, compile time
> halved from the 120s to 62s.  Nice.  And the performance is exactly the
> same as a non-FDO (non-leafify) binary, which suggests that we can
> improve inlining heuristics wrt compile time without regressing in
> runtime performance.
>
> The profile-generating numbers suggest we're either doing something
> stupid, or that we want some heuristics applied to not instrument every
> edge, but only interesting ones.
I would not want to get into the business of having partially profiled
programs.  That makes the number of cases you have to think about bigger
and the results more dependent on heuristics.  So let's try to
concentrate on getting the costs down for the moment.  For most programs
I would believe that the current algorithm of instrumenting only
non-spanning-tree edges should work just well enough.

I actually believe a large part of the slowdown is the fact that we
understand little about the aliasing of the counters.  We assume them to
be call clobbered, aliased with each other and so on, as we do for
global variables.  If there was some easy way to tell the alias
machinery that these are well behaved, the SSA representation should
simplify a lot (a toy illustration of what the instrumented code looks
like is appended below).

As for CSE, we are exercising the same problem - your testcase has the
property that very many of the basic blocks from before inlining get
merged together, so we get large basic blocks with many increments of
different global vars, and because of the quadratic behaviour of our
aliasing information there we degrade.  While this can, in theory, be
solved the same way as for the SSA representation, by somehow bypassing
the aliasing info, perhaps more consistent with our plan would be to
throttle down the CSE tables.  Irritatingly, all CSE knows how to do is
throw away all its tables after a fixed number of instructions (set to
1000 right now), while what it really wants is something like throttling
down the number of entries in the hash table.  Adding some counters on
the number of entries and throttling them down to a lower number will
probably work (again, a rough sketch is appended below).  I can try to
look into it once I get out of the issues I am swamped in right now,
unless someone beats me to it ;)

Honza

>
> Richard.
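
PS: To make the aliasing point concrete, here is roughly the shape of
code that edge profiling hands to the optimizers.  This is only a toy
illustration; the array name, the function and the layout are invented
for the example and are not the real libgcov/instrumentation interface.

  /* Toy illustration, not the real instrumentation: one 64-bit counter
     per instrumented edge, bumped as the code runs.  edge_counters and
     foo are names invented for the example.  */

  #include <stdint.h>
  #include <stdio.h>

  static uint64_t edge_counters[4];  /* one counter per profiled edge */

  int global;                        /* some unrelated user global */

  static int
  foo (int x)
  {
    edge_counters[0]++;              /* function entry */
    if (x > 0)
      {
        edge_counters[1]++;          /* then edge */
        global += x;
      }
    else
      {
        edge_counters[2]++;          /* else edge */
        global -= x;
      }
    edge_counters[3]++;              /* join */
    return global;
  }

  int
  main (void)
  {
    printf ("%d\n", foo (5));
    return 0;
  }

Every increment is a load plus a store of an ordinary global, so as far
as the alias machinery knows it may conflict with the user's globals and
with every other counter, and both the SSA virtual operands and the CSE
tables pay for that on each instrumented edge.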
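
And the CSE throttling I have in mind, again only as a toy model - none
of this is actual cse.c code, the names and the value of the cap are
made up:

  /* Toy model of throttling the CSE hash tables by entry count instead
     of by a fixed instruction window.  All names and numbers invented.  */

  #include <stdio.h>

  #define MAX_TABLE_ENTRIES 500  /* tunable cap; presumably a --param */

  static int table_entries;      /* equivalences currently recorded */
  static int flushes;            /* how often we gave up and flushed */

  static void
  flush_tables (void)
  {
    /* Real CSE would discard its hash tables here; the toy just resets.  */
    table_entries = 0;
    flushes++;
  }

  static void
  record_equivalence (void)
  {
    if (table_entries >= MAX_TABLE_ENTRIES)
      flush_tables ();           /* throttle: the table grew too big */
    table_entries++;             /* record the new entry */
  }

  int
  main (void)
  {
    int i;

    /* Simulate one huge basic block full of counter increments.  */
    for (i = 0; i < 10000; i++)
      record_equivalence ();
    printf ("flushes: %d, live entries: %d\n", flushes, table_entries);
    return 0;
  }

The point is just that the trigger is the size of the table rather than
an instruction count, so one huge block full of counter increments gets
flushed a few times instead of dragging thousands of mutually aliased
entries along.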