I have added FDO runs to the daily tramp3d tester and am observing "interesting" things there. First of all, compile time with -fprofile-generate (without leafify) skyrocketed from ~120s to 440s. For reference, here are the hot spots in -ftime-report:

  life analysis          :  24.66 ( 6%) usr   0.00 ( 0%) sys  24.52 ( 5%) wall    16086 kB ( 0%) ggc
  integration            :  13.67 ( 3%) usr   0.05 ( 0%) sys  13.67 ( 3%) wall   806431 kB (23%) ggc
  tree PTA               :  10.17 ( 2%) usr   0.10 ( 1%) sys  10.24 ( 2%) wall    20425 kB ( 1%) ggc
  tree SSA incremental   :  19.58 ( 5%) usr   0.21 ( 2%) sys  20.28 ( 5%) wall    27383 kB ( 1%) ggc
  tree operand scan      :  11.87 ( 3%) usr   4.51 (35%) sys  16.62 ( 4%) wall    94887 kB ( 3%) ggc
  dominator optimization :  16.60 ( 4%) usr   0.06 ( 0%) sys  16.24 ( 4%) wall   210301 kB ( 6%) ggc
  expand                 :  23.51 ( 5%) usr   0.10 ( 1%) sys  23.15 ( 5%) wall   310872 kB ( 9%) ggc
  CSE                    :  52.40 (12%) usr   0.05 ( 0%) sys  52.44 (12%) wall    24796 kB ( 1%) ggc
  loop analysis          :  20.06 ( 5%) usr   0.12 ( 1%) sys  20.23 ( 5%) wall    26703 kB ( 1%) ggc
  CSE 2                  :  25.68 ( 6%) usr   0.01 ( 0%) sys  25.88 ( 6%) wall     1360 kB ( 0%) ggc
  global alloc           :  14.93 ( 3%) usr   0.08 ( 1%) sys  14.86 ( 3%) wall    65979 kB ( 2%) ggc
  reload CSE regs        :  16.20 ( 4%) usr   0.04 ( 0%) sys  16.56 ( 4%) wall    49571 kB ( 1%) ggc
  rename registers       :  10.76 ( 2%) usr   0.03 ( 0%) sys  10.67 ( 2%) wall     6109 kB ( 0%) ggc
  TOTAL                  : 434.71             12.95             448.78             3461889 kB

Look at those CSE numbers! Together, CSE and CSE 2 account for about 18% of the total user time. (This is all with release checking only.)

Second, the runtime of the profile-generating binary rose by a factor of 50 (this is basically just an -O2 compile).

Now, the interesting thing is that with -fprofile-use, compile time halved from 120s to 62s. Nice. And the performance is exactly the same as that of a non-FDO (non-leafify) binary, which suggests that we can improve the inlining heuristics with respect to compile time without regressing runtime performance.

The profile-generating numbers suggest that either we're doing something stupid, or we want some heuristics that instrument only the interesting edges rather than every edge.

Richard.