> 
> I have added FDO runs to the daily tramp3d tester and am observing
> "interesting" things there.  First of all, compile time with
> -fprofile-generate (w/o leafify) skyrocketed from ~120s to 440s.
> For reference, here's the hot spots in -ftime-report:
> 
>  life analysis         :  24.66 ( 6%) usr   0.00 ( 0%) sys  24.52 ( 5%) wall   16086 kB ( 0%) ggc
>  integration           :  13.67 ( 3%) usr   0.05 ( 0%) sys  13.67 ( 3%) wall  806431 kB (23%) ggc
>  tree PTA              :  10.17 ( 2%) usr   0.10 ( 1%) sys  10.24 ( 2%) wall   20425 kB ( 1%) ggc
>  tree SSA incremental  :  19.58 ( 5%) usr   0.21 ( 2%) sys  20.28 ( 5%) wall   27383 kB ( 1%) ggc
>  tree operand scan     :  11.87 ( 3%) usr   4.51 (35%) sys  16.62 ( 4%) wall   94887 kB ( 3%) ggc
>  dominator optimization:  16.60 ( 4%) usr   0.06 ( 0%) sys  16.24 ( 4%) wall  210301 kB ( 6%) ggc
>  expand                :  23.51 ( 5%) usr   0.10 ( 1%) sys  23.15 ( 5%) wall  310872 kB ( 9%) ggc
>  CSE                   :  52.40 (12%) usr   0.05 ( 0%) sys  52.44 (12%) wall   24796 kB ( 1%) ggc
>  loop analysis         :  20.06 ( 5%) usr   0.12 ( 1%) sys  20.23 ( 5%) wall   26703 kB ( 1%) ggc
>  CSE 2                 :  25.68 ( 6%) usr   0.01 ( 0%) sys  25.88 ( 6%) wall    1360 kB ( 0%) ggc
>  global alloc          :  14.93 ( 3%) usr   0.08 ( 1%) sys  14.86 ( 3%) wall   65979 kB ( 2%) ggc
>  reload CSE regs       :  16.20 ( 4%) usr   0.04 ( 0%) sys  16.56 ( 4%) wall   49571 kB ( 1%) ggc
>  rename registers      :  10.76 ( 2%) usr   0.03 ( 0%) sys  10.67 ( 2%) wall    6109 kB ( 0%) ggc
>  TOTAL                 : 434.71            12.95           448.78            3461889 kB
> 
> Look at those CSE numbers!  (This is all with release checking only.)
> 
> Second, the runtime of the profile-generating binary rose by a factor
> of 50 (this is just an -O2 compile, basically).
> 
> Now, the interesting thing is that with -fprofile-use, compile time
> halved from 120s to 62s.  Nice.  And the performance is exactly
> the same as a non-FDO (non-leafify) binary, which suggests that we
> can improve inlining heuristics wrt compile time without regressing
> in runtime performance.
> 
> The profile generating numbers suggest we're either doing something
> stupid, or that we want some heuristics applied to not instrument
> every edge, but only interesting ones.

I would not want to get into the business of having partially profiled
programs.  This increases the number of cases you have to think about and
makes results more dependent on heuristics.  So let's try to concentrate
on getting the costs down for the moment.  For most programs I would
believe that the current algorithm, which instruments only the
non-spanning-tree edges, should work well enough.
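
To make the spanning-tree idea concrete, here is a minimal sketch in Python.
It is not GCC's profiling code: the toy CFG, the function names and the
virtual exit->entry edge are assumptions made for illustration.  Counters
are placed only on edges outside a spanning tree of the (undirected) CFG,
and the remaining counts are recovered from flow conservation.

```python
def spanning_tree(nodes, edges):
    """Pick a spanning tree of the CFG, viewed as undirected (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    tree = set()
    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:            # edge joins two components: keep it in the tree
            parent[a] = b
            tree.add((src, dst))
    return tree


def recover_counts(nodes, edges, tree, measured):
    """Derive tree-edge counts from measured ones: flow in == flow out."""
    counts = dict(measured)
    unknown = set(tree)
    while unknown:
        progress = False
        for n in nodes:
            ins = [e for e in edges if e[1] == n]
            outs = [e for e in edges if e[0] == n]
            missing = [e for e in ins + outs if e in unknown]
            if len(missing) != 1:
                continue
            e = missing[0]
            in_sum = sum(counts[x] for x in ins if x != e)
            out_sum = sum(counts[x] for x in outs if x != e)
            counts[e] = out_sum - in_sum if e in ins else in_sum - out_sum
            unknown.discard(e)
            progress = True
        if not progress:
            raise ValueError("counts are not recoverable")
    return counts


# Hypothetical CFG: a loop around an if/else, plus a virtual exit->entry edge
# so that flow conservation also holds at the entry and exit nodes.
NODES = ["entry", "A", "B", "C", "D", "exit"]
EDGES = [("exit", "entry"), ("entry", "A"), ("A", "B"), ("A", "C"),
         ("B", "D"), ("C", "D"), ("D", "A"), ("D", "exit")]
TRUE_COUNTS = {("exit", "entry"): 1, ("entry", "A"): 1, ("A", "B"): 7,
               ("A", "C"): 3, ("B", "D"): 7, ("C", "D"): 3,
               ("D", "A"): 9, ("D", "exit"): 1}

TREE = spanning_tree(NODES, EDGES)
INSTRUMENTED = [e for e in EDGES if e not in TREE]
MEASURED = {e: TRUE_COUNTS[e] for e in INSTRUMENTED}  # what the counters see
RECOVERED = recover_counts(NODES, EDGES, TREE, MEASURED)
```

In this toy graph only three of the eight edges carry a counter, yet every
edge count is recovered.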

I actually believe a large part of the slowdown is the fact that we
understand little about the aliasing of counters.  We assume them to be
call-clobbered, aliased with each other and so on, as we do for global
variables.  If there were some easy way to tell the alias analysis that
these are well behaved, the SSA representation should simplify a lot.

As for CSE, we are hitting the same problem: your testcase has the
property that very many of the basic blocks that exist before inlining
get merged together, so we get large basic blocks with many increments of
different global variables, and because of the quadratic behaviour of our
aliasing information there, we degrade.

While this can, in theory, be solved in a similar way as for the SSA
representation, by somehow bypassing the aliasing info, perhaps more
consistently with our plan we can throttle down the CSE tables.
Irritatingly, all CSE knows how to do is throw away all tables after a
fixed number of instructions, set to 1000 right now, while it really
wants something like throttling down the number of entries in the
hashtable.  Adding some counters on the number of entries and throttling
them down to a lower number will probably work.  I can try to look into
it once I get out of the issues I am swamped with right now, unless
someone beats me to it ;)
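
The difference between the two policies can be sketched with a toy model in
Python.  This is not the code in GCC's CSE pass: the class names, the
(expression, register) instruction format and the least-recently-used
eviction are all assumptions chosen for illustration.  One table is flushed
completely after a fixed number of instructions, the other caps the number
of live entries instead.

```python
from collections import OrderedDict


class FlushingTable:
    """Throw away every entry once max_insns instructions have been seen."""

    def __init__(self, max_insns):
        self.max_insns = max_insns
        self.seen = 0
        self.table = {}

    def lookup_or_insert(self, expr, reg):
        self.seen += 1
        if self.seen > self.max_insns:
            self.table.clear()       # all accumulated knowledge is lost
            self.seen = 1
        return self.table.setdefault(expr, reg)


class BoundedTable:
    """Keep at most max_entries; evict the least recently used one."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.table = OrderedDict()

    def lookup_or_insert(self, expr, reg):
        if expr in self.table:
            self.table.move_to_end(expr)     # mark as recently used
            return self.table[expr]
        if len(self.table) >= self.max_entries:
            self.table.popitem(last=False)   # drop only the oldest entry
        self.table[expr] = reg
        return reg


def count_hits(table, insns):
    """Count instructions whose expression was already available."""
    hits = 0
    for expr, reg in insns:
        if table.lookup_or_insert(expr, reg) != reg:
            hits += 1
    return hits


# Hypothetical instruction stream: "a+b" is recomputed twice.
INSNS = [("a+b", "r1"), ("c+d", "r2"), ("a+b", "r3"),
         ("e+f", "r4"), ("a+b", "r5")]
```

With both limits set to 2, `count_hits(FlushingTable(2), INSNS)` finds none
of the repeated `a+b` computations, while `count_hits(BoundedTable(2), INSNS)`
finds both, with table size bounded in either case.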

Honza
> 
> Richard.
