> 
> After a few days of measurement and tuning, I was able to get the numbers
> into the following shape:
> Execution times (seconds)
>  phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1412 kB ( 0%) ggc
>  phase opt and generate  :  27.83 (59%) usr   0.66 (19%) sys  28.52 (37%) wall 1028813 kB (24%) ggc
>  phase stream in         :  16.90 (36%) usr   0.63 (18%) sys  17.60 (23%) wall 3246453 kB (76%) ggc
>  phase stream out        :   2.76 ( 6%) usr   2.19 (63%) sys  31.34 (40%) wall       2 kB ( 0%) ggc
>  callgraph optimization  :   0.36 ( 1%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall      40 kB ( 0%) ggc
>  ipa dead code removal   :   3.31 ( 7%) usr   0.01 ( 0%) sys   3.25 ( 4%) wall       0 kB ( 0%) ggc
>  ipa virtual call target :   3.69 ( 8%) usr   0.03 ( 1%) sys   3.80 ( 5%) wall      21 kB ( 0%) ggc
>  ipa devirtualization    :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall   13704 kB ( 0%) ggc
>  ipa cp                  :   1.11 ( 2%) usr   0.07 ( 2%) sys   1.17 ( 2%) wall  188558 kB ( 4%) ggc
>  ipa inlining heuristics :   8.17 (17%) usr   0.14 ( 4%) sys   8.27 (11%) wall  494738 kB (12%) ggc
>  ipa comdats             :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
>  ipa lto gimple in       :   1.86 ( 4%) usr   0.40 (11%) sys   2.20 ( 3%) wall  537970 kB (13%) ggc
>  ipa lto gimple out      :   0.19 ( 0%) usr   0.08 ( 2%) sys   0.27 ( 0%) wall       2 kB ( 0%) ggc
>  ipa lto decl in         :  12.20 (26%) usr   0.37 (11%) sys  12.64 (16%) wall 2441687 kB (57%) ggc
>  ipa lto decl out        :   2.51 ( 5%) usr   0.21 ( 6%) sys   2.71 ( 3%) wall       0 kB ( 0%) ggc
>  ipa lto constructors in :   0.13 ( 0%) usr   0.02 ( 1%) sys   0.17 ( 0%) wall   15692 kB ( 0%) ggc
>  ipa lto constructors out:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
>  ipa lto cgraph I/O      :   0.54 ( 1%) usr   0.09 ( 3%) sys   0.63 ( 1%) wall  407182 kB (10%) ggc
>  ipa lto decl merge      :   1.34 ( 3%) usr   0.00 ( 0%) sys   1.34 ( 2%) wall    8220 kB ( 0%) ggc
>  ipa lto cgraph merge    :   1.00 ( 2%) usr   0.00 ( 0%) sys   1.00 ( 1%) wall   14605 kB ( 0%) ggc
>  whopr wpa               :   0.92 ( 2%) usr   0.00 ( 0%) sys   0.89 ( 1%) wall       1 kB ( 0%) ggc
>  whopr wpa I/O           :   0.01 ( 0%) usr   1.90 (55%) sys  28.31 (37%) wall       0 kB ( 0%) ggc
>  whopr partitioning      :   2.81 ( 6%) usr   0.01 ( 0%) sys   2.83 ( 4%) wall    4943 kB ( 0%) ggc
>  ipa reference           :   1.34 ( 3%) usr   0.00 ( 0%) sys   1.35 ( 2%) wall       0 kB ( 0%) ggc
>  ipa profile             :   0.20 ( 0%) usr   0.01 ( 0%) sys   0.21 ( 0%) wall       0 kB ( 0%) ggc
>  ipa pure const          :   1.62 ( 3%) usr   0.00 ( 0%) sys   1.63 ( 2%) wall       0 kB ( 0%) ggc
>  ipa icf                 :   2.65 ( 6%) usr   0.02 ( 1%) sys   2.68 ( 3%) wall    1352 kB ( 0%) ggc
>  inline parameters       :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
>  tree SSA rewrite        :   0.11 ( 0%) usr   0.01 ( 0%) sys   0.08 ( 0%) wall   18919 kB ( 0%) ggc
>  tree SSA other          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
>  tree SSA incremental    :   0.24 ( 1%) usr   0.01 ( 0%) sys   0.32 ( 0%) wall   11325 kB ( 0%) ggc
>  tree operand scan       :   0.15 ( 0%) usr   0.02 ( 1%) sys   0.18 ( 0%) wall  116283 kB ( 3%) ggc
>  dominance frontiers     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
>  dominance computation   :   0.13 ( 0%) usr   0.01 ( 0%) sys   0.16 ( 0%) wall       0 kB ( 0%) ggc
>  varconst                :   0.01 ( 0%) usr   0.02 ( 1%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
>  loop fini               :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
>  unaccounted todo        :   0.55 ( 1%) usr   0.00 ( 0%) sys   0.56 ( 1%) wall       0 kB ( 0%) ggc
>  TOTAL                 :  47.49             3.48            77.46            4276682 kB
> 
> and I was able to reduce function bodies loaded in WPA to 35% (from previous 
> 55%). The main problem

Does 35% mean that 35% of all function bodies are compared with something else?
That feels pretty high, but the overall numbers are not so terrible.
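
To make the question concrete, here is a rough sketch of how such a percentage
could come about. It assumes (this is my assumption for illustration, not a
description of the actual patch) that functions are first grouped by a cheap
hash and that a body only needs to be loaded when at least one other function
has the same hash. The name count_candidates is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: given one cheap hash per function, count the
// functions whose hash is shared with at least one other function.
// Under the stated assumption, only those need their bodies loaded
// for a detailed comparison.
std::size_t
count_candidates (const std::vector<unsigned> &hashes)
{
  // Number of functions that fall into each hash bucket.
  std::unordered_map<unsigned, std::size_t> bucket_size;
  for (unsigned h : hashes)
    ++bucket_size[h];

  // A bucket with a single member has nothing to be compared with.
  std::size_t candidates = 0;
  for (const auto &kv : bucket_size)
    if (kv.second > 1)
      candidates += kv.second;
  return candidates;
}
```

With the hashes {10, 20, 10, 30, 20, 10, 40}, five of the seven functions share
a hash with another one, so in this toy case about 71% of bodies would be
loaded. A weak hash inflates that number, which is why 35% looks tunable.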

> with speed was hidden in the work list for congruence classes, where hash_set
> was used. I chose that data structure to support the delete operation, but it
> was really slow. Thus, hash_set was replaced with a linked list, and a flag is
> used to identify whether a set has been removed or not.

Interesting, I would not have expected a bottleneck in congruence solving :)
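
For readers following along, the linked-list-plus-flag idea can be sketched
roughly as below. This is only a minimal standalone illustration using the
standard library; congruence_class, lazy_worklist and drain_example are
hypothetical names, not the types or functions used in the patch.

```cpp
#include <cassert>
#include <list>
#include <vector>

// Hypothetical stand-in for a congruence class, not GCC's actual type.
struct congruence_class
{
  explicit congruence_class (int i) : id (i), in_worklist (false) {}
  int id;
  bool in_worklist;   // true while the class is logically queued
};

// Worklist with lazy deletion: removing a class only clears its flag,
// and stale list entries are skipped when they reach the front.
class lazy_worklist
{
public:
  void push (congruence_class *cls)
  {
    if (cls->in_worklist)
      return;                     // already queued, avoid a duplicate
    cls->in_worklist = true;
    m_items.push_back (cls);
  }

  // O(1) logical removal; the list node itself is left in place.
  void remove (congruence_class *cls) { cls->in_worklist = false; }

  // Return the next live class, or nullptr once nothing is left.
  congruence_class *pop ()
  {
    while (!m_items.empty ())
      {
        congruence_class *cls = m_items.front ();
        m_items.pop_front ();
        if (cls->in_worklist)
          {
            cls->in_worklist = false;
            return cls;
          }
      }
    return nullptr;
  }

private:
  std::list<congruence_class *> m_items;
};

// Queue three classes, remove the middle one, and collect the ids
// that are actually handed out for processing.
std::vector<int>
drain_example ()
{
  congruence_class a (1), b (2), c (3);
  lazy_worklist wl;
  wl.push (&a);
  wl.push (&b);
  wl.push (&c);
  wl.remove (&b);

  std::vector<int> processed;
  while (congruence_class *cls = wl.pop ())
    processed.push_back (cls->id);
  return processed;
}
```

The trade-off is that removed entries stay in the list until they are popped,
so memory is released a little later in exchange for constant-time removal.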
> 
> I have no clue how complicated it can be to turn the release_body function
> into an operation that really releases the memory?

I suppose one can keep the caches from the streamer and free the trees that
were read.  Freeing the gimple statements and the CFG should be relatively easy.

Let's, however, first try to tune the implementation rather than trying to get
this hack implemented.  Explicit ggc_free calls have traditionally drawn
negative reactions because of memory fragmentation concerns.

> 
> Markus' problem with -fprofile-use has been removed; IPA-ICF now precedes the
> devirtualization pass. I hope that is fine?

Yes, I think devirtualization should actually work better with identical
virtual methods merged.  We just need to be sure it sees through the newly
introduced aliases (there should be no thunks for virtual methods).

Honza
