On 09/27/2014 07:59 AM, Markus Trippelsdorf wrote:
> On 2014.09.27 at 01:27 +0200, Jan Hubicka wrote:
>>> While a plain Firefox -flto build works fine. LTO/PGO build fails with:
>>>
>>> lto1: internal compiler error: in ipa_merge_profiles, at ipa-utils.c:540
>>> 0x7d6165 ipa_merge_profiles(cgraph_node*, cgraph_node*)
>>>         ../../gcc/gcc/ipa-utils.c:540
>>> 0xf10c41 ipa_icf::sem_function::merge(ipa_icf::sem_item*)
>>>         ../../gcc/gcc/ipa-icf.c:753
>>> 0xf15206 ipa_icf::sem_item_optimizer::merge_classes(unsigned int)
>>>         ../../gcc/gcc/ipa-icf.c:2706
>>> 0xf1c1f4 ipa_icf::sem_item_optimizer::execute()
>>>         ../../gcc/gcc/ipa-icf.c:2098
>>> 0xf1d3f1 ipa_icf_driver
>>>         ../../gcc/gcc/ipa-icf.c:2784
>>> 0xf1d3f1 ipa_icf::pass_ipa_icf::execute(function*)
>>>         ../../gcc/gcc/ipa-icf.c:2831
>>>
>>>
>>> The pass is also very memory hungry (from 3GB without ICF to 4GB during
>>> libxul link), while the code size savings are in the 1% range.
>>
>> Thnks for checking. I was just thinking about doing that myself. Would
>> you mind posting -ftime-report of firefox WPA stage?
>
> (without ICF)
> Execution times (seconds)
>  phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1412 kB ( 0%) ggc
>  phase opt and generate  :  58.38 (63%) usr   2.00 (47%) sys  60.37 (40%) wall  403069 kB (12%) ggc
>  phase stream in         :  30.24 (33%) usr   0.97 (23%) sys  33.90 (22%) wall 2944210 kB (88%) ggc
>  phase stream out        :   4.29 ( 5%) usr   1.32 (31%) sys  57.32 (38%) wall       0 kB ( 0%) ggc
>  phase finalize          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall       0 kB ( 0%) ggc
>  garbage collection      :   3.68 ( 4%) usr   0.00 ( 0%) sys   3.68 ( 2%) wall       0 kB ( 0%) ggc
>  callgraph optimization  :   0.50 ( 1%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall     166 kB ( 0%) ggc
>  ipa dead code removal   :   6.91 ( 7%) usr   0.08 ( 2%) sys   7.25 ( 5%) wall       0 kB ( 0%) ggc
>  ipa virtual call target :   7.08 ( 8%) usr   0.04 ( 1%) sys   6.93 ( 5%) wall       0 kB ( 0%) ggc
>  ipa devirtualization    :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.27 ( 0%) wall   10365 kB ( 0%) ggc
>  ipa cp                  :   1.81 ( 2%) usr   0.06 ( 1%) sys   3.40 ( 2%) wall  173701 kB ( 5%) ggc
>  ipa inlining heuristics :  16.60 (18%) usr   0.27 ( 6%) sys  17.48 (12%) wall  532704 kB (16%) ggc
>  ipa comdats             :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
>  ipa lto gimple out      :   0.21 ( 0%) usr   0.04 ( 1%) sys   0.97 ( 1%) wall       0 kB ( 0%) ggc
>  ipa lto decl in         :  18.29 (20%) usr   0.54 (13%) sys  18.96 (12%) wall 2226088 kB (66%) ggc
>  ipa lto decl out        :   3.93 ( 4%) usr   0.13 ( 3%) sys   4.06 ( 3%) wall       0 kB ( 0%) ggc
>  ipa lto constructors in :   0.24 ( 0%) usr   0.03 ( 1%) sys   0.59 ( 0%) wall   14226 kB ( 0%) ggc
>  ipa lto constructors out:   0.08 ( 0%) usr   0.04 ( 1%) sys   0.15 ( 0%) wall       0 kB ( 0%) ggc
>  ipa lto cgraph I/O      :   0.89 ( 1%) usr   0.12 ( 3%) sys   1.02 ( 1%) wall  364151 kB (11%) ggc
>  ipa lto decl merge      :   2.14 ( 2%) usr   0.01 ( 0%) sys   2.14 ( 1%) wall    8196 kB ( 0%) ggc
>  ipa lto cgraph merge    :   1.59 ( 2%) usr   0.00 ( 0%) sys   1.60 ( 1%) wall   12716 kB ( 0%) ggc
>  whopr wpa               :   1.54 ( 2%) usr   0.03 ( 1%) sys   1.55 ( 1%) wall       1 kB ( 0%) ggc
>  whopr wpa I/O           :   0.04 ( 0%) usr   1.11 (26%) sys  52.10 (34%) wall       0 kB ( 0%) ggc
>  whopr partitioning      :   5.02 ( 5%) usr   0.01 ( 0%) sys   5.03 ( 3%) wall    4938 kB ( 0%) ggc
>  ipa reference           :   2.04 ( 2%) usr   0.02 ( 0%) sys   2.08 ( 1%) wall       0 kB ( 0%) ggc
>  ipa profile             :   0.32 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall       0 kB ( 0%) ggc
>  ipa pure const          :   2.43 ( 3%) usr   0.02 ( 0%) sys   2.49 ( 2%) wall       0 kB ( 0%) ggc
>  tree STMT verifier      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
>  callgraph verifier      :  16.31 (18%) usr   1.69 (39%) sys  17.96 (12%) wall       0 kB ( 0%) ggc
>  dominance computation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
>  varconst                :   0.01 ( 0%) usr   0.03 ( 1%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
>  unaccounted todo        :   0.69 ( 1%) usr   0.00 ( 0%) sys   0.69 ( 0%) wall       0 kB ( 0%) ggc
>  TOTAL                   :  92.91             4.29           151.73            3348693 kB
> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --enable-checking=release to disable checks.
>
> (with ICF)
> Execution times (seconds)
>  phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1412 kB ( 0%) ggc
>  phase opt and generate  :  82.70 (70%) usr   3.31 (53%) sys  86.17 (45%) wall 1468975 kB (33%) ggc
>  phase stream in         :  30.46 (26%) usr   1.02 (16%) sys  31.48 (16%) wall 2944210 kB (67%) ggc
>  phase stream out        :   4.52 ( 4%) usr   1.90 (30%) sys  73.47 (38%) wall      12 kB ( 0%) ggc
>  phase finalize          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
>  garbage collection      :   7.01 ( 6%) usr   0.00 ( 0%) sys   6.99 ( 4%) wall       0 kB ( 0%) ggc
>  callgraph optimization  :   0.49 ( 0%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall     166 kB ( 0%) ggc
>  ipa dead code removal   :   6.98 ( 6%) usr   0.13 ( 2%) sys   6.89 ( 4%) wall       0 kB ( 0%) ggc
>  ipa virtual call target :   6.93 ( 6%) usr   0.03 ( 0%) sys   7.20 ( 4%) wall       6 kB ( 0%) ggc
>  ipa devirtualization    :   0.27 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall   10365 kB ( 0%) ggc
>  ipa cp                  :   1.87 ( 2%) usr   0.11 ( 2%) sys   2.00 ( 1%) wall  167204 kB ( 4%) ggc
>  ipa inlining heuristics :  17.15 (15%) usr   0.21 ( 3%) sys  17.35 ( 9%) wall  512636 kB (12%) ggc
>  ipa comdats             :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
>  ipa lto gimple in       :   5.17 ( 4%) usr   1.04 (17%) sys   6.51 ( 3%) wall  855058 kB (19%) ggc
>  ipa lto gimple out      :   0.38 ( 0%) usr   0.08 ( 1%) sys   3.07 ( 2%) wall      12 kB ( 0%) ggc
>  ipa lto decl in         :  18.38 (16%) usr   0.56 ( 9%) sys  18.95 (10%) wall 2226088 kB (50%) ggc
>  ipa lto decl out        :   3.95 ( 3%) usr   0.08 ( 1%) sys   4.03 ( 2%) wall       0 kB ( 0%) ggc
>  ipa lto constructors in :   0.29 ( 0%) usr   0.01 ( 0%) sys   0.29 ( 0%) wall   14389 kB ( 0%) ggc
>  ipa lto constructors out:   0.10 ( 0%) usr   0.03 ( 0%) sys   0.58 ( 0%) wall       0 kB ( 0%) ggc
>  ipa lto cgraph I/O      :   0.91 ( 1%) usr   0.10 ( 2%) sys   1.02 ( 1%) wall  364151 kB ( 8%) ggc
>  ipa lto decl merge      :   2.14 ( 2%) usr   0.00 ( 0%) sys   2.14 ( 1%) wall    8196 kB ( 0%) ggc
>  ipa lto cgraph merge    :   1.65 ( 1%) usr   0.01 ( 0%) sys   1.66 ( 1%) wall   12716 kB ( 0%) ggc
>  whopr wpa               :   1.81 ( 2%) usr   0.01 ( 0%) sys   1.85 ( 1%) wall       1 kB ( 0%) ggc
>  whopr wpa I/O           :   0.05 ( 0%) usr   1.71 (27%) sys  65.75 (34%) wall       0 kB ( 0%) ggc
>  whopr partitioning      :   5.05 ( 4%) usr   0.00 ( 0%) sys   5.06 ( 3%) wall    5012 kB ( 0%) ggc
>  ipa reference           :   2.13 ( 2%) usr   0.03 ( 0%) sys   2.16 ( 1%) wall       0 kB ( 0%) ggc
>  ipa profile             :   0.32 ( 0%) usr   0.01 ( 0%) sys   0.33 ( 0%) wall       0 kB ( 0%) ggc
>  ipa pure const          :   2.57 ( 2%) usr   0.00 ( 0%) sys   2.56 ( 1%) wall       0 kB ( 0%) ggc
>  ipa icf                 :   6.88 ( 6%) usr   0.08 ( 1%) sys   7.01 ( 4%) wall     855 kB ( 0%) ggc
>  tree SSA rewrite        :   0.23 ( 0%) usr   0.06 ( 1%) sys   0.28 ( 0%) wall   33946 kB ( 1%) ggc
>  tree SSA incremental    :   0.42 ( 0%) usr   0.05 ( 1%) sys   0.53 ( 0%) wall   21099 kB ( 0%) ggc
>  tree operand scan       :   0.47 ( 0%) usr   0.08 ( 1%) sys   0.34 ( 0%) wall  181275 kB ( 4%) ggc
>  tree STMT verifier      :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
>  callgraph verifier      :  22.76 (19%) usr   1.68 (27%) sys  24.44 (13%) wall       0 kB ( 0%) ggc
>  dominance frontiers     :   0.02 ( 0%) usr   0.01 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
>  dominance computation   :   0.19 ( 0%) usr   0.05 ( 1%) sys   0.25 ( 0%) wall       0 kB ( 0%) ggc
>  varconst                :   0.04 ( 0%) usr   0.01 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
>  loop fini               :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
>  unaccounted todo        :   0.82 ( 1%) usr   0.00 ( 0%) sys   0.81 ( 0%) wall       0 kB ( 0%) ggc
>  TOTAL                   : 117.68             6.23           191.15            4414612 kB
> Extra diagnostic checks enabled; compiler may run slowly.
> Configure with --enable-checking=release to disable checks.
>
>> It seems that in this case we reject too many of equality candidates?
>> It think the original numbers was about 4-5% but later some equivalences was
>> disabled because of devirt/aliasing issues. Do you compare it with gold ICF
>> enabled? There are quite few obvious improvements to the analysis that can
>> be done, but I guess we need to analyze the interesting cases one by one.
>
> Gold ICF was enabled (-Wl,--icf=all,--icf-iterations=3).
>
Hi.

Thank you, Markus, for presenting the numbers; they correspond with what I
measured. If I read them correctly, the IPA ICF pass itself takes about
7 seconds. The rest of the slowdown is spread between the callgraph verifier
(not interesting for a release build of the compiler) and 'phase opt and
generate'. Any idea what could account for the difference?

Martin
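
One rough way to see where the extra time goes is to diff the two reports pass by pass. The following is only a minimal Python sketch: the regular expression and the helper names (`parse_report`, `usr_deltas`) are my own assumptions and not part of GCC, and the input is just a handful of rows copied from the two reports quoted above. A pass that shows up in only one report (for example 'ipa icf' or 'ipa lto gimple in') is counted from zero.

```python
import re

# Matches one timing row of a -ftime-report dump; the TOTAL row has no
# percentages and is therefore skipped. (Hypothetical helper, not GCC code.)
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*(?P<usr>[\d.]+)\s+\(\s*\d+%\) usr"
    r"\s+(?P<sys>[\d.]+)\s+\(\s*\d+%\) sys"
    r"\s+(?P<wall>[\d.]+)\s+\(\s*\d+%\) wall"
)

# A few rows copied from the quoted reports (not the full dumps).
WITHOUT_ICF = """\
 garbage collection      :   3.68 ( 4%) usr   0.00 ( 0%) sys   3.68 ( 2%) wall       0 kB ( 0%) ggc
 ipa inlining heuristics :  16.60 (18%) usr   0.27 ( 6%) sys  17.48 (12%) wall  532704 kB (16%) ggc
 callgraph verifier      :  16.31 (18%) usr   1.69 (39%) sys  17.96 (12%) wall       0 kB ( 0%) ggc
"""

WITH_ICF = """\
 garbage collection      :   7.01 ( 6%) usr   0.00 ( 0%) sys   6.99 ( 4%) wall       0 kB ( 0%) ggc
 ipa inlining heuristics :  17.15 (15%) usr   0.21 ( 3%) sys  17.35 ( 9%) wall  512636 kB (12%) ggc
 ipa lto gimple in       :   5.17 ( 4%) usr   1.04 (17%) sys   6.51 ( 3%) wall  855058 kB (19%) ggc
 ipa icf                 :   6.88 ( 6%) usr   0.08 ( 1%) sys   7.01 ( 4%) wall     855 kB ( 0%) ggc
 callgraph verifier      :  22.76 (19%) usr   1.68 (27%) sys  24.44 (13%) wall       0 kB ( 0%) ggc
"""


def parse_report(text):
    """Map pass name -> (usr, sys, wall) seconds for every timing row."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m["name"]] = (float(m["usr"]), float(m["sys"]), float(m["wall"]))
    return rows


def usr_deltas(before, after):
    """Per-pass change in usr time, largest increase first."""
    zero = (0.0, 0.0, 0.0)
    deltas = [
        (name, round(after.get(name, zero)[0] - before.get(name, zero)[0], 2))
        for name in set(before) | set(after)
    ]
    # Sort by descending delta; break ties by name for a stable order.
    return sorted(deltas, key=lambda d: (-d[1], d[0]))


for name, delta in usr_deltas(parse_report(WITHOUT_ICF), parse_report(WITH_ICF)):
    print(f"{name:<24} {delta:+.2f} s usr")
```

Feeding it the complete dumps instead of the sample rows would list every pass; note that the 'phase ...' rows are aggregates that overlap the per-pass rows, so they should not be added to them.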