Richard Biener <rguent...@suse.de> writes: > On October 18, 2018 11:05:32 PM GMT+02:00, Richard Sandiford > <richard.sandif...@arm.com> wrote: >>Richard Biener <rguent...@suse.de> writes: >>> On Thu, 18 Oct 2018, Richard Sandiford wrote: >>> >>>> Richard Biener <rguent...@suse.de> writes: >>>> > PR63155 made me pick up this old work from Steven, it turns our >>>> > linked-list implementation to a two-mode one with one being a >>>> > splay tree featuring O(log N) complexity for find/remove. >>>> > >>>> > Over Stevens original patch I added a bitmap_tree_to_vec helper >>>> > that I use from the debug/print methods to avoid changing view >>>> > there. In theory the bitmap iterator could get a "stack" >>>> > as well and we could at least support EXECUTE_IF_SET_IN_BITMAP. >>>> > >>>> > This can be used to fix the two biggest bottlenecks in the PRs >>>> > testcase, namely SSA propagator worklist handling and out-of-SSA >>>> > coalesce list building. perf shows the following data, first >>>> > unpatched, second patched - also watch the thrid coulumn (samples) >>>> > when comparing percentages. >>>> > >>>> > -O0 >>>> > - 18.19% 17.35% 407 cc1 cc1 [.] >>bitmap_set_b▒ >>>> > - bitmap_set_bit >> ▒ >>>> > + 8.77% create_coalesce_list_for_region >> ▒ >>>> > + 4.21% calculate_live_ranges >> ▒ >>>> > + 2.02% build_ssa_conflict_graph >> ▒ >>>> > + 1.66% insert_phi_nodes_for >> ▒ >>>> > + 0.86% coalesce_ssa_name >>>> > patched: >>>> > - 12.39% 10.48% 129 cc1 cc1 [.] >>bitmap_set_b▒ >>>> > - bitmap_set_bit >> ▒ >>>> > + 5.27% calculate_live_ranges >> ▒ >>>> > + 2.76% insert_phi_nodes_for >> ▒ >>>> > + 1.90% create_coalesce_list_for_region >> ▒ >>>> > + 1.63% build_ssa_conflict_graph >> ▒ >>>> > + 0.35% coalesce_ssa_name >>>> > >>>> > -O1 >>>> > - 17.53% 17.53% 842 cc1 cc1 [.] >>bitmap_set_b▒ >>>> > - bitmap_set_bit >> ▒ >>>> > + 12.39% add_ssa_edge >> ▒ >>>> > + 1.48% create_coalesce_list_for_region >> ▒ >>>> > + 0.82% solve_constraints >> ▒ >>>> > + 0.71% calculate_live_ranges >> ▒ >>>> > + 0.64% add_implicit_graph_edge >> ▒ >>>> > + 0.41% insert_phi_nodes_for >> ▒ >>>> > + 0.34% build_ssa_conflict_graph >>>> > patched: >>>> > - 5.79% 5.00% 167 cc1 cc1 [.] >>bitmap_set_b▒ >>>> > - bitmap_set_bit >> ▒ >>>> > + 1.41% add_ssa_edge >> ▒ >>>> > + 0.88% calculate_live_ranges >> ▒ >>>> > + 0.75% add_implicit_graph_edge >> ▒ >>>> > + 0.68% solve_constraints >> ▒ >>>> > + 0.48% insert_phi_nodes_for >> ▒ >>>> > + 0.45% build_ssa_conflict_graph >>>> > >>>> > -O3 >>>> > - 12.37% 12.34% 1145 cc1 cc1 [.] >>bitmap_set_b▒ >>>> > - bitmap_set_bit >> ▒ >>>> > + 9.14% add_ssa_edge >> ▒ >>>> > + 0.80% create_coalesce_list_for_region >> ▒ >>>> > + 0.69% add_implicit_graph_edge >> ▒ >>>> > + 0.54% solve_constraints >> ▒ >>>> > + 0.34% calculate_live_ranges >> ▒ >>>> > + 0.27% insert_phi_nodes_for >> ▒ >>>> > + 0.21% build_ssa_conflict_graph >>>> > - 4.36% 3.86% 227 cc1 cc1 [.] >>bitmap_set_b▒ >>>> > - bitmap_set_bit >> ▒ >>>> > + 0.98% add_ssa_edge >> ▒ >>>> > + 0.86% add_implicit_graph_edge >> ▒ >>>> > + 0.64% solve_constraints >> ▒ >>>> > + 0.57% calculate_live_ranges >> ▒ >>>> > + 0.32% build_ssa_conflict_graph >> ▒ >>>> > + 0.29% mark_all_vars_used_1 >> ▒ >>>> > + 0.20% insert_phi_nodes_for >> ▒ >>>> > + 0.16% create_coalesce_list_for_region >>>> > >>>> > >>>> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress. >>>> >>>> What's the performance like for more reasonable testcases? E.g. is >>there >>>> a measurable change either way in --enable-checking=release for some >>gcc >>>> .iis at -g or -O2 -g? >>> >>> I did a quick check on my set of cc1files (still .i, .ii ones tend to >>> be unusable quite quickly...). Most of them compile too quickly >>> to make any difference appear other than noise. Multi-second >>differences >>> like for PR63155 should be the exception but our O(n) bitmap >>> implementation really makes some parts of GCC quadratic where it >>> doesn't appear so. >>> >>> Is there a reason you expect it to be ever slower? >> >>During recent stage3s I've tried to look at profiles of cc1plus >>to see whether there was something easy we could do to improve >>compile times. And bitmap operations always showed up near the >>top of the profile. There were always some pathological queries >>in which the linear search really hurt, but whenever I tried "simple" >>ways to avoid the obvious cases, they made those queries quicker >>but slowed things down overall. It seemed that adding any extra logic >>to the queries hurted because even a small slowdown in common lookups >>overwhelmed a big saving in less common lookups. > > Yeah. I also noticed some 'obvious' shortcomings in the heuristics... > I guess in the end well predicted branches in the out of line code are > important... > >> >>But there again I was looking to speed up more "normal" cases, not ones >>like the PR. >> >>FWIW I've tried it on a local x86_64 box and it slows down current >>optabs.ii at -O2 -g by ~0.35% (reproducable). I see similar slowdowns >>for the other files I tried. But that's hardly a huge amount, and >>probably a price worth paying for the speed-up in the PR. > > Just to make sure what to reproduce - this is with checking disabled? > And with or without the hunks actually making use of the splay tree > path?
Yeah, with an --enable-checking=release stage3: ./cc1plus optabs.ii -o /dev/null -O2 -g using the optabs.ii from the unpatched --enable-checking=release build. It was the whole patch vs. without the patch. Thanks, Richard