Richard Biener <rguent...@suse.de> writes:
> On Fri, 19 Oct 2018, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > On October 18, 2018 11:05:32 PM GMT+02:00, Richard Sandiford
>> > <richard.sandif...@arm.com> wrote:
>> >>During recent stage3s I've tried to look at profiles of cc1plus
>> >>to see whether there was something easy we could do to improve
>> >>compile times.  And bitmap operations always showed up near the
>> >>top of the profile.  There were always some pathological queries
>> >>in which the linear search really hurt, but whenever I tried "simple"
>> >>ways to avoid the obvious cases, they made those queries quicker
>> >>but slowed things down overall.  It seemed that adding any extra logic
>> >>to the queries hurt because even a small slowdown in common lookups
>> >>overwhelmed a big saving in less common lookups.
>> >
>> > Yeah.  I also noticed some 'obvious' shortcomings in the heuristics...
>> > I guess in the end well-predicted branches in the out-of-line code are
>> > important...
>> >
>> >>But there again I was looking to speed up more "normal" cases, not ones
>> >>like the PR.
>> >>
>> >>FWIW I've tried it on a local x86_64 box and it slows down current
>> >>optabs.ii at -O2 -g by ~0.35% (reproducible).  I see similar slowdowns
>> >>for the other files I tried.  But that's hardly a huge amount, and
>> >>probably a price worth paying for the speed-up in the PR.
>> >
>> > Just to make sure what to reproduce - this is with checking disabled?
>> > And with or without the hunks actually making use of the splay tree
>> > path?
>>
>> Yeah, with an --enable-checking=release stage3:
>>
>>   ./cc1plus optabs.ii -o /dev/null -O2 -g
>>
>> using the optabs.ii from the unpatched --enable-checking=release build.
>>
>> It was the whole patch vs. without the patch.
>
> OK, so there are almost no hits from the SSA propagator or out-of-SSA,
> but only from "unchanged" paths:
>
>    - 2.90%  2.90%  23  cc1plus  cc1plus  [.] bitmap_set_bit
>       - bitmap_set_bit
>          + 0.79% df_lr_bb_local_compute
>          + 0.38% insert_updated_phi_nodes_for
>          + 0.27% sched_analyze_reg
>          + 0.23% walk_aliased_vdefs_1
>          + 0.13% get_continuation_for_phi
>          + 0.13% add_control_edge
>          + 0.13% df_md_bb_local_compute_process_def
>          + 0.13% mark_for_renaming
>          + 0.13% (anonymous namespace)::pass_rtl_cprop::execute
>          + 0.13% compute_dominance_frontiers
>          + 0.12% df_simulate_initialize_backwards
>
> It's also interesting that most branch misses (for bitmap_set_bit)
> are from
>
>   bool res = (ptr->bits[word_num] & bit_val) == 0;
>   if (res)
>     ptr->bits[word_num] |= bit_val;
>   return res;
>
> I'm not sure how "bad" a mispredicted branch is here.  I guess
> if it is predicted to be taken (skip the write) then it's bad,
> but if it is predicted the other way around it should be a matter
> of not retiring the store...  But I am not a CPU guy.  I guess
> unconditionally updating the memory wouldn't be so bad after all,
> and it might also help combine in using architecture-specific
> optimizations like using some CC flags of the OR operation
> to get at the comparison result.  Can you benchmark a difference
> for this?
>
> Individual compiles of optabs.ii are too fast (<4s for me) to
> make out overall performance changes, and the noise is quite high
> for me as well (+-6%).
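[A minimal stand-alone sketch of the unconditional-update variant suggested
above; it is not taken from the thread or from the GCC sources, and the
structure below is a reduced stand-in for bitmap_element with only the
fields used here.]

  /* Stand-in for GCC's bitmap_element; BITMAP_WORD is assumed to be an
     unsigned 64-bit type.  */
  typedef unsigned long long BITMAP_WORD;
  struct bitmap_element_stub
  {
    BITMAP_WORD bits[2];
  };

  /* Unconditionally OR BIT_VAL into word WORD_NUM of PTR and report
     whether the bit was previously clear.  The store no longer sits
     behind a branch, so a misprediction cannot delay it, and the return
     value can be derived from the operands of the OR itself.  */
  static inline bool
  set_bit_unconditionally (bitmap_element_stub *ptr, unsigned word_num,
                           BITMAP_WORD bit_val)
  {
    BITMAP_WORD old_word = ptr->bits[word_num];
    ptr->bits[word_num] = old_word | bit_val;
    return (old_word & bit_val) == 0;
  }

[The trade-off the benchmark request is presumably after: the unconditional
form writes memory even when the bit is already set, whereas the current
code skips the store in that case.]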
The noise is much lower on my box.  Lucky me I guess :-)

> It looks like you cannot even use perf's precise counters (Haswell here)
> of retired instructions to measure differences reliably.
>
> That is, I cannot really reproduce your findings on optabs.ii...
> for me, the best runtime of both patched and unpatched is the
> same 3.22s.

At least part of the difference I'm seeing seems to come from extra
asserts like:

  gcc_assert (!dst->tree_form && !a->tree_form && !b->tree_form
              && !kill->tree_form);

in which each test needs a separate test and branch, since it isn't
valid to dereference "a" when dst->tree_form is true, etc.  I see a
measurable difference from changing all of the
"foo->tree_form && bar->tree_form" tests to use "&", or from making
them all gcc_checking_asserts instead of gcc_asserts.  That didn't
really sound likely at first, but there are ~2.4 million calls to
functions with those kinds of chained test (mostly with just 2
arguments), so it mounts up.

BTW, in case it sounds otherwise, I'm not objecting to the patch.
It looks like a nice thing to have, and I don't think this should
hold it up.

Thanks,
Richard
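[An illustrative stand-alone sketch of the "&&" vs. "&" difference
described above; the struct and function names are placeholders mirroring
the quoted assert, not GCC's actual declarations.]

  /* Placeholder mirroring the tree_form flag on a bitmap head.  */
  struct bitmap_head_stub
  {
    unsigned tree_form : 1;
  };

  /* Short-circuiting form: the compiler has to emit a separate test and
     branch per operand, because a later operand must not be evaluated
     once an earlier test has already failed.  */
  static inline bool
  all_list_form_branchy (const bitmap_head_stub *dst,
                         const bitmap_head_stub *a,
                         const bitmap_head_stub *b,
                         const bitmap_head_stub *kill)
  {
    return !dst->tree_form && !a->tree_form
           && !b->tree_form && !kill->tree_form;
  }

  /* Non-short-circuiting form: every flag is loaded up front, so the
     results can be combined and tested with a single branch.  This is
     only valid because the caller knows all four pointers can be
     dereferenced unconditionally, which the compiler cannot prove for
     the "&&" form.  */
  static inline bool
  all_list_form_combined (const bitmap_head_stub *dst,
                          const bitmap_head_stub *a,
                          const bitmap_head_stub *b,
                          const bitmap_head_stub *kill)
  {
    return !dst->tree_form & !a->tree_form
           & !b->tree_form & !kill->tree_form;
  }

[The gcc_checking_assert alternative mentioned above avoids the cost a
different way: those checks compile away entirely in release-checking
builds.]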