On Fri, 19 Oct 2018, Richard Biener wrote:

> On Fri, 19 Oct 2018, Richard Sandiford wrote:
> 
> > Richard Biener <rguent...@suse.de> writes:
> > > On October 18, 2018 11:05:32 PM GMT+02:00, Richard Sandiford
> > > <richard.sandif...@arm.com> wrote:
> > >>Richard Biener <rguent...@suse.de> writes:
> > >>> On Thu, 18 Oct 2018, Richard Sandiford wrote:
> > >>>> What's the performance like for more reasonable testcases?  E.g. is
> > >>there
> > >>>> a measurable change either way in --enable-checking=release for some
> > >>gcc
> > >>>> .iis at -g or -O2 -g?
> > >>>
> > >>> I did a quick check on my set of cc1 files (still .i; .ii ones tend to
> > >>> become unusable quite quickly...).  Most of them compile too quickly
> > >>> to make any difference appear other than noise.  Multi-second
> > >>differences
> > >>> like for PR63155 should be the exception but our O(n) bitmap
> > >>> implementation really makes some parts of GCC quadratic where it
> > >>> doesn't appear so.
> > >>>
> > >>> Is there a reason you expect it to ever be slower?
> > >>
> > >>During recent stage3s I've tried to look at profiles of cc1plus
> > >>to see whether there was something easy we could do to improve
> > >>compile times.  And bitmap operations always showed up near the
> > >>top of the profile.  There were always some pathological queries
> > >>in which the linear search really hurt, but whenever I tried "simple"
> > >>ways to avoid the obvious cases, they made those queries quicker
> > >>but slowed things down overall.  It seemed that adding any extra logic
> > >>to the queries hurt because even a small slowdown in common lookups
> > >>overwhelmed a big saving in less common lookups.
> > >
> > > Yeah. I also noticed some 'obvious' shortcomings in the heuristics...
> > > I guess in the end well-predicted branches in the out-of-line code are
> > > important...
> > >
> > >>
> > >>But there again I was looking to speed up more "normal" cases, not ones
> > >>like the PR.
> > >>
> > >>FWIW I've tried it on a local x86_64 box and it slows down current
> > >>optabs.ii at -O2 -g by ~0.35% (reproducible).  I see similar slowdowns
> > >>for the other files I tried.  But that's hardly a huge amount, and
> > >>probably a price worth paying for the speed-up in the PR.
> > >
> > > Just to make sure I know what to reproduce - this is with checking
> > > disabled?
> > > And with or without the hunks actually making use of the splay tree
> > > path?
> > 
> > Yeah, with an --enable-checking=release stage3:
> > 
> >    ./cc1plus optabs.ii -o /dev/null -O2 -g
> > 
> > using the optabs.ii from the unpatched --enable-checking=release build.
> > 
> > It was the whole patch vs. without the patch.
> 
> OK, so there are almost no hits from the SSA propagator or out-of-SSA
> but only from "unchanged" paths:
> 
> -    2.90%     2.90%            23  cc1plus  cc1plus           [.] bitmap_set_bit
>    - bitmap_set_bit
>       + 0.79% df_lr_bb_local_compute
>       + 0.38% insert_updated_phi_nodes_for
>       + 0.27% sched_analyze_reg
>       + 0.23% walk_aliased_vdefs_1
>       + 0.13% get_continuation_for_phi
>       + 0.13% add_control_edge
>       + 0.13% df_md_bb_local_compute_process_def
>       + 0.13% mark_for_renaming
>       + 0.13% (anonymous namespace)::pass_rtl_cprop::execute
>       + 0.13% compute_dominance_frontiers
>       + 0.12% df_simulate_initialize_backwards
> 
> It's also interesting that most branch misses (for bitmap_set_bit)
> are from
> 
>       bool res = (ptr->bits[word_num] & bit_val) == 0;
>       if (res)
>         ptr->bits[word_num] |= bit_val;
>       return res;
> 
> I'm not sure how "bad" a mispredicted branch is here.  I guess
> if it is predicted to be taken (skipping the write) then it's bad,
> but if it is predicted the other way around it should just be a
> matter of not retiring the store...  But I am not a CPU guy.  I
> guess unconditionally updating the memory wouldn't be so bad
> after all, and it might also help combine use architecture-specific
> optimizations, like deriving the comparison result from the CC
> flags of the OR operation.  Can you benchmark a difference for
> this?
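> 
> Concretely, the unconditional variant could look like this (just a
> sketch, reusing ptr/word_num/bit_val from the snippet above; not
> benchmarked):
> 
>       /* Unconditionally OR in the bit and compute the return value
>          from the old word contents, so there is no conditional
>          store left to mispredict.  */
>       BITMAP_WORD old_word = ptr->bits[word_num];
>       ptr->bits[word_num] = old_word | bit_val;
>       return (old_word & bit_val) == 0;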

So I can also see mispredicts on

  if (!head->tree_form)
    element = bitmap_list_find_element (head, indx, usage);
  else
    element = bitmap_tree_find_element (head, indx);

which I could significantly reduce by re-ordering the two arms
(thus without any branch history this currently gets statically
predicted to the tree variant).
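
I.e. swapping the two arms, something like (a sketch; the actual
static prediction of course depends on the code layout the
compiler picks):

  if (head->tree_form)
    element = bitmap_tree_find_element (head, indx);
  else
    element = bitmap_list_find_element (head, indx, usage);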

That suggests replacing the above runtime checks with dedicated
bitmap_tree_bit_p / bitmap_tree_set_bit routines (or doing tricks
with C++ and overloading via a tbitmap type).
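
A rough illustration of the overloading idea (tbitmap_head and
tbitmap are hypothetical names, just to show the shape):

  /* A distinct pointer type for tree-form bitmaps, so overload
     resolution picks the tree routine at compile time and the
     runtime head->tree_form check disappears.  */
  struct tbitmap_head : bitmap_head {};
  typedef tbitmap_head *tbitmap;

  extern bool bitmap_set_bit (bitmap head, int bit);   /* list form */
  extern bool bitmap_set_bit (tbitmap head, int bit);  /* tree form */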

Would you be happier with that approach?

Thanks,
Richard.
