Richard Biener <rguent...@suse.de> writes:
> On Fri, 19 Oct 2018, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > On October 18, 2018 11:05:32 PM GMT+02:00, Richard Sandiford
>> > <richard.sandif...@arm.com> wrote:
>> >>During recent stage3s I've tried to look at profiles of cc1plus
>> >>to see whether there was something easy we could do to improve
>> >>compile times.  And bitmap operations always showed up near the
>> >>top of the profile.  There were always some pathological queries
>> >>in which the linear search really hurt, but whenever I tried "simple"
>> >>ways to avoid the obvious cases, they made those queries quicker
>> >>but slowed things down overall.  It seemed that adding any extra logic
>> >>to the queries hurt because even a small slowdown in common lookups
>> >>overwhelmed a big saving in less common lookups.
>> >
>> > Yeah. I also noticed some 'obvious' shortcomings in the heuristics...
>> > I guess in the end well predicted branches in the out of line code are
>> > important...
>> >
>> >>
>> >>But there again I was looking to speed up more "normal" cases, not ones
>> >>like the PR.
>> >>
>> >>FWIW I've tried it on a local x86_64 box and it slows down current
>> >>optabs.ii at -O2 -g by ~0.35% (reproducible).  I see similar slowdowns
>> >>for the other files I tried.  But that's hardly a huge amount, and
>> >>probably a price worth paying for the speed-up in the PR.
>> >
>> > Just to make sure what to reproduce - this is with checking disabled?
>> > And with or without the hunks actually making use of the splay tree
>> > path?
>> 
>> Yeah, with an --enable-checking=release stage3:
>> 
>>    ./cc1plus optabs.ii -o /dev/null -O2 -g
>> 
>> using the optabs.ii from the unpatched --enable-checking=release build.
>> 
>> It was the whole patch vs. without the patch.
>
> OK, so there are almost no hits from the SSA propagator or out-of-SSA
> but only from "unchanged" paths:
>
> -    2.90%     2.90%            23  cc1plus  cc1plus           [.] bitmap_set_bit
>    - bitmap_set_bit
>       + 0.79% df_lr_bb_local_compute
>       + 0.38% insert_updated_phi_nodes_for
>       + 0.27% sched_analyze_reg
>       + 0.23% walk_aliased_vdefs_1
>       + 0.13% get_continuation_for_phi
>       + 0.13% add_control_edge
>       + 0.13% df_md_bb_local_compute_process_def
>       + 0.13% mark_for_renaming
>       + 0.13% (anonymous namespace)::pass_rtl_cprop::execute
>       + 0.13% compute_dominance_frontiers
>       + 0.12% df_simulate_initialize_backwards
>
> it's also interesting that most branch misses (for bitmap_set_bit)
> are from
>
>       bool res = (ptr->bits[word_num] & bit_val) == 0;
>       if (res)
>         ptr->bits[word_num] |= bit_val;
>       return res;
>
> I'm not sure how "bad" a mispredicted branch is here.  I guess
> if it is predicted to be taken (skip the write) then it's bad,
> but if it is predicted the other way around it should just be a matter
> of not retiring the store...  But I am not a CPU guy.  I guess
> unconditionally updating the memory wouldn't be so bad after all,
> and it might also help the combine pass use architecture-specific
> optimizations, like using the CC flags of the OR operation
> to get at the comparison result.  Can you benchmark a difference
> for this?
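
For reference, a minimal standalone sketch of the unconditional update
you're describing (names and signature below are just illustrative, not
the actual bitmap.c code) would store the OR unconditionally and derive
the "was the bit clear?" result from the old word:

  static inline bool
  set_bit_unconditionally (unsigned long *bits, unsigned word_num,
                           unsigned long bit_val)
  {
    unsigned long old_word = bits[word_num];
    /* Store unconditionally, so no branch guards the write.  */
    bits[word_num] = old_word | bit_val;
    /* True iff the bit was previously clear, i.e. this call set it.  */
    return (old_word & bit_val) == 0;
  }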
>
> Individual compiles of optabs.ii are too fast (<4s for me) to
> make out overall performance changes and noise is quite high
> for me as well (+-6%).

The noise is much lower on my box.  Lucky me I guess :-)

> It looks like you cannot even use perf's precise counters (Haswell here)
> of retired instructions to measure differences reliably.
>
> That is, I cannot really reproduce your findings on optabs.ii...
> for me, the best runtime of both patched and unpatched is the
> same 3.22s.

At least part of the difference I'm seeing seems to come from extra
asserts like:

  gcc_assert (!dst->tree_form && !a->tree_form && !b->tree_form
              && !kill->tree_form);

in which each condition needs its own test and branch, since it isn't valid
to dereference "a" when dst->tree_form is true, etc., so the compiler can't
combine the tests.  I see a measurable difference from changing all of the
"foo->tree_form && bar->tree_form" tests to use "&", or from making them
all gcc_checking_asserts instead of gcc_asserts.

Didn't really sound likely at first, but there are ~2.4 million
calls to functions with those kinds of chained tests (mostly with
just 2 arguments), so it mounts up.
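
To illustrate the "&&" vs "&" point with a made-up structure (the "head"
type and function names below are only for the example):

  struct head { bool tree_form; };

  /* With "&&", short-circuit evaluation only allows the later flags to
     be loaded once the earlier ones are known to be false, so each
     operand gets its own test and branch.  */
  void
  check_short_circuit (const head *dst, const head *a, const head *b)
  {
    gcc_assert (!dst->tree_form && !a->tree_form && !b->tree_form);
  }

  /* With "&", all the flags are loaded and combined unconditionally,
     leaving a single branch on the combined result.  */
  void
  check_combined (const head *dst, const head *a, const head *b)
  {
    gcc_assert (!dst->tree_form & !a->tree_form & !b->tree_form);
  }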

BTW, in case it sounds otherwise, I'm not objecting to the patch.
Looks like a nice thing to have and I don't think this should hold it up.

Thanks,
Richard
