https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54052

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, just as expected:

Samples: 5M of event 'cycles:u', Event count (approx.): 8038399460939           
Overhead       Samples  Command  Shared Object       Symbol                     
  95.69%       5693007  cc1plus  cc1plus             [.] bitmap_ior_into
   0.62%         37954  cc1plus  cc1plus             [.]
get_last_value_validate
...

with

-   95.70%    95.67%       5717568  cc1plus  cc1plus             [.]
bitmap_ior_into                        #
     95.67% _start                                                             
                            #
        __libc_start_main                                                      
                            #
        main                                                                   
                            #
        toplev::main                                                           
                            #
        toplev::main                                                           
                            #
        compile_file                                                           
                            #
        symbol_table::finalize_compilation_unit                                
                            #
        symbol_table::compile (inlined)                                        
                            #
        symbol_table::compile                                                  
                            #
      - symbol_table::compile                                                  
                            #
         - 95.67% cgraph_node::expand                                          
                            #
              execute_pass_list                                                
                            #
            - execute_pass_list_1                                              
                            #
               - 95.67% execute_pass_list_1                                    
                            #
                  - 95.67% execute_one_pass                                    
                            #
                     - 95.66% fwprop                                           
                            #
                        - 95.66% fwprop_init (inlined)                         
                            #
                             rtl_ssa::function_info::function_info             
                            #
                           - rtl_ssa::function_info::process_all_blocks        
                            #
                              + 95.66% rtl_ssa::function_info::place_phis      
                            #

and it seems to be the

  // If block B1 defines R and if B2 is in the dominance frontier of B1,
  // queue a possible phi node for R in B2.
  auto_bitmap worklist;
  for (unsigned int b1 = 0; b1 < num_bb_indices; ++b1)
    {
      // Only access DF information for blocks that are known to exist.
      if (bitmap_empty_p (&frontiers[b1]))
        continue;

      bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, b1))->def;
      bitmap_iterator bmi;
      unsigned int b2;
      EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi)
        if (bitmap_ior_into (&unfiltered[b2], b1_def)
            && !bitmap_empty_p (&frontiers[b2]))
          // Propagate the (potential) new phi node definitions in B2.
          bitmap_set_bit (worklist, b2);
    }

loop.  This looks quadratic and I wonder if it is the best way to perform
PHI insertion ... possibly (since it seems to be some sort of propagation)
an improvement could be to work on RPO instead of BB index order, but
possibly this tries to compute the minimal set of PHIs required which
could be expensive - I don't remember GIMPLE SSA doing any sort of
"iteration" (but we do not compute all PHI insertions at the same time
but instead do it per variable).  The slowness (apart from the quadraticness)
might be as well because the working sets when processing all PHIs at the
same time is so large.

Reply via email to