https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #14)
> Created attachment 57829 [details]
> smaller testcase
> 
> Smaller testcase, shows the same compile-time issue at -O0.  At -O1 it's a
> lot less bad but memory usage is better (8GB), so the slowness of the full
> testcase is likely memory bandwidth related.
> 
> -O1 is then
> 
>  tree PTA                           :  20.59 ( 21%)
>  expand vars                        :   9.19 (  9%)
>  expand                             :  14.26 ( 15%)

The memory use goes into RTXen created during RTL expansion.  The compile-time
part is add_scope_conflicts.  One option is to do what var-tracking does and
use rev_post_order_and_mark_dfs_back_seme, which avoids iterating for
non-loops and gives better cache locality.
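
To illustrate the idea (not the actual GCC code), here is a minimal sketch of
computing a reverse post-order while detecting DFS back edges.  All names
(struct cfg, compute_rpo, ...) are hypothetical, and it only records whether
any back edge exists instead of marking individual edges as
rev_post_order_and_mark_dfs_back_seme does.  A forward dataflow problem
visited in this order needs a single pass when no back edge was seen.

```c
#include <stdbool.h>

#define MAX_BLOCKS 16
#define MAX_SUCCS 2

/* Hypothetical miniature CFG: succ[b][i] is the i-th successor of
   block b, or -1 if there is none.  */
struct cfg {
  int succ[MAX_BLOCKS][MAX_SUCCS];
};

enum { WHITE, GREY, BLACK };

struct rpo_state {
  int color[MAX_BLOCKS];  /* WHITE: unvisited, GREY: on DFS stack, BLACK: done */
  int post[MAX_BLOCKS];   /* blocks in post-order */
  int count;
  bool has_back_edge;
};

void dfs(const struct cfg *g, struct rpo_state *st, int b)
{
  st->color[b] = GREY;
  for (int i = 0; i < MAX_SUCCS; i++) {
    int s = g->succ[b][i];
    if (s < 0)
      continue;
    if (st->color[s] == GREY)
      st->has_back_edge = true;  /* target is still on the DFS stack */
    else if (st->color[s] == WHITE)
      dfs(g, st, s);
  }
  st->color[b] = BLACK;
  st->post[st->count++] = b;
}

/* Fill ORDER with the blocks reachable from ENTRY in reverse post-order
   and return how many there are.  *HAS_BACK_EDGE tells the caller
   whether a second dataflow pass can be needed at all.  */
int compute_rpo(const struct cfg *g, int entry, int *order,
                bool *has_back_edge)
{
  struct rpo_state st = { .count = 0, .has_back_edge = false };
  dfs(g, &st, entry);
  for (int i = 0; i < st.count; i++)
    order[i] = st.post[st.count - 1 - i];
  *has_back_edge = st.has_back_edge;
  return st.count;
}
```

Visiting blocks in this order also tends to touch each block's data once and
in sequence, which is where the cache-locality benefit would come from.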

We have half of the profile hits on ggc_internal_alloc and it's

    17 | d8:+- mov    %r14,%rax
       |    |  mov    (%r14),%r14
  1440 |    |  test   %r14,%r14
     4 |    |  je     530
       |    |if (p->bytes == entry_size)
       | e7:|  cmp    0x10(%r14),%r12
 65582 |    +--jne    d8

which is the linear walk

  /* Check the list of free pages for one we can use.  */
  for (pp = &G.free_pages, p = *pp; p; pp = &p->next, p = *pp) 
    if (p->bytes == entry_size)
      break;

so we seem to have many free pages for some reason, and the free-page pool
is one global list rather than one list per order?!
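
For comparison, a minimal sketch of what a per-size free list could look
like, so that finding a page of a given size does not walk pages of other
sizes.  This is only an illustration under assumptions: the names (struct
pool, pool_acquire, pool_release), the bucket count and the power-of-two
sizing are hypothetical and not taken from ggc-page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizing: bucket i holds pages of (BASE_BYTES << i) bytes.  */
#define NUM_BUCKETS 8
#define BASE_BYTES 4096

struct page {
  struct page *next;
  size_t bytes;
};

/* Hypothetical pool with one free list per page size, instead of a
   single global list holding pages of all sizes.  */
struct pool {
  struct page *free_pages[NUM_BUCKETS];
};

/* Map a page size to its bucket, or -1 if the size is not supported.
   The loop is bounded by NUM_BUCKETS, not by the number of free pages.  */
int bucket_for(size_t bytes)
{
  for (int i = 0; i < NUM_BUCKETS; i++)
    if (bytes == ((size_t) BASE_BYTES << i))
      return i;
  return -1;
}

/* Put a page back on the free list for its size.  */
void pool_release(struct pool *pool, struct page *p)
{
  int b = bucket_for(p->bytes);
  assert(b >= 0);
  p->next = pool->free_pages[b];
  pool->free_pages[b] = p;
}

/* Take a free page of exactly BYTES bytes, or return NULL if there is
   none.  No pages of other sizes are inspected.  */
struct page *pool_acquire(struct pool *pool, size_t bytes)
{
  int b = bucket_for(bytes);
  if (b < 0)
    return NULL;
  struct page *p = pool->free_pages[b];
  if (p) {
    pool->free_pages[b] = p->next;
    p->next = NULL;
  }
  return p;
}
```

With the single list, the cost of the walk above grows with the number of
free pages of the wrong size; with per-size lists the lookup cost no longer
depends on how many free pages exist.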

Samples: 299K of event 'cycles', Event count (approx.): 338413178083            
Overhead       Samples  Command  Shared Object       Symbol                     
  23.16%         67756  cc1plus  cc1plus             [.] ggc_internal_alloc
   6.98%         21637  cc1plus  cc1plus             [.] bitmap_tree_splay
   6.89%         20413  cc1plus  cc1plus             [.] bitmap_ior_into
   4.05%         11989  cc1plus  cc1plus             [.] bitmap_elt_ior
   3.16%          9840  cc1plus  cc1plus             [.] mergesort<sort_ctx>
   2.90%          8860  cc1plus  cc1plus             [.] bitmap_set_bit
   2.76%          8281  cc1plus  cc1plus             [.] get_ref_base_and_extent
   1.37%          4071  cc1plus  cc1plus             [.] stmt_may_clobber_ref_p_1
   1.32%          4095  cc1plus  cc1plus             [.] dominated_by_p
   1.16%          3597  cc1plus  cc1plus             [.] bitmap_tree_unlink_element
   1.06%          3128  cc1plus  cc1plus             [.] walk_aliased_vdefs_1

The bitmap_tree_splay hits are from compute_idf.  Refactoring that some more,
also avoiding the duplicate processing and doing away with the bitmap
for the workset, might help a bit there (not using tree view just moves
bitmap_set_bit up with no overall positive change).

I will look into the above things more (but not the RA slowness at -O0).
