https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
             Status|NEW                           |ASSIGNED

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #14)
> Created attachment 57829 [details]
> smaller testcase
>
> Smaller testcase, shows the same compile-time issue at -O0.  At -O1 it's a
> lot less bad but memory usage is better (8GB), so the slowness of the full
> testcase is likely memory bandwidth related.
>
> -O1 is then
>
>  tree PTA     :  20.59 ( 21%)
>  expand vars  :   9.19 (  9%)
>  expand       :  14.26 ( 15%)

The memory use goes into RTXen created during RTL expansion.  The compile-time
part is add_scope_conflicts.  There's the possibility to do like var-tracking
and use rev_post_order_and_mark_dfs_back_seme, avoiding iteration for
non-loops and have better cache locality.

We have half of the profile hits on ggc_internal_alloc and it's

      17 |  d8:+- mov    %r14,%rax
         |    |   mov    (%r14),%r14
    1440 |    |   test   %r14,%r14
       4 |    |   je     530
         |    |if (p->bytes == entry_size)
         | e7:|   cmp    0x10(%r14),%r12
   65582 |    +--jne    d8

which is the linear walk

  /* Check the list of free pages for one we can use.  */
  for (pp = &G.free_pages, p = *pp; p; pp = &p->next, p = *pp)
    if (p->bytes == entry_size)
      break;

so we seem to have many free pages for some reason but the free pages pool
is global and not per order?!

Samples: 299K of event 'cycles', Event count (approx.): 338413178083
Overhead  Samples  Command  Shared Object  Symbol
  23.16%    67756  cc1plus  cc1plus        [.] ggc_internal_alloc
   6.98%    21637  cc1plus  cc1plus        [.] bitmap_tree_splay
   6.89%    20413  cc1plus  cc1plus        [.] bitmap_ior_into
   4.05%    11989  cc1plus  cc1plus        [.] bitmap_elt_ior
   3.16%     9840  cc1plus  cc1plus        [.] mergesort<sort_ctx>
   2.90%     8860  cc1plus  cc1plus        [.] bitmap_set_bit
   2.76%     8281  cc1plus  cc1plus        [.] get_ref_base_and_extent
   1.37%     4071  cc1plus  cc1plus        [.] stmt_may_clobber_ref_p_1
   1.32%     4095  cc1plus  cc1plus        [.] dominated_by_p
   1.16%     3597  cc1plus  cc1plus        [.] bitmap_tree_unlink_element
   1.06%     3128  cc1plus  cc1plus        [.] walk_aliased_vdefs_1

the bitmap_tree_splay is from compute_idf, refactoring that some more, also
avoiding the duplicate processing and doing away with the bitmap for the
workset might help a bit there (not using tree view just gets set-bit up
with no overall positive change).

I will look into the above things more (but not the RA slowness at -O0).
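
For illustration only, here is a minimal sketch of why a single global free
list makes the quoted walk expensive, and what a pool split by entry size
could look like.  It is not the ggc-page code and not a proposed patch:
struct page, BASE_SIZE, NUM_CLASSES, size_class, take_linear, put_bucketed
and take_bucketed are all hypothetical names, and it assumes entry sizes are
power-of-two multiples of a 4096-byte page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified free-page entry; not GCC's real structure.  */
struct page
{
  struct page *next;
  size_t bytes;
};

#define BASE_SIZE 4096   /* assumed smallest entry size */
#define NUM_CLASSES 8    /* assumed number of size classes */

/* One global list, same shape as the quoted loop.  Unlinks and returns
   the first page of ENTRY_SIZE bytes, or NULL.  *STEPS counts how many
   entries were visited.  */
static struct page *
take_linear (struct page **head, size_t entry_size, unsigned long *steps)
{
  struct page **pp, *p;
  for (pp = head, p = *pp; p; pp = &p->next, p = *pp)
    {
      ++*steps;
      if (p->bytes == entry_size)
        break;
    }
  if (p)
    *pp = p->next;
  return p;
}

/* Map a size of BASE_SIZE << k to its class index k.  */
static unsigned
size_class (size_t bytes)
{
  unsigned k = 0;
  assert (bytes >= BASE_SIZE && (bytes & (bytes - 1)) == 0);
  while (((size_t) BASE_SIZE << k) < bytes)
    ++k;
  assert (k < NUM_CLASSES);
  return k;
}

/* Push P onto the list for its own size class.  */
static void
put_bucketed (struct page *buckets[NUM_CLASSES], struct page *p)
{
  unsigned k = size_class (p->bytes);
  p->next = buckets[k];
  buckets[k] = p;
}

/* Pop a page of ENTRY_SIZE bytes without scanning other sizes.  */
static struct page *
take_bucketed (struct page *buckets[NUM_CLASSES], size_t entry_size)
{
  unsigned k = size_class (entry_size);
  struct page *p = buckets[k];
  if (p)
    buckets[k] = p->next;
  return p;
}
```

With many free pages of one size queued ahead of the wanted size,
take_linear visits every one of them, which matches the hot cmp/jne pair in
the annotation above; take_bucketed only ever looks at the head of one list.
Whether per-size lists are the right fix for the real allocator is a
separate question.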