Hi!
For a tramp3d -O2 compile we still have
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
5.10 4.53 4.53 129961 0.00 0.00 cgraph_remove_node
2.93 7.13 2.60 28401225 0.00 0.00 ggc_alloc_stat
1.65 8.60 1.47 11131801 0.00 0.00 splay_tree_splay_helper
1.38 9.83 1.23 17375270 0.00 0.00 get_stmt_operands
1.34 11.02 1.19 13906935 0.00 0.00 htab_find_slot_with_hash
with is probably the loop(s) over all clones in cgraph_remove_node which results in all this being O(n^2) inlining all of the clones.
The obvious fix for the first loop, the removal of the node from the clones list, is to make this list doubly-linked, there is another loop just a few lines below with the same complexity:
for (n = *slot; n; n = n->next_clone)
if (n->global.inlined_to
|| (!n->global.inlined_to
&& !TREE_ASM_WRITTEN (n->decl) && !DECL_EXTERNAL (n->decl)))
break;
if (!n && !dump_enabled_p (TDI_tree_all))
{
DECL_SAVED_TREE (node->decl) = NULL;
DECL_STRUCT_FUNCTION (node->decl) = NULL;
DECL_INITIAL (node->decl) = error_mark_node;
}
which checks if we still need the function body. Why is this loop there
in the first place? Wouldn't it be enough to check this if the node is
the last clone? Why do we need the list of clones anyway? It looks like its purpose can be achieved with some reference counting, too?
Any ideas how to fix this remaining O(N^2) complexity for 4.0?
Thanks, Richard.