Martin, while looking into the ipa-cp dumps for bzip and Firefox I noticed few issues. First of all, ipcp_cloning_candidate_p calls optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl)) which can not be used at WPA time, becuase we have no DECL_STRUCT_FUNCTION around. I replaced it by node->optimize_for_size_p ().
Second we perform incredible number of clones because we do obtain some sort of polymorphic call context for them. In wast majority of cases this is useless effort, because the functions in question do not contain virtual calls and do not pass the parameter further. For firefox about 40k out of 50k clones created are created just because we found some context. I changed the code to only clone if this immediately leads to devirtualization. This do not cause any noticeable drop in number of devirtualized calls on Firefox. I suppose we will miss the case where cloning a caller may allow devirtualization in a clone of callee, but I do not think the heuristics for context independent values can handle this as implemented right now and it simply have way to many false positives. What we can do is to devirtualize w/o cloning for local functions and speculatively devirtualize in case we would otherwise clone. Third problem I noticed is that will_be_removed_from_program_if_no_direct_calls_p is used to decide if we can ignore the function size when deciding about the code size impact. This function is doing some analysis for inliner where it, for example, analyses if a comdat which is going to be inlined consistently in the whole program will be removed. In the cloning case I do not see this to apply: we have no evidence that the other units will pass the same constants to the function. I think you basically want to assume that the function will be removed if it has no address taken and it is not externally visibible. This is what local flag is for. I gathered some stats: number of clones for all contexts: 49948->11102 number of clones: 4376->4383 good_cloning_opportunity_p is called about 70k times, I wonder if the thresholds are not simply set too high. For example, inliner does about 300k inlines at Firefox. number of param replacements: 13041-> 13056 + 5383 aggregate replacements (I do not have data on unpatched tree for this) number of devirts: 956->933 number of devirts happening at inline: 781->868 number of indirect calls promoted: 512->512 Inliner stats from: Unit growth for small function inlining: 7965701->9130051 (14%) to: Unit growth for small function inlining: 7965010->9138577 So it seems that except for large drop in number of clones there is no significant difference. I am bootstrapping/regtesting this on x86_64-linux, does it seem OK? Honza * ipa-cp.c (ipcp_cloning_candidate_p): Use node->optimize_for_size_p. (good_cloning_opportunity_p): Likewise. (gather_context_independent_values): Do not return true when polymorphic call context is known or when we have known aggregate value of unused parameter. (estimate_local_effects): Try to create clone for all context when either some params are substituted or devirtualization is possible or some params can be removed; use local flag instead of node->will_be_removed_from_program_if_no_direct_calls_p. (identify_dead_nodes): Likewise. Index: ipa-cp.c =================================================================== --- ipa-cp.c (revision 231477) +++ ipa-cp.c (working copy) @@ -613,7 +613,7 @@ ipcp_cloning_candidate_p (struct cgraph_ return false; } - if (!optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl))) + if (node->optimize_for_size_p ()) { if (dump_file) fprintf (dump_file, "Not considering %s for cloning; " @@ -2267,7 +2267,7 @@ good_cloning_opportunity_p (struct cgrap { if (time_benefit == 0 || !opt_for_fn (node->decl, flag_ipa_cp_clone) - || !optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl))) + || node->optimize_for_size_p ()) return false; gcc_assert (size_cost > 0); @@ -2387,12 +2387,14 @@ gather_context_independent_values (struc *removable_params_cost += ipa_get_param_move_cost (info, i); + if (!ipa_is_param_used (info, i)) + continue; + ipcp_lattice<ipa_polymorphic_call_context> *ctxlat = &plats->ctxlat; + /* Do not account known context as reason for cloning. We can see + if it permits devirtualization. */ if (ctxlat->is_single_const ()) - { - (*known_contexts)[i] = ctxlat->values->value; - ret = true; - } + (*known_contexts)[i] = ctxlat->values->value; if (known_aggs) { @@ -2494,7 +2496,9 @@ estimate_local_effects (struct cgraph_no &known_contexts, &known_aggs, &removable_params_cost); known_aggs_ptrs = agg_jmp_p_vec_for_t_vec (known_aggs); - if (always_const) + int devirt_bonus = devirtualization_time_bonus (node, known_csts, + known_contexts, known_aggs_ptrs); + if (always_const || devirt_bonus || removable_params_cost) { struct caller_statistics stats; inline_hints hints; @@ -2505,8 +2509,7 @@ estimate_local_effects (struct cgraph_no false); estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts, known_aggs_ptrs, &size, &time, &hints); - time -= devirtualization_time_bonus (node, known_csts, known_contexts, - known_aggs_ptrs); + time -= devirt_bonus; time -= hint_time_bonus (hints); time -= removable_params_cost; size -= stats.n_calls * removable_params_cost; @@ -2515,8 +2518,7 @@ estimate_local_effects (struct cgraph_no fprintf (dump_file, " - context independent values, size: %i, " "time_benefit: %i\n", size, base_time - time); - if (size <= 0 - || node->will_be_removed_from_program_if_no_direct_calls_p ()) + if (size <= 0 || node->local.local) { info->do_clone_for_all_contexts = true; base_time = time; @@ -2544,6 +2546,10 @@ estimate_local_effects (struct cgraph_no "max_new_size would be reached with %li.\n", size + overall_size); } + else if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, " Not cloning for all contexts because " + "!good_cloning_opportunity_p.\n"); + } for (i = 0; i < count ; i++) @@ -4419,7 +4425,7 @@ identify_dead_nodes (struct cgraph_node { struct cgraph_node *v; for (v = node; v ; v = ((struct ipa_dfs_info *) v->aux)->next_cycle) - if (v->will_be_removed_from_program_if_no_direct_calls_p () + if (v->local.local && !v->call_for_symbol_thunks_and_aliases (has_undead_caller_from_outside_scc_p, NULL, true)) IPA_NODE_REF (v)->node_dead = 1;