Martin,
while looking into the ipa-cp dumps for bzip and Firefox I noticed few issues.
First of all, ipcp_cloning_candidate_p calls
 optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl))
which can not be used at WPA time, becuase we have no DECL_STRUCT_FUNCTION
around.  I replaced it by node->optimize_for_size_p ().

Second we perform incredible number of clones because we do obtain some sort of
polymorphic call context for them.  In wast majority of cases this is useless
effort, because the functions in question do not contain virtual calls and do
not pass the parameter further.  For firefox about 40k out of 50k clones
created are created just because we found some context.

I changed the code to only clone if this immediately leads to devirtualization.
This do not cause any noticeable drop in number of devirtualized calls on
Firefox. I suppose we will miss the case where cloning a caller may allow
devirtualization in a clone of callee, but I do not think the heuristics for
context independent values can handle this as implemented right now and it
simply have way to many false positives.

What we can do is to devirtualize w/o cloning for local functions and
speculatively devirtualize in case we would otherwise clone.

Third problem I noticed is that
will_be_removed_from_program_if_no_direct_calls_p is used to decide if we can
ignore the function size when deciding about the code size impact.
This function is doing some analysis for inliner where it, for example, analyses
if a comdat which is going to be inlined consistently in the whole program
will be removed.

In the cloning case I do not see this to apply: we have no evidence that the
other units will pass the same constants to the function.  I think you
basically want to assume that the  function will be removed if it has no
address taken and it is not externally visibible. This is what local flag
is for.

I gathered some stats:

number of clones for all contexts: 49948->11102
number of clones: 4376->4383

good_cloning_opportunity_p is called about 70k times, I wonder if the
thresholds are not simply set too high.  For example, inliner does about 300k
inlines at Firefox.

number of param replacements: 13041-> 13056 + 5383 aggregate replacements (I do 
not have data on unpatched tree for this)
number of devirts: 956->933
number of devirts happening at inline: 781->868
number of indirect calls promoted: 512->512

Inliner stats from: Unit growth for small function inlining: 7965701->9130051 
(14%)
to: Unit growth for small function inlining: 7965010->9138577

So it seems that except for large drop in number of clones there is no 
significant difference.

I am bootstrapping/regtesting this on x86_64-linux, does it seem OK?

Honza

        * ipa-cp.c (ipcp_cloning_candidate_p): Use node->optimize_for_size_p.
        (good_cloning_opportunity_p): Likewise.
        (gather_context_independent_values): Do not return true when
        polymorphic call context is known or when we have known aggregate
        value of unused parameter.
        (estimate_local_effects): Try to create clone for all context
        when either some params are substituted or devirtualization is possible
        or some params can be removed; use local flag instead of
        node->will_be_removed_from_program_if_no_direct_calls_p.
        (identify_dead_nodes): Likewise.
Index: ipa-cp.c
===================================================================
--- ipa-cp.c    (revision 231477)
+++ ipa-cp.c    (working copy)
@@ -613,7 +613,7 @@ ipcp_cloning_candidate_p (struct cgraph_
       return false;
     }
 
-  if (!optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl)))
+  if (node->optimize_for_size_p ())
     {
       if (dump_file)
         fprintf (dump_file, "Not considering %s for cloning; "
@@ -2267,7 +2267,7 @@ good_cloning_opportunity_p (struct cgrap
 {
   if (time_benefit == 0
       || !opt_for_fn (node->decl, flag_ipa_cp_clone)
-      || !optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl)))
+      || node->optimize_for_size_p ())
     return false;
 
   gcc_assert (size_cost > 0);
@@ -2387,12 +2387,14 @@ gather_context_independent_values (struc
        *removable_params_cost
          += ipa_get_param_move_cost (info, i);
 
+      if (!ipa_is_param_used (info, i))
+       continue;
+
       ipcp_lattice<ipa_polymorphic_call_context> *ctxlat = &plats->ctxlat;
+      /* Do not account known context as reason for cloning.  We can see
+        if it permits devirtualization.  */
       if (ctxlat->is_single_const ())
-       {
-         (*known_contexts)[i] = ctxlat->values->value;
-         ret = true;
-       }
+       (*known_contexts)[i] = ctxlat->values->value;
 
       if (known_aggs)
        {
@@ -2494,7 +2496,9 @@ estimate_local_effects (struct cgraph_no
                                                    &known_contexts, 
&known_aggs,
                                                    &removable_params_cost);
   known_aggs_ptrs = agg_jmp_p_vec_for_t_vec (known_aggs);
-  if (always_const)
+  int devirt_bonus = devirtualization_time_bonus (node, known_csts,
+                                          known_contexts, known_aggs_ptrs);
+  if (always_const || devirt_bonus || removable_params_cost)
     {
       struct caller_statistics stats;
       inline_hints hints;
@@ -2505,8 +2509,7 @@ estimate_local_effects (struct cgraph_no
                                              false);
       estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts,
                                         known_aggs_ptrs, &size, &time, &hints);
-      time -= devirtualization_time_bonus (node, known_csts, known_contexts,
-                                          known_aggs_ptrs);
+      time -= devirt_bonus;
       time -= hint_time_bonus (hints);
       time -= removable_params_cost;
       size -= stats.n_calls * removable_params_cost;
@@ -2515,8 +2518,7 @@ estimate_local_effects (struct cgraph_no
        fprintf (dump_file, " - context independent values, size: %i, "
                 "time_benefit: %i\n", size, base_time - time);
 
-      if (size <= 0
-         || node->will_be_removed_from_program_if_no_direct_calls_p ())
+      if (size <= 0 || node->local.local)
        {
          info->do_clone_for_all_contexts = true;
          base_time = time;
@@ -2544,6 +2546,10 @@ estimate_local_effects (struct cgraph_no
                     "max_new_size would be reached with %li.\n",
                     size + overall_size);
        }
+      else if (dump_file && (dump_flags & TDF_DETAILS))
+       fprintf (dump_file, "   Not cloning for all contexts because "
+                "!good_cloning_opportunity_p.\n");
+       
     }
 
   for (i = 0; i < count ; i++)
@@ -4419,7 +4425,7 @@ identify_dead_nodes (struct cgraph_node
 {
   struct cgraph_node *v;
   for (v = node; v ; v = ((struct ipa_dfs_info *) v->aux)->next_cycle)
-    if (v->will_be_removed_from_program_if_no_direct_calls_p ()
+    if (v->local.local
        && !v->call_for_symbol_thunks_and_aliases
             (has_undead_caller_from_outside_scc_p, NULL, true))
       IPA_NODE_REF (v)->node_dead = 1;

Reply via email to