Hi,
Currently enabling profile feedback regresses x264 and exchange. In both cases
the root of the issue is that the ipa-cp cost model concludes that cloning is
not worthwhile when feedback is available, while it does clone without
feedback.

Consider:

__attribute__ ((used))
int a[1000];

__attribute__ ((noinline))
void
test2(int sz)
{
  for (int i = 0; i < sz; i++)
          a[i]++;
  asm volatile (""::"m"(a));
}

__attribute__ ((noinline))
void
test1 (int sz)
{
  for (int i = 0; i < 1000; i++)
          test2(sz);
}
int main()
{
        test1(1000);
        return 0;
}

Here we want to clone both test1 and test2 and specialize them for 1000, but
ipa-cp will not do that, since it skips the call main->test1 as not hot because
it is executed just once, both with and without profile feedback.
In this simple testcase we track that main is called once even without profile
feedback.

I think the testcase shows that the hotness of a call is not that relevant when
deciding whether we want to propagate constants across it.  With IPA profile,
ipa-cp can compute an overall estimate of the time saved (the existing time
benefit, i.e. time saved per invocation of the function, multiplied by the
number of executions) and see whether the result is big enough. An easy check
is to simply call maybe_hot_count_p on the resulting count.
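
For illustration, here is a minimal standalone model of that check (plain C
with a made-up helper name and made-up hot threshold, not the GCC
implementation; the real code works on profile_count/sreal, calls
maybe_hot_count_p and compares against param_ipa_cp_eval_threshold):

#include <stdio.h>

/* Simplified model: overall time saved is time saved per invocation times
   the IPA execution count; clone only if that is "hot" enough and the
   saving per unit of code growth clears the evaluation threshold.  */
static int
worth_cloning_p (double time_benefit, long long count_sum, int size_cost,
                 double hot_threshold, double eval_threshold)
{
  double saved_time = time_benefit * count_sum;  /* overall time saved */
  if (saved_time < hot_threshold)                /* cf. maybe_hot_count_p */
    return 0;
  double evaluation = saved_time / size_cost;    /* saving per size unit */
  return evaluation >= eval_threshold;
}

int
main ()
{
  /* Numbers matching the dump further down in this mail.  */
  printf ("test1 alone: %d\n", worth_cloning_p (1, 1, 8, 1000, 500));
  printf ("test1->test2 chain: %d\n",
          worth_cloning_p (129001, 1, 20, 1000, 500));
  return 0;
}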

So this patch makes ipa-cp consider all call sites as interesting except those
known to be unlikely executed (i.e. run 0 times in the train run or known to
lead to something bad), which makes ipa-cp propagate across them, find cloning
candidates and feed them into good_cloning_opportunity_p.

For this I added cs_interesting_for_ipcp_p, which also attempts to do the right
thing with partial training.

With just that change, good_cloning_opportunity_p would still return false,
since it figures out that the call edge is not very frequent.
It already kind of knows that the frequency of the call instruction itself is
not too important, but instead of computing the overall time saved, it compares
the edge counts with the count at the param_ipa_cp_profile_count_base
percentage position of the call edge count histogram.  I think this is not very
relevant, since the estimated time saved per call can be large.  So I dropped
this logic and replaced it with a simple use of the overall saved time.

Since ipa-cp does not deal well with cases where it hits the allowed unit
growth limit, we probably want to be more careful, so I keep the existing
metric alongside this change.

So now we get:

Evaluating opportunities for test1/3.
 - considering value 1000 for param #0 sz (caller_count: 1)
     good_cloning_opportunity_p (time: 1, size: 8, count_sum: 1 (precise), overall time saved: 1 (adjusted)) -> evaluation: 0.12, threshold: 500
     not cloning: time saved is not hot
     good_cloning_opportunity_p (time: 129001, size: 20, count_sum: 1 (precise), overall time saved: 129001 (adjusted)) -> evaluation: 6450.05, threshold: 500

The first call to good_cloning_opportunity_p considers the case where only
test1 is cloned. In this case the time saved is 1 (for passing the value
around) and since it is called just once (count_sum) the overall time saved is
1, which is not considered hot, and we also get a very low evaluation score.

In the second call we consider the cloning chain test1->test2.  In this case
the time saved is large (129001) since test2 is invoked many times and the
value controls its loop.  We still know that the count is 1, but the overall
time saved is 129001, which is already considered relevant, and we clone.
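
Spelling the numbers out, the evaluation is simply the overall time saved
divided by the size cost: (129001 * 1) / 20 = 6450.05, well above the threshold
of 500, while the test1-only variant gets (1 * 1) / 8 ≈ 0.12 and is in addition
rejected because an overall saving of 1 is not hot.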

I also try to do something sensible in case we have calls both with and without
IPA profile (which can happen for comdats where the profile got lost, or with
LTO if some units were not trained).
Instead of only checking whether the sum of counts of calls with a known
profile is nonzero, I keep track of whether there are other calls and, if so,
also try the local heuristics that are used without profile feedback.
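
Schematically, the resulting decision order looks as follows (a simplified
standalone sketch with plain bool flags, not the GCC code; the real logic is
the rewritten good_cloning_opportunity_p in the patch below):

#include <stdbool.h>
#include <stdio.h>

/* Simplified model of the decision order; the real implementation works on
   profile_count/sreal and the evaluation formulas from the patch.  */
static bool
clone_decision (bool called_without_ipa_profile, bool count_sum_nonzero,
                bool profiled_evaluation_passes, bool local_evaluation_passes)
{
  /* Every caller was profiled and none of them ever ran: do not clone.  */
  if (!called_without_ipa_profile && !count_sum_nonzero)
    return false;
  /* Profiled path: overall time saved is hot and clears the threshold.  */
  if (count_sum_nonzero && profiled_evaluation_passes)
    return true;
  /* Profile is complete, so trust its negative verdict.  */
  if (!called_without_ipa_profile)
    return false;
  /* Some callers have no IPA profile: fall back to local heuristics.  */
  return local_evaluation_passes;
}

int
main ()
{
  /* Comdat whose profile got lost: local heuristics decide.  */
  printf ("%d\n", clone_decision (true, false, false, true));
  /* Fully profiled but never executed: never cloned.  */
  printf ("%d\n", clone_decision (false, false, false, true));
  return 0;
}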

The patch improves SPECint with -Ofast -fprofile-use by approx 1% by speeding
up x264 from 99.3s to 91.3s (9%) and exchange from 99.7s to 95.5s (3.3%).

We still get better runtimes without profile feedback (86.4s for x264 and 93.8s
for exchange).

The main problem I see is that ipa-cp has a global limit of 10% for growth but
does not consider the opportunities in priority order.  Consequently, if the
limit is hit, some cloning opportunities are dropped essentially at random in
favour of others.

I dumped unit size changes for an -flto -Ofast build of SPEC2017. Without the
patch I get:

orig    new     growth
588677  605385  102.838229      
4378    6037    137.894016      
484650  494851  102.104818      
4111    4111    100.000000      
99953   103519  103.567677      
106181  114889  108.201091      
21389   21597   100.972462      
24925   26746   107.305918      
15308   23974   156.610922      
27354   27906   102.017986      
494     494     100.000000      
4631    4631    100.000000      
863216  872729  101.102042      
126604  126604  100.000000      
605138  627156  103.638509      
4112    4112    100.000000      
222006  231293  104.183220      
2952    3384    114.634146      
37584   39807   105.914751      
4111    4111    100.000000      
13226   13226   100.000000      
4111    4111    100.000000      
326215  337396  103.427494      
25240   25433   100.764659      
64644   65972   102.054328      
127223  132300  103.990631      
494     494     100.000000      

Small units are allowed to grow up to 16000 instructions and the other units
are large, so there is only one 156% growth hitting the limit, which is
exchange, where recursive cloning is handled specially.

With profile feedback ipa-cp basically shuts itself off:

333815  333891  100.022767
2559    2974    116.217272
217576  217581  100.002298
2749    2749    100.000000
64652   64716   100.098992
68416   69707   101.886986
13171   13171   100.000000
11849   11849   100.000000
10519   16180   153.816903
15843   15843   100.000000
231     231     100.000000
3624    3624    100.000000
573385  573386  100.000174
97623   97623   100.000000
295673  295676  100.001015
2750    2750    100.000000
130723  130726  100.002295
2334    2334    100.000000
19313   19313   100.000000
2749    2749    100.000000
517331  517331  100.000000
6707    6707    100.000000
2749    2749    100.000000
193638  193638  100.000000
16425   16425   100.000000
47154   47154   100.000000
96422   96422   100.000000
231     231     100.000000

So we essentially clone only in exchange and mcf (116%).
With the patch and no FDO I get:

588677  605385  102.838229
4378    6037    137.894016
484519  494698  102.100846
4111    4111    100.000000
99953   103519  103.567677
106181  114889  108.201091
21389   22632   105.811398
24854   26620   107.105496
15308   23974   156.610922
27354   28039   102.504204
494     494     100.000000
4631    4631    100.000000
4631    4631    100.000000
126604  126630  100.020536
4112    4112    100.000000
222006  231293  104.183220
2952    3384    114.634146
37584   39807   105.914751
2760715 2835539 102.710312
4111    4111    100.000000
13226   13226   100.000000
4111    4111    100.000000
326215  337396  103.427494
25240   25433   100.764659
64644   65972   102.054328
127223  132300  103.990631
494     494     100.000000

which seems essentially the same as without the patch. However with FDO I get:
333815  350363  104.957237
2559    3345    130.715123
217469  220765  101.515618
485599  488772  100.653420
2749    2749    100.000000
64652   74265   114.868836
68416   87484   127.870674
13171   20656   156.829398
11792   11990   101.679104
10519   17028   161.878506
15843   16119   101.742094
231     231     100.000000
573336  573336  100.000000
97623   97623   100.000000
295497  296208  100.240612
2750    2750    100.000000
130723  133341  102.002708
2334    2334    100.000000
19313   19368   100.284782
2749    2749    100.000000
6707    6755    100.715670
2749    2749    100.000000
193638  194712  100.554643
16425   17377   105.796043
47154   47154   100.000000
96422   96422   100.000000
231     231     100.000000

So here we get 114% and 127% growth in x264 (two different binaries), 56%
growth in deepsjeng and 61% growth in exchange, which are all above the 10%
cutoff.

Bootstrapped/regtested x86_64-linux.  

gcc/ChangeLog:

        * ipa-cp.cc (base_count): Remove.
        (struct caller_statistics): Rename n_hot_calls to n_interesting_calls;
        add called_without_ipa_profile.
        (init_caller_stats): Update.
        (cs_interesting_for_ipcp_p): New function.
        (gather_caller_stats): Collect n_interesting_calls and
        called_without_ipa_profile.
        (ipcp_cloning_candidate_p): Use n_interesting_calls rather than
        n_hot_calls.
        (good_cloning_opportunity_p): Rewrite heuristics when IPA profile is
        present.
        (estimate_local_effects): Update.
        (value_topo_info::propagate_effects): Update.
        (compare_edge_profile_counts): Remove.
        (ipcp_propagate_stage): Do not collect base_count.
        (get_info_about_necessary_edges): Record whether function is called
        without profile.
        (decide_about_value): Update.
        (ipa_cp_cc_finalize): Do not initialize base_count.
        * profile-count.cc (profile_count::operator*): New.
        (profile_count::operator*=): New.
        * profile-count.h (profile_count::operator*): Declare.
        (profile_count::operator*=): Declare.
        * params.opt: Remove ipa-cp-profile-count-base.
        * doc/invoke.texi: Likewise.


diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4fbb4cda101..1ef24215002 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16795,11 +16795,6 @@ Maximum depth of recursive cloning for self-recursive function.
 Recursive cloning only when the probability of call being executed exceeds
 the parameter.
 
-@item ipa-cp-profile-count-base
-When using @option{-fprofile-use} option, IPA-CP will consider the measured
-execution count of a call graph edge at this percentage position in their
-histogram as the basis for its heuristics calculation.
-
 @item ipa-cp-recursive-freq-factor
 The number of times interprocedural copy propagation expects recursive
 functions to call themselves.
diff --git a/gcc/ipa-cp.cc b/gcc/ipa-cp.cc
index 2347a5fcefc..a276e22d056 100644
--- a/gcc/ipa-cp.cc
+++ b/gcc/ipa-cp.cc
@@ -147,10 +147,6 @@ object_allocator<ipcp_value_source<tree> > ipcp_sources_pool
 object_allocator<ipcp_agg_lattice> ipcp_agg_lattice_pool
   ("IPA_CP aggregate lattices");
 
-/* Base count to use in heuristics when using profile feedback.  */
-
-static profile_count base_count;
-
 /* Original overall size of the program.  */
 
 static long overall_size, orig_overall_size;
@@ -488,14 +484,16 @@ struct caller_statistics
   profile_count count_sum;
   /* Sum of all frequencies for all calls.  */
   sreal freq_sum;
-  /* Number of calls and hot calls respectively.  */
-  int n_calls, n_hot_calls;
+  /* Number of calls and calls considered interesting respectively.  */
+  int n_calls, n_interesting_calls;
   /* If itself is set up, also count the number of non-self-recursive
      calls.  */
   int n_nonrec_calls;
   /* If non-NULL, this is the node itself and calls from it should have their
      counts included in rec_count_sum and not count_sum.  */
   cgraph_node *itself;
+  /* True if there is a caller that has no IPA profile.  */
+  bool called_without_ipa_profile;
 };
 
 /* Initialize fields of STAT to zeroes and optionally set it up so that edges
@@ -507,10 +505,39 @@ init_caller_stats (caller_statistics *stats, cgraph_node *itself = NULL)
   stats->rec_count_sum = profile_count::zero ();
   stats->count_sum = profile_count::zero ();
   stats->n_calls = 0;
-  stats->n_hot_calls = 0;
+  stats->n_interesting_calls = 0;
   stats->n_nonrec_calls = 0;
   stats->freq_sum = 0;
   stats->itself = itself;
+  stats->called_without_ipa_profile = false;
+}
+
+/* We want to propagate across edges that may be executed, however
+   we do not want to check maybe_hot, since the call itself may be cold
+   while the callee contains some heavy loop which still makes the
+   propagation relevant.
+
+   In particular, even an edge called once may lead to a significant
+   improvement.  */
+
+static bool
+cs_interesting_for_ipcp_p (cgraph_edge *e)
+{
+  /* If profile says the edge is executed, we want to optimize.  */
+  if (e->count.ipa ().nonzero_p ())
+    return true;
+  /* If the local (possibly guessed or adjusted) zero profile claims the
+     edge is not executed, do not propagate.  */
+  if (!e->count.nonzero_p ())
+    return false;
+  /* If the IPA profile says the edge is executed zero times, but the
+     zero is of ADJUSTED quality, still consider it for cloning in
+     case we have partial training.  */
+  if (e->count.ipa ().initialized_p ()
+      && opt_for_fn (e->callee->decl,flag_profile_partial_training)
+      && e->count.nonzero_p ())
+    return false;
+  return true;
 }
 
 /* Worker callback of cgraph_for_node_and_aliases accumulating statistics of
@@ -536,13 +563,18 @@ gather_caller_stats (struct cgraph_node *node, void *data)
            else
              stats->count_sum += cs->count.ipa ();
          }
+       else
+         stats->called_without_ipa_profile = true;
        stats->freq_sum += cs->sreal_frequency ();
        stats->n_calls++;
        if (stats->itself && stats->itself != cs->caller)
          stats->n_nonrec_calls++;
 
-       if (cs->maybe_hot_p ())
-         stats->n_hot_calls ++;
+       /* If the profile is known to be zero, we do not want to clone for
+          performance.  However, if the call is merely cold, the called
+          function may still contain important hot loops.  */
+       if (cs_interesting_for_ipcp_p (cs))
+         stats->n_interesting_calls++;
       }
   return false;
 
@@ -585,26 +617,11 @@ ipcp_cloning_candidate_p (struct cgraph_node *node)
                 node->dump_name ());
       return true;
     }
-
-  /* When profile is available and function is hot, propagate into it even if
-     calls seems cold; constant propagation can improve function's speed
-     significantly.  */
-  if (stats.count_sum > profile_count::zero ()
-      && node->count.ipa ().initialized_p ())
-    {
-      if (stats.count_sum > node->count.ipa ().apply_scale (90, 100))
-       {
-         if (dump_file)
-           fprintf (dump_file, "Considering %s for cloning; "
-                    "usually called directly.\n",
-                    node->dump_name ());
-         return true;
-       }
-    }
-  if (!stats.n_hot_calls)
+  if (!stats.n_interesting_calls)
     {
       if (dump_file)
-       fprintf (dump_file, "Not considering %s for cloning; no hot calls.\n",
+       fprintf (dump_file, "Not considering %s for cloning; "
+                "no calls considered interesting by profile.\n",
                 node->dump_name ());
       return false;
     }
@@ -3361,24 +3378,29 @@ incorporate_penalties (cgraph_node *node, ipa_node_params *info,
 static bool
 good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
                            sreal freq_sum, profile_count count_sum,
-                           int size_cost)
+                           int size_cost, bool called_without_ipa_profile)
 {
+  gcc_assert (count_sum.ipa () == count_sum);
   if (time_benefit == 0
       || !opt_for_fn (node->decl, flag_ipa_cp_clone)
-      || node->optimize_for_size_p ())
+      || node->optimize_for_size_p ()
+      /* If there is no call which was either executed in the train run or
+        has a missing profile, we do not want to clone.  */
+      || (!called_without_ipa_profile && !count_sum.nonzero_p ()))
     return false;
 
   gcc_assert (size_cost > 0);
 
   ipa_node_params *info = ipa_node_params_sum->get (node);
   int eval_threshold = opt_for_fn (node->decl, param_ipa_cp_eval_threshold);
+  /* If we know the IPA execution counts, we can estimate the overall
+     speedup of the program.  */
   if (count_sum.nonzero_p ())
     {
-      gcc_assert (base_count.nonzero_p ());
-      sreal factor = count_sum.probability_in (base_count).to_sreal ();
-      sreal evaluation = (time_benefit * factor) / size_cost;
+      profile_count saved_time = count_sum * time_benefit;
+      sreal evaluation = saved_time.to_sreal_scale (profile_count::one ())
+                             / size_cost;
       evaluation = incorporate_penalties (node, info, evaluation);
-      evaluation *= 1000;
 
       if (dump_file && (dump_flags & TDF_DETAILS))
        {
@@ -3386,33 +3408,46 @@ good_cloning_opportunity_p (struct cgraph_node *node, sreal time_benefit,
                   "size: %i, count_sum: ", time_benefit.to_double (),
                   size_cost);
          count_sum.dump (dump_file);
+         fprintf (dump_file, ", overall time saved: ");
+         saved_time.dump (dump_file);
          fprintf (dump_file, "%s%s) -> evaluation: %.2f, threshold: %i\n",
                 info->node_within_scc
                   ? (info->node_is_self_scc ? ", self_scc" : ", scc") : "",
                 info->node_calling_single_call ? ", single_call" : "",
                   evaluation.to_double (), eval_threshold);
        }
-
-      return evaluation.to_int () >= eval_threshold;
+      gcc_checking_assert (saved_time == saved_time.ipa ());
+      if (!maybe_hot_count_p (NULL, saved_time))
+       {
+         if (dump_file && (dump_flags & TDF_DETAILS))
+           fprintf (dump_file, "     not cloning: time saved is not hot\n");
+       }
+      /* Evaluation approximately corresponds to time saved per instruction
+        introduced.  This is likely almost always going to be true, since we
+        already checked that the time saved is large enough to be considered
+        hot.  */
+      else if (evaluation.to_int () >= eval_threshold)
+       return true;
+      /* If all call sites have a known profile, we know we do not want to
+        clone.  If there are calls with an unknown profile, try the local
+        heuristics.  */
+      if (!called_without_ipa_profile)
+       return false;
     }
-  else
-    {
-      sreal evaluation = (time_benefit * freq_sum) / size_cost;
-      evaluation = incorporate_penalties (node, info, evaluation);
-      evaluation *= 1000;
+  sreal evaluation = (time_benefit * freq_sum) / size_cost;
+  evaluation = incorporate_penalties (node, info, evaluation);
+  evaluation *= 1000;
 
-      if (dump_file && (dump_flags & TDF_DETAILS))
-       fprintf (dump_file, "     good_cloning_opportunity_p (time: %g, "
-                "size: %i, freq_sum: %g%s%s) -> evaluation: %.2f, "
-                "threshold: %i\n",
-                time_benefit.to_double (), size_cost, freq_sum.to_double (),
-                info->node_within_scc
-                  ? (info->node_is_self_scc ? ", self_scc" : ", scc") : "",
-                info->node_calling_single_call ? ", single_call" : "",
-                evaluation.to_double (), eval_threshold);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "     good_cloning_opportunity_p (time: %g, "
+            "size: %i, freq_sum: %g%s%s) -> evaluation: %.2f, "
+            "threshold: %i\n",
+            time_benefit.to_double (), size_cost, freq_sum.to_double (),
+            info->node_within_scc
+              ? (info->node_is_self_scc ? ", self_scc" : ", scc") : "",
+            info->node_calling_single_call ? ", single_call" : "",
+            evaluation.to_double (), eval_threshold);
 
-      return evaluation.to_int () >= eval_threshold;
-    }
+  return evaluation.to_int () >= eval_threshold;
 }
 
 /* Grow vectors in AVALS and fill them with information about values of
@@ -3605,7 +3640,8 @@ estimate_local_effects (struct cgraph_node *node)
                     "known contexts, code not going to grow.\n");
        }
       else if (good_cloning_opportunity_p (node, time, stats.freq_sum,
-                                          stats.count_sum, size))
+                                          stats.count_sum, size,
+                                          stats.called_without_ipa_profile))
        {
          if (size + overall_size <= get_max_overall_size (node))
            {
@@ -3971,7 +4007,7 @@ value_topo_info<valtype>::propagate_effects ()
          processed_srcvals.empty ();
          for (src = val->sources; src; src = src->next)
            if (src->val
-               && src->cs->maybe_hot_p ())
+               && cs_interesting_for_ipcp_p (src->cs))
              {
                if (!processed_srcvals.add (src->val))
                  {
@@ -4016,21 +4052,6 @@ value_topo_info<valtype>::propagate_effects ()
     }
 }
 
-/* Callback for qsort to sort counts of all edges.  */
-
-static int
-compare_edge_profile_counts (const void *a, const void *b)
-{
-  const profile_count *cnt1 = (const profile_count *) a;
-  const profile_count *cnt2 = (const profile_count *) b;
-
-  if (*cnt1 < *cnt2)
-    return 1;
-  if (*cnt1 > *cnt2)
-    return -1;
-  return 0;
-}
-
 
 /* Propagate constants, polymorphic contexts and their effects from the
    summaries interprocedurally.  */
@@ -4043,10 +4064,6 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
   if (dump_file)
     fprintf (dump_file, "\n Propagating constants:\n\n");
 
-  base_count = profile_count::uninitialized ();
-
-  bool compute_count_base = false;
-  unsigned base_count_pos_percent = 0;
   FOR_EACH_DEFINED_FUNCTION (node)
   {
     if (node->has_gimple_body_p ()
@@ -4063,57 +4080,8 @@ ipcp_propagate_stage (class ipa_topo_info *topo)
     ipa_size_summary *s = ipa_size_summaries->get (node);
     if (node->definition && !node->alias && s != NULL)
       overall_size += s->self_size;
-    if (node->count.ipa ().initialized_p ())
-      {
-       compute_count_base = true;
-       unsigned pos_percent = opt_for_fn (node->decl,
-                                          param_ipa_cp_profile_count_base);
-       base_count_pos_percent = MAX (base_count_pos_percent, pos_percent);
-      }
   }
 
-  if (compute_count_base)
-    {
-      auto_vec<profile_count> all_edge_counts;
-      all_edge_counts.reserve_exact (symtab->edges_count);
-      FOR_EACH_DEFINED_FUNCTION (node)
-       for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
-         {
-           profile_count count = cs->count.ipa ();
-           if (!count.nonzero_p ())
-             continue;
-
-           enum availability avail;
-           cgraph_node *tgt
-             = cs->callee->function_or_virtual_thunk_symbol (&avail);
-           ipa_node_params *info = ipa_node_params_sum->get (tgt);
-           if (info && info->versionable)
-             all_edge_counts.quick_push (count);
-         }
-
-      if (!all_edge_counts.is_empty ())
-       {
-         gcc_assert (base_count_pos_percent <= 100);
-         all_edge_counts.qsort (compare_edge_profile_counts);
-
-         unsigned base_count_pos
-           = ((all_edge_counts.length () * (base_count_pos_percent)) / 100);
-         base_count = all_edge_counts[base_count_pos];
-
-         if (dump_file)
-           {
-             fprintf (dump_file, "\nSelected base_count from %u edges at "
-                      "position %u, arriving at: ", all_edge_counts.length (),
-                      base_count_pos);
-             base_count.dump (dump_file);
-             fprintf (dump_file, "\n");
-           }
-       }
-      else if (dump_file)
-       fprintf (dump_file, "\nNo candidates with non-zero call count found, "
-                "continuing as if without profile feedback.\n");
-    }
-
   orig_overall_size = overall_size;
 
   if (dump_file)
@@ -4375,15 +4343,17 @@ static bool
 get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
                                sreal *freq_sum, int *caller_count,
                                profile_count *rec_count_sum,
-                               profile_count *nonrec_count_sum)
+                               profile_count *nonrec_count_sum,
+                               bool *called_without_ipa_profile)
 {
   ipcp_value_source<valtype> *src;
   sreal freq = 0;
   int count = 0;
   profile_count rec_cnt = profile_count::zero ();
   profile_count nonrec_cnt = profile_count::zero ();
-  bool hot = false;
+  bool interesting = false;
   bool non_self_recursive = false;
+  *called_without_ipa_profile = false;
 
   for (src = val->sources; src; src = src->next)
     {
@@ -4394,15 +4364,19 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
            {
              count++;
              freq += cs->sreal_frequency ();
-             hot |= cs->maybe_hot_p ();
+             interesting |= cs_interesting_for_ipcp_p (cs);
              if (cs->caller != dest)
                {
                  non_self_recursive = true;
                  if (cs->count.ipa ().initialized_p ())
                    rec_cnt += cs->count.ipa ();
+                 else
+                   *called_without_ipa_profile = true;
                }
              else if (cs->count.ipa ().initialized_p ())
                nonrec_cnt += cs->count.ipa ();
+             else
+               *called_without_ipa_profile = true;
            }
          cs = get_next_cgraph_edge_clone (cs);
        }
@@ -4418,19 +4392,7 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest,
   *rec_count_sum = rec_cnt;
   *nonrec_count_sum = nonrec_cnt;
 
-  if (!hot && ipa_node_params_sum->get (dest)->node_within_scc)
-    {
-      struct cgraph_edge *cs;
-
-      /* Cold non-SCC source edge could trigger hot recursive execution of
-        function. Consider the case as hot and rely on following cost model
-        computation to further select right one.  */
-      for (cs = dest->callers; cs; cs = cs->next_caller)
-       if (cs->caller == dest && cs->maybe_hot_p ())
-         return true;
-    }
-
-  return hot;
+  return interesting;
 }
 
 /* Given a NODE, and a set of its CALLERS, try to adjust order of the callers
@@ -5914,6 +5876,7 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
   sreal freq_sum;
   profile_count count_sum, rec_count_sum;
   vec<cgraph_edge *> callers;
+  bool called_without_ipa_profile;
 
   if (val->spec_node)
     {
@@ -5929,7 +5892,8 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
       return false;
     }
   else if (!get_info_about_necessary_edges (val, node, &freq_sum, &caller_count,
-                                           &rec_count_sum, &count_sum))
+                                           &rec_count_sum, &count_sum,
+                                           &called_without_ipa_profile))
     return false;
 
   if (!dbg_cnt (ipa_cp_values))
@@ -5966,9 +5930,11 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
 
   if (!good_cloning_opportunity_p (node, val->local_time_benefit,
                                   freq_sum, count_sum,
-                                  val->local_size_cost)
+                                  val->local_size_cost,
+                                  called_without_ipa_profile)
       && !good_cloning_opportunity_p (node, val->prop_time_benefit,
-                                     freq_sum, count_sum, val->prop_size_cost))
+                                     freq_sum, count_sum, val->prop_size_cost,
+                                     called_without_ipa_profile))
     return false;
 
   if (dump_file)
@@ -6550,7 +6516,6 @@ make_pass_ipa_cp (gcc::context *ctxt)
 void
 ipa_cp_cc_finalize (void)
 {
-  base_count = profile_count::uninitialized ();
   overall_size = 0;
   orig_overall_size = 0;
   ipcp_free_transformation_sum ();
diff --git a/gcc/params.opt b/gcc/params.opt
index 4f4eb4d7a2a..6f5d940dc95 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -273,10 +273,6 @@ The size of translation unit that IPA-CP pass considers large.
 Common Joined UInteger Var(param_ipa_cp_value_list_size) Init(8) Param Optimization
 Maximum size of a list of values associated with each parameter for interprocedural constant propagation.
 
--param=ipa-cp-profile-count-base=
-Common Joined UInteger Var(param_ipa_cp_profile_count_base) Init(10) IntegerRange(0, 100) Param Optimization
-When using profile feedback, use the edge at this percentage position in frequency histogram as the bases for IPA-CP heuristics.
-
 -param=ipa-jump-function-lookups=
 Common Joined UInteger Var(param_ipa_jump_function_lookups) Init(8) Param Optimization
 Maximum number of statements visited during jump function offset discovery.
diff --git a/gcc/profile-count.cc b/gcc/profile-count.cc
index 8b9d8e18c51..374f06f4c08 100644
--- a/gcc/profile-count.cc
+++ b/gcc/profile-count.cc
@@ -519,3 +519,26 @@ profile_probability::pow (int n) const
     }
   return ret;
 }
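+
+/* Return the count multiplied by the sreal NUM.  The result is capped at the
+   maximal representable count and its quality is degraded to at most
+   ADJUSTED.  */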
+profile_count
+profile_count::operator* (const sreal &num) const
+{
+  if (m_val == 0)
+    return *this;
+  if (!initialized_p ())
+    return uninitialized ();
+  sreal scaled = num * m_val;
+  gcc_checking_assert (scaled >= 0);
+  profile_count ret;
+  if (scaled > max_count)
+    ret.m_val = max_count;
+  else
+    ret.m_val = scaled.to_nearest_int ();
+  ret.m_quality = MIN (m_quality, ADJUSTED);
+  return ret;
+}
+
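+/* Multiply the count in place by the sreal NUM.  */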
+profile_count
+profile_count::operator*= (const sreal &num)
+{
+  *this = *this * num;
+  return *this;
+}
diff --git a/gcc/profile-count.h b/gcc/profile-count.h
index 015aee981ca..0e79fd241b5 100644
--- a/gcc/profile-count.h
+++ b/gcc/profile-count.h
@@ -1061,6 +1061,9 @@ public:
       return *this;
     }
 
+  profile_count operator* (const sreal &num) const;
+  profile_count operator*= (const sreal &num);
+
   profile_count operator/ (int64_t den) const
     {
       return apply_scale (1, den);
diff --git a/gcc/testsuite/gcc.dg/ipa/ipa-clone-4.c b/gcc/testsuite/gcc.dg/ipa/ipa-clone-4.c
new file mode 100644
index 00000000000..7c7b27e3829
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/ipa/ipa-clone-4.c
@@ -0,0 +1,30 @@
+/* { dg-options "-O3 -fdump-ipa-cp" } */
+__attribute__ ((used))
+int a[1000];
+
+__attribute__ ((noinline))
+void
+test2(int sz)
+{
+  for (int i = 0; i < sz; i++)
+         a[i]++;
+  asm volatile (""::"m"(a));
+}
+
+__attribute__ ((noinline))
+void
+test1 (int sz)
+{
+  for (int i = 0; i < 1000; i++)
+         test2(sz);
+}
+int main()
+{
+       test1(1000);
+       return 0;
+}
+/* We should clone test1 and test2 for constant 1000.
+   In the past we did not do this since we did not clone for edges that are not hot
+   and call main->test1 is not considered hot since it is executed just once.  */
+/* { dg-final-use { scan-ipa-dump-times "Creating a specialized node of void test1" 1 "cp"} } */
+/* { dg-final-use { scan-ipa-dump-times "Creating a specialized node of void test2" 1 "cp"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-prof/ipa-cp-1.c b/gcc/testsuite/gcc.dg/tree-prof/ipa-cp-1.c
new file mode 100644
index 00000000000..591eb9c16c4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-prof/ipa-cp-1.c
@@ -0,0 +1,30 @@
+/* { dg-options "-O2 -fdump-ipa-cp" } */
+__attribute__ ((used))
+int a[1000];
+
+__attribute__ ((noinline))
+void
+test2(int sz)
+{
+  for (int i = 0; i < sz; i++)
+         a[i]++;
+  asm volatile (""::"m"(a));
+}
+
+__attribute__ ((noinline))
+void
+test1 (int sz)
+{
+  for (int i = 0; i < 1000; i++)
+         test2(sz);
+}
+int main()
+{
+       test1(1000);
+       return 0;
+}
+/* We should clone test1 and test2 for constant 1000.
+   In the past we did not do this since we did not clone for edges that are not hot
+   and call main->test1 is not considered hot since it is executed just once.  */
+/* { dg-final-use { scan-ipa-dump-times "Creating a specialized node of void test1" 1 "cp"} } */
+/* { dg-final-use { scan-ipa-dump-times "Creating a specialized node of void test2" 1 "cp"} } */

Reply via email to