Re: [Patch] Handle profile counts truncated to 0 in coldness test

Teresa Johnson Thu, 10 Oct 2013 14:36:31 -0700

On Mon, Oct 7, 2013 at 10:45 PM, Teresa Johnson <tejohn...@google.com> wrote:
> On Sun, Oct 6, 2013 at 6:43 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>>> 2013-10-03  Teresa Johnson  <tejohn...@google.com>
>>>
>>>         * params.def (PARAM_MIN_HOT_RUN_RATIO): New parameter.
>>
>> I would not mention HOT in the parameter name.  Blocks are 
>> hot/normal/unlikely
>> and we HOT_BB_FREQUENCY_FRACTION.
>>
>> So perhaps UNLIKELY_BB_COUNT_FRACTION with an explanation that it is relative
>> to number of train runs?
>
> Ok.
>
>>> +DEFPARAM(PARAM_MIN_HOT_RUN_RATIO,
>>> +        "min-hot-run-ratio",
>>> +         "The minimum ratio of profile runs to basic block execution count 
>>> "
>>> +         "for the block to be considered hot",
>> Perhaps as:
>>> +         "The maximum ratio of profile runs to basic block execution count 
>>> "
>>> +         "for the block to be considered unlikely",
>>
>>> +        4, 0, 10000)
>>
>> lower bound should be 1 or we divide by 0.
>
> Yep, will fix.
>
>>> Index: predict.c
>>> ===================================================================
>>> --- predict.c   (revision 203152)
>>> +++ predict.c   (working copy)
>>> @@ -237,17 +237,31 @@ probably_never_executed (struct function *fun,
>>>    gcc_checking_assert (fun);
>>>    if (profile_status_for_function (fun) == PROFILE_READ)
>>>      {
>>> -      if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>>> +      int min_run_ratio = PARAM_VALUE (PARAM_MIN_HOT_RUN_RATIO);
>>> +      if (RDIV (count * min_run_ratio, profile_info->runs) > 0)
>>
>> This way if count is 1, runs needs to be n_run_ratio * 2
>> So perhaps if (count * min_run_ratio >= profile_info->runs) instead of 
>> division.
>
> Ok.
>
>>> -      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < 
>>> REG_BR_PROB_BASE)
>>> +      if (ENTRY_BLOCK_PTR->count)
>>>         {
>>> -         return (RDIV (frequency * ENTRY_BLOCK_PTR->count,
>>> -                       ENTRY_BLOCK_PTR->frequency)
>>> -                 < REG_BR_PROB_BASE / 4);
>>> +          gcov_type scaled_count
>>> +              = frequency * ENTRY_BLOCK_PTR->count * min_run_ratio;
>>> +          gcov_type computed_count;
>>> +          /* Check for overflow, in which case ENTRY_BLOCK_PTR->count 
>>> should
>>> +             be large enough to do the division first without losing much
>>> +             precision.  */
>>> +          if (scaled_count/frequency/min_run_ratio != 
>>> ENTRY_BLOCK_PTR->count)
>>
>> I do not like the undefined behaviour triggered before the check (sure 
>> unsigned
>> arithmetic would fix that, but it seems all strange).  What about guarding 
>> the
>> whole code.
>>
>> if (ENTRY_BLOCK_PTR->count && count < REG_BR_PROB_BASE)
>> For large counts the roudoff problems in first test should not be problem.
>
> Sure, that sounds reasonable.


Actually, we don't care about count at this point, just ENTRY_BLOCK_PTR->count.

>
>>> +            {
>>> +              computed_count = RDIV (ENTRY_BLOCK_PTR->count,
>>> +                                     ENTRY_BLOCK_PTR->frequency);
>>> +              computed_count *= frequency * min_run_ratio;
>>> +            }
>>> +          else
>>> +            computed_count = RDIV (scaled_count, 
>>> ENTRY_BLOCK_PTR->frequency);
>>> +          if (RDIV (computed_count * min_run_ratio, profile_info->runs) > 
>>> 0)
>>
>> So at nonoverflow path you compute
>>
>>    ((frequency * entry_bb_count * min_run_ratio) / entry_bb_frequency) * 
>> min_run_ratio / runs > 0.5
>
> Yes, there is an extra min_run_ratio factor in there that I need to remove!
>>
>> I think you want
>>
>>    ((frequency * entry_bb_count * min_run_ratio) / entry_bb_frequency >= runs
>
> Right, will change that too.

Note that the above change essentially decreases the min_run_ratio in
half (since we are comparing >= runs instead of >= 0.5 runs due to the
rounding divide).

>
>>
>> If entry_bb_frequency is 0, you can just return true iff frequency is > 
>> min_run_ratio.
>
> We already return false conservatively if entry_bb_frequency is 0. Do
> we really want to change this - I'm not sure it makes sense to compare
> the frequency to the run ratio directly.
>
>>
>> I think safe way to compute this is
>>
>> if (ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE * REG_BR_PROB_BASE)
>>   scaled_count = ((frequency * entry_bb_count * min_run_ratio) / 
>> entry_bb_frequency;
>> else
>>   scaled_count = ((frequency * entry_bb_count) / entry_bb_frequency) * 
>> min_run_ratio
>>
>> In first path, we have at most 10000*10000*10000*10000 that still fits in 
>> 64bit
>> value.
>>
>> In the second path the base of fixed point arithmetic is large enough so the
>> division should not matter. We never get over entry_count * 10000 that is 
>> safe.
>> The first part of statement however just compute value that should match 
>> count
>> so count > 10000 unless profile got misupated and the low-count roundoff 
>> errors should
>> not matter. So perhaps the whole code can execute only if
>> if (ENTRY_BLOCK_PTR->count && count < REG_BR_PROB_BASE
>>     && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE * REG_BR_PROB_BASE)
>> and we do not need any conditionals for scaled_count

As noted above, we don't use count here, so this can be simplified to:

if (ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE * REG_BR_PROB_BASE)

(and ENTRY_BLOCK_PTR->count > 0, which is checked earlier). See the
fixed patch below.

>>
>> Does such code work with ratio changed to 16?  I think the double 
>> multiplication in your
>> patch may have caused problems.
>
> Let me fix this up and will get back.

As noted above, the change to avoid the RDIV by runs decreases the
effect of the min_run_ratio (now unlikely_count_fraction) in half. So
we really need to increase this parameter to 32 to compare to what the
old version of the code was doing.

Indeed, using 32 does fix the same set of issues. However, when
setting the parameter to the old default of 4, there are 5 core dumps
using the patch to make the unlikely code unexecutable with a 100
training runs (see below for some examples why). This goes down to 2
failures if I set the parameter to 10, and 0 if I set it to 20. Using
20 essentially means that the code has to be executed 5% of the runs,
which seems reasonable for something like splitting. What do you
think?

A couple examples why there was cold split code being executed with
100 train runs.

1) No context sensitivity for routine that is inlined:

In 
/usr/local/google/home/tejohnson/gcc_trunk_5/gcc/testsuite/gcc.c-torture/execute/pr33870.c,
we call merge_pagelist 3 times from sort_pagelist. In the profile-use
compile merge_pagelist is inlined (I believe in all 3 locations). One
of the conditional blocks (line 23) has a profile count that is less
than runs/4. It turns out that from the first merge_pagelist callsite
we always execute this block, but the other two are completely cold.
But the profile info from all three calls is of course merged, and it
looks like it is executed infrequently overall. So the block is split
but then we execute it from the inline instance corresponding to the
first call.

I'm not sure what we can do in general here. But it is another data
point suggesting to me that the blocks should be very cold to be
split.

 2) RTL expansion:

Another was in the following code from
/usr/local/google/home/tejohnson/gcc_trunk_5/gcc/testsuite/gcc.c-torture/execute/pr43784.c:

static struct s __attribute__((noinline)) rp(void)
{
  return *p;
}

where p is a global that is assigned as:

static union u v;
static struct s *p = &v.d.b;

RTL expansion synthesizes a bunch of tests and branches (alignment
handling?), and guesses at the probabilities of the resulting
branches. Unfortunately, this is less than accurate and we end up
executing a block that is given a small likelihood, making it below
the threshold with the default unlikely fraction of 4.

Here is the patch with the current threshold kept as 4. Let me know if
you think we can kick this up to 20, or if we should just increase the
threshold for splitting.

Passes regression tests, ok for trunk?

(BTW, getting close. I have a bunch of fixes to the loop unroller that
need some cleanup, and a small fix to ssa tail merging to update the
profiles, but I think that does it for the tests I was looking at.
Will send those soon for review.)

Thanks,
Teresa

2013-10-10  Teresa Johnson  <tejohn...@google.com>

        * predict.c (probably_never_executed): Compare frequency-based
        count to number of training runs.
        * params.def (UNLIKELY_BB_COUNT_FRACTION): New parameter.

Index: predict.c
===================================================================
--- predict.c   (revision 203390)
+++ predict.c   (working copy)
@@ -237,17 +237,33 @@ probably_never_executed (struct function *fun,
   gcc_checking_assert (fun);
   if (profile_status_for_function (fun) == PROFILE_READ)
     {
-      if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
+      int unlikely_count_fraction = PARAM_VALUE (UNLIKELY_BB_COUNT_FRACTION);
+      if (count * unlikely_count_fraction >= profile_info->runs)
        return false;
       if (!frequency)
        return true;
       if (!ENTRY_BLOCK_PTR->frequency)
        return false;
-      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
+      if (ENTRY_BLOCK_PTR->count)
        {
-         return (RDIV (frequency * ENTRY_BLOCK_PTR->count,
-                       ENTRY_BLOCK_PTR->frequency)
-                 < REG_BR_PROB_BASE / 4);
+          gcov_type computed_count;
+          /* Check for possibility of overflow, in which case entry bb count
+             is large enough to do the division first without losing much
+             precision.  */
+          if (ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE * REG_BR_PROB_BASE)
+            {
+              gcov_type scaled_count
+                  = frequency * ENTRY_BLOCK_PTR->count *
unlikely_count_fraction;
+              computed_count = RDIV (scaled_count, ENTRY_BLOCK_PTR->frequency);
+            }
+          else
+            {
+              computed_count = RDIV (ENTRY_BLOCK_PTR->count,
+                                     ENTRY_BLOCK_PTR->frequency);
+              computed_count *= frequency * unlikely_count_fraction;
+            }
+          if (computed_count >= profile_info->runs)
+            return false;
        }
       return true;
     }
Index: params.def
===================================================================
--- params.def  (revision 203390)
+++ params.def  (working copy)
@@ -373,6 +373,11 @@ DEFPARAM(HOT_BB_FREQUENCY_FRACTION,
         "Select fraction of the maximal frequency of executions of
basic block in function given basic block needs to have to be
considered hot",
         1000, 0, 0)

+DEFPARAM(UNLIKELY_BB_COUNT_FRACTION,
+        "unlikely-bb-count-fraction",
+         "The minimum fraction of profile runs a given basic block
execution count must be not to be considered unlikely",
+        4, 1, 10000)
+
 DEFPARAM (PARAM_ALIGN_THRESHOLD,
          "align-threshold",
          "Select fraction of the maximal frequency of executions of
basic block in function given basic block get alignment",


>
> Thanks,
> Teresa
>
>>
>> I will check the second part of patch separately.
>> Thanks!
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413

Re: [Patch] Handle profile counts truncated to 0 in coldness test

Reply via email to