Hi Honza,

I am finally getting back to working on this after a few weeks spent on some other priorities.
On Sat, Aug 31, 2013 at 2:46 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Hi,
> I ran my script on the execute testsuite and looked into a few testcases.
> The problem I found was roundoff errors - i.e. when expanding a switch we
> assign a 50% chance that an out-of-bounds value is above or below. Both
> get rounded to 0, because the switch is executed once and the value is
> below.
>
> Partly this can be fixed by making probably_never_executed consider
> frequencies when the counts are too coarse:
>
> Index: predict.c
> ===================================================================
> --- predict.c   (revision 202133)
> +++ predict.c   (working copy)
> @@ -232,8 +232,22 @@ bool
>  probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>  {
>    gcc_checking_assert (fun);
> -  if (profile_info && flag_branch_probabilities)
> -    return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;
> +  if (profile_status_for_function (fun) == PROFILE_READ)
> +    {
> +      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
> +       return false;
> +      if (!bb->frequency)
> +       return true;
> +      if (!ENTRY_BLOCK_PTR->frequency)
> +       return false;
> +      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
> +       {
> +         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
> +                       ENTRY_BLOCK_PTR->frequency)
> +                 < REG_BR_PROB_BASE / 4);
> +       }
> +      return true;
> +    }
>    if ((!profile_info || !flag_branch_probabilities)
>        && (cgraph_get_node (fun->decl)->frequency
>            == NODE_FREQUENCY_UNLIKELY_EXECUTED))

Did you mean to commit the above change? I see that it went in as part of
r202258 but doesn't show up in the ChangeLog entry for that revision.

> In other cases it was mostly loop unrolling in combination with jump
> threading. So I modified my script to separately report when the failure
> happens for a test trained once and for a test trained a hundred times.

Thanks for the linker script. I reproduced your results.

I looked at a couple of cases. The first was one that failed after 1
training run only (20000910-2.c). It was due to jump threading, which you
noted was a problem. For this one I think we can handle it in the
partitioning, since there is an FDO insanity that we could probably treat
more conservatively when splitting.

I also looked at one that failed after 100 training runs (20031204-1.c).
In this case it was due to expansion, which was creating multiple
branches/bbs from a logical OR and guessing incorrectly how to assign the
counts:

  if (octets == 4 && (*cp == ':' || *cp == '\0')) {

The (*cp == ':' || *cp == '\0') part looked like the following going into
RTL expansion:

  [20031204-1.c : 31:33] _29 = _28 == 58;
  [20031204-1.c : 31:33] _30 = _28 == 0;
  [20031204-1.c : 31:33] _31 = _29 | _30;
  [20031204-1.c : 31:18] if (_31 != 0)
    goto <bb 16>;
  else
    goto <bb 19>;

where the result of the OR was always true, so bb 16 had a count of 100 and
bb 19 a count of 0. When it was expanded, the expanded version of the above
turned into 2 bbs with a branch in between. Both comparisons were done in
the first bb, but the first bb checked whether the result of the
*cp == '\0' compare was true, and if not branched to the check for whether
the *cp == ':' compare was true. It gave the branch to the second check
against ':' a count of 0, so that bb got a count of 0 and was split out,
and it put the count of 100 on the fall through, assuming the compare with
'\0' always evaluated to true. In reality, this OR condition was always
true because *cp was ':', not '\0'.
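Schematically, the expanded control flow and the counts that expansion
guessed look roughly like the following (a hand-written C-level sketch just
to illustrate the point; the block labels and temporaries are made up, this
is not an actual dump):

  /* Illustrative sketch only: a C rendering of the expanded code for
     (*cp == ':' || *cp == '\0') and the counts expansion guessed.  */
  static int
  expanded_or_sketch (const char *cp)
  {
    int c_nul   = (*cp == '\0');
    int c_colon = (*cp == ':');

    /* First bb, count 100: both compares are computed here.  */
    if (!c_nul)
      goto bb_second_check;   /* branch to the second check: guessed count 0 */
    goto bb16;                /* fall through: guessed count 100 */

  bb_second_check:
    /* Guessed count 0, so this bb is split out as cold.  But since *cp was
       ':' (never '\0') during training, this is the block that actually
       executes at run time.  */
    if (c_colon)
      goto bb16;
    goto bb19;

  bb16:
    return 1;   /* the OR was true */
  bb19:
    return 0;   /* the OR was false */
  }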
Therefore, the count of 0 on the second block, the one with the check for
':', was incorrect; we ended up trying to execute it and failed. Presumably
we had the correct profile data for both blocks, but the accuracy was
reduced when the OR was represented as a logical computation with a single
branch. We could change the expansion code to do something different, e.g.
treat it as a 50-50 branch, but we would still end up with integer
truncation issues when there was a single training run. That could be dealt
with conservatively in the bbpart code, as I suggested for the jump
threading issue above, i.e. a cold block with incoming non-cold edges would
conservatively not be marked cold for splitting.
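To illustrate what I mean by handling it conservatively in the partitioning
code, something along these lines (an untested sketch assuming the usual
GCC internal headers; the helper name and its exact placement in the
bb-reorder code are just for illustration):

  /* Illustrative sketch only (untested): only treat BB as cold for
     splitting if it is probably never executed AND none of its incoming
     edges comes from a non-cold block.  A zero-count block reached from a
     hot block is likely a profile insanity (integer truncation, jump
     threading, expansion guesses), so keep it in the hot section.  */
  static bool
  conservatively_cold_bb_p (struct function *fun, basic_block bb)
  {
    edge e;
    edge_iterator ei;

    if (!probably_never_executed_bb_p (fun, bb))
      return false;

    FOR_EACH_EDGE (e, ei, bb->preds)
      if (!probably_never_executed_bb_p (fun, e->src))
        /* Incoming edge from a non-cold block: do not split BB out.  */
        return false;

    return true;
  }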
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
> FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
> FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>
> FAIL1 is a failure after one training run, FAIL is a failure after 100
> training runs. We should take a look at the FAILs and see if there are
> bugs to fix. For FAIL1 I think it is kind of a design problem: while
> implementing counts & frequencies the idea was that small counts do not
> matter, so integer arithmetic is all right.
>
> I wonder if with the current C++ wonderland we can't simply switch count
> to a better representation. Either sreal or fixed point with capping
> (the integer overflow issues are tiring, too).

It also seems like we should be able to detect the profile insanities
caused by integer truncation and handle them conservatively. That being
said, I see some sreal uses already in the profile.c code, so presumably we
could use this for the counts as well if it turns out to be necessary?

BTW, Rong also implemented his runtime patch to do the COMDAT profile
merging. However, that ended up having some issues that, while solvable,
would have caused us to lose all context sensitivity from COMDATs inlined
during the profile-gen build. I am going to go back to solving this in the
profile-use phase, as we discussed in the separate thread on the COMDAT
inlining patch I had been working on.

Thanks,
Teresa

>
> Honza

-- 
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413