https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |luoxhu at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can reproduce a regression with -Ofast -march=znver2 running on Haswell as
well.  -fopt-info doesn't reveal anything interesting besides

-fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 32987933)
+fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 129072791)

Obviously the slowdown is in P7Viterbi.  There are only minimal changes on
the GIMPLE side, one of them notable (side-by-side diff, before on the left,
after on the right):

niters_vector_mult_vf.205_2406 = niters.203_442 & 429496729 |  _2041 = niters.203_438 & 3;
_2408 = (int) niters_vector_mult_vf.205_2406;               |  if (_2041 == 0)
tmp.206_2407 = k_384 + _2408;                               |    goto <bb 66>; [25.00%]
_2300 = niters.203_442 & 3;                                 <
if (_2300 == 0)                                             <
  goto <bb 65>; [25.00%]                                    <
else                                                           else
  goto <bb 36>; [75.00%]                                         goto <bb 36>; [75.00%]

<bb 36> [local count: 41646173]:                            |  <bb 36> [local count: 177683003]:
# k_2403 = PHI <tmp.206_2407(35), tmp.239_2637(34)>         |  niters_vector_mult_vf.205_2409 = niters.203_438 & 429496729
# DEBUG k => k_2403                                         |  _2411 = (int) niters_vector_mult_vf.205_2409;
                                                            >  tmp.206_2410 = k_382 + _2411;
                                                            >
                                                            >  <bb 37> [local count: 162950122]:
                                                            >  # k_2406 = PHI <tmp.206_2410(36), tmp.239_2639(34)>

The sink pass now performs this transform where it did not before.  That's
apparently because of

  /* If BEST_BB is at the same nesting level, then require it to have
     significantly lower execution frequency to avoid gratuitous movement.  */
  if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
      /* If result of comparsion is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
      && !(best_bb->count * 100 >= early_bb->count * threshold))
    return best_bb;

  /* No better block found, so return EARLY_BB, which happens to be the
     statement's original block.  */
  return early_bb;

where the source block count is 96726596 before and 236910671 after, and the
destination block count is 72544947 before and 177683003 after.  The edge
probabilities are 75% vs 25%, and param_sink_frequency_threshold is exactly
75 as well.  Since 236910671 * 0.75 is rounded down to 177683003, the new
counts pass the test, while the old counts are an exact 75% match, which
defeats the sinking.  It's a little bit of an arbitrary choice;

diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 2e744d6ae50..9b368e13463 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -230,7 +230,7 @@ select_best_block (basic_block early_bb,
   if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
       /* If result of comparsion is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
-      && !(best_bb->count * 100 >= early_bb->count * threshold))
+      && !(best_bb->count * 100 > early_bb->count * threshold))
     return best_bb;
 
   /* No better block found, so return EARLY_BB, which happens to be the

fixes the missed sinking but not the regression :/

The count differences start to appear when LC PHI blocks are added only for
virtuals, and pre-existing 'Invalid sum of incoming counts' issues then
eventually lead to mismatches.  The 'Invalid sum of incoming counts' starts
with the loop splitting pass:

fast_algorithms.c:145:10: optimized: loop split

Xionghu Luo did profile count updates there; I'm not sure whether that made
things worse in this case.
At least with broken BB counts, splitting/unsplitting an edge can propagate
the bogus counts elsewhere, it seems.
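
For concreteness, the arithmetic behind the select_best_block threshold test
can be double-checked with a few lines of C++.  This is a minimal sketch,
with plain 64-bit integers standing in for GCC's profile_count class, using
the counts quoted from the dumps above:

#include <cstdint>
#include <cstdio>

/* Sinking into BEST_BB is rejected when
   best_bb->count * 100 >= early_bb->count * threshold.  */
static bool
rejects_sinking (int64_t best_cnt, int64_t early_cnt, int64_t threshold)
{
  return best_cnt * 100 >= early_cnt * threshold;
}

int
main ()
{
  const int64_t threshold = 75;  /* param_sink_frequency_threshold  */

  /* Before: 75% of 96726596 is exactly 72544947, so the >= test matches
     exactly and the statement is not sunk.  */
  printf ("before: %s\n",
          rejects_sinking (72544947, 96726596, threshold) ? "no sink" : "sink");

  /* After: 75% of 236910671 is 177683003.25, rounded down to 177683003
     when the edge count was computed, so 177683003 * 100 is strictly
     smaller than 236910671 * 75 and the statement is sunk.  */
  printf ("after: %s\n",
          rejects_sinking (177683003, 236910671, threshold) ? "no sink" : "sink");
}

With >= relaxed to > as in the patch above, the exact-match "before" case
sinks as well, which is why the change restores the missed sinking and makes
the two compilations behave the same.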
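
To illustrate the edge-splitting point, here is a hypothetical toy model
(deliberately not GCC's CFG or profile API) of how an already-invalid
profile spreads: a block's count should equal the sum of its incoming edge
counts, and splitting an edge materializes the (possibly stale) edge count
as a brand new block's count:

#include <cstdint>
#include <cstdio>
#include <vector>

struct toy_edge { int64_t count; };

struct toy_block
{
  int64_t count;
  std::vector<toy_edge> preds;  /* incoming edges  */

  /* The 'Invalid sum of incoming counts' check.  */
  bool invalid_sum () const
  {
    int64_t sum = 0;
    for (const toy_edge &e : preds)
      sum += e.count;
    return sum != count;
  }
};

int
main ()
{
  /* A destination block whose profile an earlier pass left stale:
     the incoming edges sum to 90, but the block claims 100.  */
  toy_block dest { 100, { { 60 }, { 30 } } };
  printf ("dest has invalid sum: %d\n", (int) dest.invalid_sum ());

  /* Splitting the first edge inserts a forwarder block whose count is
     taken from that edge.  The missing 10 are still missing, and any
     later update that scales by dest.count instead of the edge sum will
     distribute counts that never existed.  */
  toy_block forwarder { dest.preds[0].count, { dest.preds[0] } };
  printf ("forwarder count: %lld\n", (long long) forwarder.count);
}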