https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |luoxhu at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can reproduce a regression with -Ofast -march=znver2 running on Haswell as
well.  -fopt-info doesn't reveal anything interesting besides

-fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 32987933)
+fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 129072791)

Obviously the slowdown is in P7Viterbi.  There are only minimal changes on
the GIMPLE side, one of them notable (side-by-side diff, before on the left,
after on the right):

niters_vector_mult_vf.205_2406 = niters.203_442 & 429496729 |  _2041 = niters.203_438 & 3;
_2408 = (int) niters_vector_mult_vf.205_2406;               |  if (_2041 == 0)
tmp.206_2407 = k_384 + _2408;                               |    goto <bb 66>; [25.00%]
_2300 = niters.203_442 & 3;                                 <
if (_2300 == 0)                                             <
  goto <bb 65>; [25.00%]                                    <
else                                                           else
  goto <bb 36>; [75.00%]                                         goto <bb 36>; [75.00%]

<bb 36> [local count: 41646173]:                            |  <bb 36> [local count: 177683003]:
# k_2403 = PHI <tmp.206_2407(35), tmp.239_2637(34)>         |  niters_vector_mult_vf.205_2409 = niters.203_438 & 429496729
# DEBUG k => k_2403                                         |  _2411 = (int) niters_vector_mult_vf.205_2409;
                                                            >  tmp.206_2410 = k_382 + _2411;
                                                            >
                                                            >  <bb 37> [local count: 162950122]:
                                                            >  # k_2406 = PHI <tmp.206_2410(36), tmp.239_2639(34)>

The sink pass now performs this transform where it did not before.  That's
apparently because of

  /* If BEST_BB is at the same nesting level, then require it to have
     significantly lower execution frequency to avoid gratuitous movement.  */
  if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
      /* If result of comparsion is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
      && !(best_bb->count * 100 >= early_bb->count * threshold))
    return best_bb;

  /* No better block found, so return EARLY_BB, which happens to be the
     statement's original block.  */
  return early_bb;

where the source block count is 96726596 before and 236910671 after, and the
destination block count is 72544947 before and 177683003 after.  The edge
probabilities are 75% vs 25%, and param_sink_frequency_threshold is exactly
75 as well.  Since 236910671 * 0.75 is rounded down to 177683003, the new
counts pass the test, while the old counts are an exact 75% match, which
defeats the sinking.  It's a little bit of an arbitrary choice;

diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 2e744d6ae50..9b368e13463 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -230,7 +230,7 @@ select_best_block (basic_block early_bb,
   if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
       /* If result of comparsion is unknown, prefer EARLY_BB.
         Thus use !(...>=..) rather than (...<...)  */
-      && !(best_bb->count * 100 >= early_bb->count * threshold))
+      && !(best_bb->count * 100 > early_bb->count * threshold))
     return best_bb;
 
   /* No better block found, so return EARLY_BB, which happens to be the

fixes the missed sinking but not the regression :/

The count differences start to appear when LC PHI blocks are added only for
virtuals, and pre-existing 'Invalid sum of incoming counts' issues then
eventually lead to mismatches.  The 'Invalid sum of incoming counts' starts
with the loop splitting pass:

fast_algorithms.c:145:10: optimized: loop split

Xionghu Luo did profile count updates there; I'm not sure whether that made
things worse in this case.
At least with broken BB counts, splitting/unsplitting an edge can propagate
the bogus counts elsewhere, it seems.
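
For concreteness, the arithmetic behind the select_best_block threshold test
can be double-checked with a few lines of C++.  This is a minimal sketch,
with plain 64-bit integers standing in for GCC's profile_count class, using
the counts quoted from the dumps above:

#include <cstdint>
#include <cstdio>

/* Sinking into BEST_BB is rejected when
   best_bb->count * 100 >= early_bb->count * threshold.  */
static bool
rejects_sinking (int64_t best_cnt, int64_t early_cnt, int64_t threshold)
{
  return best_cnt * 100 >= early_cnt * threshold;
}

int
main ()
{
  const int64_t threshold = 75;  /* param_sink_frequency_threshold  */

  /* Before: 75% of 96726596 is exactly 72544947, so the >= test matches
     exactly and the statement is not sunk.  */
  printf ("before: %s\n",
          rejects_sinking (72544947, 96726596, threshold) ? "no sink" : "sink");

  /* After: 75% of 236910671 is 177683003.25, rounded down to 177683003
     when the edge count was computed, so 177683003 * 100 is strictly
     smaller than 236910671 * 75 and the statement is sunk.  */
  printf ("after: %s\n",
          rejects_sinking (177683003, 236910671, threshold) ? "no sink" : "sink");
}

With >= relaxed to > as in the patch above, the exact-match "before" case
sinks as well, which is why the change restores the missed sinking and makes
the two compilations behave the same.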
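
To illustrate the edge-splitting point, here is a hypothetical toy model
(deliberately not GCC's CFG or profile API) of how an already-invalid
profile spreads: a block's count should equal the sum of its incoming edge
counts, and splitting an edge materializes the (possibly stale) edge count
as a brand new block's count:

#include <cstdint>
#include <cstdio>
#include <vector>

struct toy_edge { int64_t count; };

struct toy_block
{
  int64_t count;
  std::vector<toy_edge> preds;  /* incoming edges  */

  /* The 'Invalid sum of incoming counts' check.  */
  bool invalid_sum () const
  {
    int64_t sum = 0;
    for (const toy_edge &e : preds)
      sum += e.count;
    return sum != count;
  }
};

int
main ()
{
  /* A destination block whose profile an earlier pass left stale:
     the incoming edges sum to 90, but the block claims 100.  */
  toy_block dest { 100, { { 60 }, { 30 } } };
  printf ("dest has invalid sum: %d\n", (int) dest.invalid_sum ());

  /* Splitting the first edge inserts a forwarder block whose count is
     taken from that edge.  The missing 10 are still missing, and any
     later update that scales by dest.count instead of the edge sum will
     distribute counts that never existed.  */
  toy_block forwarder { dest.preds[0].count, { dest.preds[0] } };
  printf ("forwarder count: %lld\n", (long long) forwarder.count);
}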