https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117251
--- Comment #4 from Michael Meissner <meissner at gcc dot gnu.org> ---

I tracked down the commit that first made the slowdown visible:

commit 3a61ca1b9256535e1bfb19b2d46cde21f3908a5d (HEAD)
Author: Jan Hubicka <j...@suse.cz>
Date:   Thu Jul 6 18:56:22 2023 +0200

    Improve profile updates after loop-ch and cunroll

    Extend loop-ch and loop unrolling to fix profile in case the loop is
    known to not iterate at all (or iterate few times) while profile claims
    it iterates more.  While this is kind of a symptomatic fix, it is the
    best we can do in case the profile was originally estimated incorrectly.

    In the testcase the problematic loop is produced by vectorizer and I
    think vectorizer should know and account into its costs that vectorizer
    loop and/or epilogue is not going to loop after the transformation.  So
    it would be nice to fix it on that side, too.

    The patch avoids about half of profile mismatches caused by cunroll.

    Pass dump id and name |static mismatch|dynamic mismatch
                          |in count       |in count
    107t cunrolli         |      3      +3|     17251     +17251
    115t threadfull       |      3        |     14376      -2875
    116t vrp              |      5      +2|     30908     +16532
    117t dse              |      5        |     30908
    118t dce              |      3      -2|     17251     -13657
    127t ch               |     13     +10|     17251
    131t dom              |     39     +26|     17251
    133t isolate-paths    |     47      +8|     17251
    134t reassoc          |     49      +2|     17251
    136t forwprop         |     53      +4|    202501    +185250
    159t cddce            |     61      +8|    216211     +13710
    161t ldist            |     62      +1|    216211
    172t ifcvt            |     66      +4|    373711    +157500
    173t vect             |    143     +77|   9802097   +9428386
    176t cunroll          |    221     +78|  15639591   +5837494
    183t loopdone         |    218      -3|  15577640     -61951
    195t fre              |    214      -4|  15577640
    197t dom              |    213      -1|  16671606   +1093966
    199t threadfull       |    215      +2|  16879581    +207975
    200t vrp              |    217      +2|  17077750    +198169
    204t dce              |    215      -2|  17004486     -73264
    206t sink             |    213      -2|  17004486
    211t cddce            |    219      +6|  17005926      +1440
    255t optimized        |    217      -2|  17005926
    256r expand           |    210      -7|  19571573   +2565647
    258r into_cfglayout   |    208      -2|  19571573
    275r loop2_unroll     |    212      +4|  22992432   +3420859
    291r ce2              |    210      -2|  23011838
    312r pro_and_epilogue |    230     +20|  23073776     +61938
    315r jump2            |    236      +6|  27110534   +4036758
    323r bbro             |    229      -7|  21826835   -5283699

    W/o the patch cunroll does:

    176t cunroll          |    294    +151| 126548439 +116746342

    and we end up with 291 mismatches at bbro.

    Bootstrapped/regtested x86_64-linux.  Plan to commit it after the
    scale_loop_frequency patch.

    gcc/ChangeLog:

            PR middle-end/25623
            * tree-ssa-loop-ch.cc (ch_base::copy_headers): Scale loop
            frequency to maximal number of iterations determined.
            * tree-ssa-loop-ivcanon.cc (try_unroll_loop_completely): Likewise.

    gcc/testsuite/ChangeLog:

            PR middle-end/25623
            * gfortran.dg/pr25623-2.f90: New test.

However, I backed that particular patch out of the trunk sources, and it
shows similar regressions.

Here is the scale loop patch that was mentioned above; it is the adjacent
patch.  At present, I have not tried backing out this patch:

commit d4c2e34deef8cbd81ba2ef3389fdbaf95c70e225
Author: Jan Hubicka <j...@suse.cz>
Date:   Thu Jul 6 18:51:02 2023 +0200

    Improve scale_loop_profile

    Original scale_loop_profile was implemented to only handle very simple
    loops produced by vectorizer at that time (basically loops with only one
    exit and no subloops).  It also has not been updated to new
    profile-count API very carefully.

    The function does two things
     1) scales down the loop profile by a given probability.
        This is useful, for example, to scale down profile after peeling
        when loop body is executed less often than before
     2) update profile to cap iteration count by ITERATION_BOUND parameter.

    I changed ITERATION_BOUND to be actual bound on number of iterations as
    used elsewhere (i.e. number of executions of latch edge) rather than
    number of iterations + 1 as it was before.

    To do 2) one needs to do the following
      a) scale down loop profile so frequency of header is at most
         the sum of in-edge counts * (iteration_bound + 1)
      b) update loop exit probabilities so their count is the same
         as before scaling.
      c) reduce frequencies of basic blocks after loop exit

    Old code did b) by setting probability to 1 / iteration_bound, which is
    correct only if the basic block containing the exit executes precisely
    once per iteration (it is not inside another conditional or inner loop).
    This is fixed now by using set_edge_probability_and_rescale_others.

    Also, c) was implemented only for the special case when the exit was
    just before the latch basic block.  I now use dominance info to get some
    of the additional cases right.

    I still did not try to do anything for multiple exit loops, though the
    implementation could be generalized.

    Bootstrapped/regtested x86_64-linux.  Plan to commit it tonight if there
    are no complaints.

    gcc/ChangeLog:

            * cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge
            probability update to be safe on loops with subloops.
            Make bound parameter to be iteration bound.
            * tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call
            of scale_loop_profile.
            * tree-vect-loop-manip.cc (vect_do_peeling): Likewise.
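
To make steps a) and b) of the scale_loop_profile description easier to follow, here is a toy Python model of the arithmetic only. It is a sketch, not GCC code: it assumes a single-exit loop, uses plain integers and fractions in place of GCC's profile_count and profile_probability types, and the function and parameter names are invented for illustration. Step c) is not modelled.

```python
from fractions import Fraction


def cap_loop_profile(entry_count, header_count, exit_block_count,
                     exit_prob, iteration_bound):
    """Toy model of capping a loop profile by an iteration bound.

    entry_count      -- sum of counts on edges entering the loop
    header_count     -- count currently recorded for the loop header
    exit_block_count -- count of the basic block holding the exit test
    exit_prob        -- current probability of the exit edge (Fraction)
    iteration_bound  -- maximal number of latch edge executions

    All names are hypothetical; this is not the GCC API.
    Returns (scale, new_header_count, new_exit_prob).
    """
    # Step a): the header runs at most (iteration_bound + 1) times per entry.
    max_header = entry_count * (iteration_bound + 1)
    if header_count == 0 or header_count <= max_header:
        # Profile already respects the bound; nothing to scale.
        return Fraction(1), Fraction(header_count), exit_prob
    scale = Fraction(max_header, header_count)

    # Step b): keep the exit edge count the same as before scaling.
    old_exit_count = exit_block_count * exit_prob
    new_exit_block_count = exit_block_count * scale
    if new_exit_block_count == 0:
        return scale, header_count * scale, exit_prob
    # The new probability depends on how often the exit block itself runs,
    # not on the bound alone; a probability cannot exceed 1.
    new_exit_prob = min(Fraction(1), old_exit_count / new_exit_block_count)
    return scale, header_count * scale, new_exit_prob
```

For example, with entry_count=100, header_count=10000, an exit test in the header with probability 1/100, and iteration_bound=3, the header count is capped at 400 and the exit probability becomes 1/4. If the same exit test instead sits in a block that runs only half as often (exit_block_count=5000, probability 1/50), the result is 1/2, which illustrates why a fixed probability derived from the bound alone is right only when the exit block executes exactly once per iteration.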