https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119351
--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> --- Sorry for the delay, had a few days off. So looking at this again, it's happening When next_ci gets inlined into nbnxn_make_pairlist_part, the while loop while (next_ci(iGrid, nth, ci_block, &ci_x, &ci_y, &ci_b, &ci)) becomes vectorizable with early break. The code is quite a deep nesting of C++ templated function so it's hard to map it back to source code. the ifcvt loop that's getting vectorized is: <bb 384> [local count: 9554422]: _1643 = ci_y_1924 + 1; if (_1643 == _1649) goto <bb 385>; [5.50%] else goto <bb 458>; [94.50%] <bb 458> [local count: 9028929]: goto <bb 387>; [100.00%] <bb 385> [local count: 525493]: _1646 = ci_x_1920 + 1; <bb 386> [local count: 1081571]: # ci_x_1920 = PHI <ci_x_1919(382), _1646(385)> # ci_y_1923 = PHI <ci_y_1922(382), 0(385)> _1650 = _1649 * ci_x_1920; <bb 387> [local count: 10110500]: # ci_y_1924 = PHI <_1643(458), ci_y_1923(386)> _1651 = _1650 + ci_y_1924; _1652 = _1651 + 1; _1653 = (long unsigned int) _1652; _1655 = _1653 * 4; _1656 = _1654 + _1655; _1657 = *_1656; if (_1657 <= ci_1918) goto <bb 384>; [94.50%] else goto <bb 388>; [5.50%] Which doesn't look like it should have been vectorized. BB 384 has a non-exiting control flow. BB 385 and 386 should only conditionally be executed if _1643 != 1649 But they're not an exit, The generated code: 504: 05a03984 mov z4.s, w12 508: 25ac0fef whilelo p15.s, wzr, w12 50c: 0b01018e add w14, w12, w1 510: 04a40785 sub z5.s, z28.s, z4.s 514: 25ae0fe7 whilelo p7.s, wzr, w14 518: cb2c4019 sub x25, x0, w12, uxtw 51c: 04a600a7 add z7.s, z5.s, z6.s 520: 25075fe0 not p0.b, p7/z, p15.b 524: 8b190bcd add x13, x30, x25, lsl #2 528: 14000005 b 53c 52c: 91001210 add x16, x16, #0x4 530: 25a0c087 add z7.s, z7.s, #4 534: 25ae0e00 whilelo p0.s, w16, w14 538: 54000480 b.eq 5c8 // b.none 53c: a55041ab ld1w {z11.s}, p0/z, [x13, x16, lsl #2] 540: 248b8181 cmpge p1.s, p0/z, z12.s, z11.s 544: 250e7a22 not p2.b, p14/z, p1.b 548: 2550f840 ptest p14, p2.b 54c: 54ffff00 b.eq 52c // b.none 550: 1e2600f1 fmov w17, s7 Is really broken since it only preserved the exit check. the check for if (_1643 == _1649) goto <bb 385>; [5.50%] else goto <bb 458>; [94.50%] seems to have disappeared and indeed: pairlist.cpp:2590:5: note: ==> examining statement: if (_1643 == _1649) pairlist.cpp:2590:5: note: skip. pairlist.cpp:2590:5: note: ==> examining pattern def stmt: patt_3554 = _1643 == _1649; pairlist.cpp:2590:5: note: skip. pairlist.cpp:2590:5: note: ==> examining pattern statement: if (patt_3554 != 0) pairlist.cpp:2590:5: note: skip. the compiler knows it can't vectorize this. so why didn't we stop earlier. This should have been checked in vect_analyze_loop_form at: /* Check if we have any control flow that doesn't leave the loop. */ basic_block *bbs = get_loop_body (loop); for (unsigned i = 0; i < loop->num_nodes; i++) if (EDGE_COUNT (bbs[i]->succs) != 1 && (EDGE_COUNT (bbs[i]->succs) != 2 || !loop_exits_from_bb_p (bbs[i]->loop_father, bbs[i]))) { free (bbs); return opt_result::failure_at (vect_location, "not vectorized:" " unsupported control flow in loop.\n"); } looks like this doesn't recognize this diamond layout. Indeed it thinks 384 -> 385 is an exit.. pairlist.cpp:2590:5: note: using as main loop exit: 384 -> 385 [AUX: (nil)] pairlist.cpp:2590:5: note: === get_loop_niters === pairlist.cpp:2590:5: note: Loop has 2 exits. pairlist.cpp:2590:5: note: Analyzing exit 0... pairlist.cpp:2590:5: note: Analyzing exit 1... huh, interestingly it doesn't consider 385 as part of the loop.. iterations by profile: 8.347976 (unreliable, maybe flat) entry count:1081571 (estimated locally, freq 3.3915)) { bb_384 (preds = {bb_387 }, succs = {bb_385 bb_458 }) bb_458 (preds = {bb_384 }, succs = {bb_387 }) bb_387 (preds = {bb_458 bb_386 }, succs = {bb_384 bb_388 }) } But it clearly is.. ;; basic block 385, loop depth 2, count 525493 (estimated locally, freq 1.6478), maybe hot ;; prev block 458, next block 386, flags: (NEW, REACHABLE, VISITED) ;; pred: 384 [5.5% (guessed)] count:525493 (estimated locally, freq 1.6478) (TRUE_VALUE,EXECUTABLE) # RANGE [irange] int [1, +INF] _1646 = ci_x_1920 + 1; ;; succ: 386 [always] count:525493 (estimated locally, freq 1.6478) (FALLTHRU,DFS_BACK,EXECUTABLE) ;; basic block 386, loop depth 2, count 1081571 (estimated locally, freq 3.3915), maybe hot ;; prev block 385, next block 387, flags: (NEW, REACHABLE, VISITED) ;; pred: 382 [always] count:556078 (estimated locally, freq 1.7437) (FALLTHRU,EXECUTABLE) ;; 385 [always] count:525493 (estimated locally, freq 1.6478) (FALLTHRU,DFS_BACK,EXECUTABLE) # .MEM_1725 = PHI <.MEM_171(382), .MEM_1725(385)> # RANGE [irange] int [0, +INF] # ci_x_1920 = PHI <ci_x_1919(382), _1646(385)> # RANGE [irange] int [0, +INF] # ci_y_1923 = PHI <ci_y_1922(382), 0(385)> _1650 = _1649 * ci_x_1920; ;; succ: 387 [always] count:1081571 (estimated locally, freq 3.3915) (FALLTHRU,EXECUTABLE) Oh I see.. BB 385 is at a loop depth of 2, but BB 384 is at a loop depth of 3. So it does leave the loop, but then re-enters it into the header, skipping the pre-header. The exit has a phi that's within a loop iteration set by another loop: ;; basic block 387, loop depth 3, count 10110500 (estimated locally, freq 31.7034), maybe hot ;; prev block 386, next block 388, flags: (NEW, REACHABLE, VISITED) ;; pred: 458 [always] count:9028929 (estimated locally, freq 28.3119) (FALLTHRU,DFS_BACK,EXECUTABLE) ;; 386 [always] count:1081571 (estimated locally, freq 3.3915) (FALLTHRU,EXECUTABLE) # RANGE [irange] int [0, +INF] so this can't really be vectorized with early exits. So we need something like this diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index bb1138bfcfb..05ae7beb1cc 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -1883,6 +1883,12 @@ vect_analyze_loop_form (class loop *loop, gimple *loop_vectorized_call, return opt_result::failure_at (vect_location, "not vectorized:" " abnormal loop exit edge.\n"); + + for (auto pred : e->src->preds) + if (pred->src->loop_father != loop) + return opt_result::failure_at (vect_location, + "not vectorized:" + " abnormal loop incoming edge.\n"); } info->conds but should also be ok if one of the preds is the loop pre-header (so the patch above is too strict). This weirdness is happening because next_ci itself is a loop, that gets inlined into the while condition. Working on a small reproducer and patch.