https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649
--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I tried zen3 with -march=native -Ofast

Samples: 1M of event 'cycles:u', Event count (approx.): 2309002237334, DSO: s
Overhead  Command          Symbol
  42.51%  sphinx_livepret  [.] mgau_eval
  24.36%  sphinx_livepret  [.] vector_gautbl_eval_logs3
   6.81%  sphinx_livepret  [.] subvq_mgau_shortlist
   6.43%  sphinx_livepret  [.] logs3_add
   4.91%  sphinx_livepret  [.] approx_cont_mgau_frame_eval
   4.32%  sphinx_livepret  [.] mdef_sseq2sen_active
   2.62%  sphinx_livepret  [.] dict2pid_comsenscr
   1.50%  sphinx_livepret  [.] hmm_vit_eval_3st
   0.84%  sphinx_livepret  [.] lextree_hmm_eval
   0.67%  sphinx_livepret  [.] lextree_hmm_propagate
   0.64%  sphinx_livepret  [.] lextree_enter
   0.61%  sphinx_livepret  [.] fe_fft
   0.45%  sphinx_livepret  [.] dict2pid_comsseq2sen_active
   0.32%  sphinx_livepret  [.] lextree_ssid_active
   0.18%  sphinx_livepret  [.] vithist_rescore
   0.14%  sphinx_livepret  [.] utt_decode_block
   0.12%  sphinx_livepret  [.] fe_mel_cep

Prior to vectorization there is no invalid profile in mgau_eval. The loop is:

    for (c = 0; c < mgau->n_comp-1; c += 2) { /* Interleave 2 components for speed */
        m1 = mgau->mean[c];
        m2 = mgau->mean[c+1];
        v1 = mgau->var[c];
        v2 = mgau->var[c+1];
        dval1 = mgau->lrd[c];
        dval2 = mgau->lrd[c+1];

        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
            diff2 = x[i] - m2[i];
            dval2 -= diff2 * diff2 * v2[i];
            /* E_INFO("x %10f m1 %10f m2 %10f v1 %10f, v2 %10f\n",x[i],m1[i],m2[i],v1[i],v2[i]);
               E_INFO("diff1 %10f,dval1 %10f, diff2 %10f, dval2 %10f\n",diff1,dval1,diff2,dval2);*/
        }

        if (dval1 < g->distfloor) /* Floor */
            dval1 = g->distfloor;
        if (dval2 < g->distfloor)
            dval2 = g->distfloor;

        score = logs3_add (score, (int32)(f * dval1) + mgau->mixw[c]);
        score = logs3_add (score, (int32)(f * dval2) + mgau->mixw[c+1]);
    }

and the inner loop iterates 47 times on average. The vectorizer has profitability threshold 8 and vectorizes with 32-byte vectors. The epilogue has threshold 4 and is vectorized with 16-byte vectors.
There is a second, similar loop nest in the function:

    for (j = 0; active[j] >= 0; j++) {
    #ifdef SPEC_CPU
        considered++;
    #endif
        c = active[j];

        m1 = mgau->mean[c];
        v1 = mgau->var[c];
        dval1 = mgau->lrd[c];

        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
        }

        if (dval1 < g->distfloor)
            dval1 = g->distfloor;

        score = logs3_add (score, (int32)(f * dval1) + mgau->mixw[c]);
    }

which is executed 10% of the time and is also vectorized twice. We then believe that the inner loop iterates 5 times (I would expect 47/4 times). In the cunroll pass we then see:

    Loop 4 iterates at most 2147483647 times.
    Loop 4 likely iterates at most 2147483647 times.
    Not unrolling loop 4 (--param max-completely-peel-times limit reached).

This is the outer loop.

    Loop 7 iterates at most 2 times.
    Loop 7 likely iterates at most 2 times.
    Loop size: 22
    Estimated size after unrolling: 42
    cont_mgau.c:604:20: optimized: loop with 2 iterations completely unrolled (header execution count 1065258)

This is the scalar epilogue loop.

    Loop 6 iterates at most 0 times.
    Loop 6 likely iterates at most 0 times.
    cont_mgau.c:575:7: optimized: loop turned into non-loop; it never loops

This is the vectorized epilogue loop (really a non-loop). So this looks OK, but it introduced one mismatch in the profile.
Before the pass we had:

    ;; basic block 14, loop depth 2, count 171249098 (guessed, freq 23.9461), maybe hot
    ;;  prev block 51, next block 66, flags: (NEW, VISITED)
    ;;  pred:       24 [always]  count:142707582 (guessed, freq 19.9550) (FALLTHRU,DFS_BACK,EXECUTABLE)
    ;;              51 [always]  count:28541516 (guessed, freq 3.9910) (FALLTHRU,EXECUTABLE)

and now we get:

    ;; basic block 14, loop depth 2, count 13764235 (guessed, freq 1.9247), maybe hot
    ;; Invalid sum of incoming counts 25234431 (guessed, freq 3.5286), should be 13764235 (guessed, freq 1.9247)
    ;;  prev block 83, next block 66, flags: (NEW, VISITED)
    ;;  pred:       24 [always]  count:11470196 (guessed, freq 1.6039) (FALLTHRU,DFS_BACK,EXECUTABLE)
    ;;              83 [always]  count:13764235 (guessed, freq 1.9247) (FALLTHRU,EXECUTABLE)

This does look wrong, since the loop was not unrolled, yet its profile was reduced significantly.

I also noticed that in another (not hot) function we get the following BB with nonsensical exit edges:

    ;; basic block 74, loop depth 3, count 258660 (guessed, freq 258660.0000), maybe hot
    ;; Invalid sum of outgoing probabilities 120.0%
    ;;  prev block 155, next block 175, flags: (NEW, REACHABLE, VISITED)
    ;;  pred:       97 [always]  count:215550 (guessed, freq 215550.0000) (FALLTHRU,DFS_BACK,EXECUTABLE)
    ;;              155 [always]  count:43110 (guessed, freq 43110.0000) (FALLTHRU,EXECUTABLE)
      # i_212 = PHI <i_232(97), 0(155)>
      # n_94 = PHI <_453(97), n_244(155)>
      # vect_n_94.158_583 = PHI <vect__453.169_601(97), { 0, 0, 0, 0, 0, 0, 0, 0 }(155)>
      # vectp.159_584 = PHI <vectp.159_585(97), _222(155)>
      # vectp.165_594 = PHI <vectp.165_595(97), _222(155)>
      # ivtmp_612 = PHI <ivtmp_613(97), 0(155)>
      # DEBUG BEGIN_STMT
      _224 = (long unsigned int) i_212;
      _225 = _224 * 4;
      _226 = _222 + _225;
      vect__227.161_586 = MEM <vector(8) float> [(float32 *)vectp.159_584];
      _227 = *_226;
      vect__228.162_587 = [vec_unpack_lo_expr] vect__227.161_586;
      vect__228.162_588 = [vec_unpack_hi_expr] vect__227.161_586;
      _228 = (double) _227;
      mask__470.163_590 = vect_cst__589 > vect__228.162_587;
      mask__470.163_591 = vect_cst__589 > vect__228.162_588;
      _470 = varfloor_23(D) > _228;
      # DEBUG BEGIN_STMT
      mask_patt_538.164_592 = VEC_PACK_TRUNC_EXPR <mask__470.163_590, mask__470.163_591>;
      if (mask_patt_538.164_592 == { 0, 0, 0, 0, 0, 0, 0, 0 })
        goto <bb 174>; [100.00%]
      else
        goto <bb 175>; [20.00%]

The edge to 174 seems just wrong:

    ;; basic block 174, loop depth 3, count 258660 (guessed, freq 258660.0000), maybe hot
    ;; Invalid sum of incoming counts 310392 (guessed, freq 310392.0000), should be 258660 (guessed, freq 258660.0000)
    ;;  prev block 175, next block 97, flags: (NEW, VISITED)
    ;;  pred:       74 [always]  count:258660 (guessed, freq 258660.0000) (TRUE_VALUE,EXECUTABLE)
    ;;              175 [always]  count:51732 (guessed, freq 51732.0000) (FALLTHRU,EXECUTABLE)
      # DEBUG BEGIN_STMT
      #

So if the probability were 80% it would be almost right. This problem repeats twice.