https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649

--- Comment #6 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I tried zen3 with -march=native -Ofast 

Samples: 1M of event 'cycles:u', Event count (approx.): 2309002237334, DSO: s
Overhead  Command          Symbol
  42.51%  sphinx_livepret  [.] mgau_eval
  24.36%  sphinx_livepret  [.] vector_gautbl_eval_logs3
   6.81%  sphinx_livepret  [.] subvq_mgau_shortlist
   6.43%  sphinx_livepret  [.] logs3_add
   4.91%  sphinx_livepret  [.] approx_cont_mgau_frame_eval
   4.32%  sphinx_livepret  [.] mdef_sseq2sen_active
   2.62%  sphinx_livepret  [.] dict2pid_comsenscr
   1.50%  sphinx_livepret  [.] hmm_vit_eval_3st
   0.84%  sphinx_livepret  [.] lextree_hmm_eval
   0.67%  sphinx_livepret  [.] lextree_hmm_propagate
   0.64%  sphinx_livepret  [.] lextree_enter
   0.61%  sphinx_livepret  [.] fe_fft
   0.45%  sphinx_livepret  [.] dict2pid_comsseq2sen_active
   0.32%  sphinx_livepret  [.] lextree_ssid_active
   0.18%  sphinx_livepret  [.] vithist_rescore
   0.14%  sphinx_livepret  [.] utt_decode_block
   0.12%  sphinx_livepret  [.] fe_mel_cep

Prior to vectorization there is no invalid profile in mgau_eval.
The loop is
        for (c = 0; c < mgau->n_comp-1; c += 2) {       /* Interleave 2 components for speed */
            m1 = mgau->mean[c];
            m2 = mgau->mean[c+1];
            v1 = mgau->var[c];
            v2 = mgau->var[c+1];
            dval1 = mgau->lrd[c];
            dval2 = mgau->lrd[c+1];

            for (i = 0; i < veclen; i++) {
                diff1 = x[i] - m1[i];
                dval1 -= diff1 * diff1 * v1[i];
                diff2 = x[i] - m2[i];
                dval2 -= diff2 * diff2 * v2[i];
                /*              E_INFO("x %10f m1 %10f m2 %10f v1 %10f, v2 %10f\n",x[i],m1[i],m2[i],v1[i],v2[i]);
                                E_INFO("diff1 %10f,dval1 %10f, diff2 %10f, dval2 %10f\n",diff1,dval1,diff2,dval2);*/
            }

            if (dval1 < g->distfloor)   /* Floor */
                dval1 = g->distfloor;
            if (dval2 < g->distfloor)
                dval2 = g->distfloor;

            score = logs3_add (score, (int32)(f * dval1) + mgau->mixw[c]);
            score = logs3_add (score, (int32)(f * dval2) + mgau->mixw[c+1]);
        }
and the inner loop iterates 47 times on average. The vectorizer has a
profitability threshold of 8 and vectorizes with 32-byte vectors.
The epilogue has a threshold of 4 and is vectorized with 16-byte vectors.
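
To make the trip-count arithmetic concrete, here is a minimal sketch of how a
scalar trip count is split between the main vector loop, the vectorized
epilogue and the scalar epilogue. It assumes 8 floats per main-loop vector and
4 floats per epilogue vector; those vectorization factors, the struct and the
function name are illustrative assumptions, not taken from the GCC sources,
and peeling for alignment or loop versioning is ignored.

```c
/* Hypothetical helper, not part of GCC: split a scalar trip count into
   main vector loop, vectorized epilogue and scalar epilogue iterations. */
struct trip_split {
    unsigned main_vec;  /* iterations of the main vector loop */
    unsigned epi_vec;   /* iterations of the vectorized epilogue */
    unsigned scalar;    /* iterations left for the scalar epilogue */
};

struct trip_split split_trip_count(unsigned niters, unsigned vf,
                                   unsigned epi_vf)
{
    struct trip_split s;
    unsigned rest = niters % vf;   /* elements the main loop cannot cover */

    s.main_vec = niters / vf;
    s.epi_vec = rest / epi_vf;
    s.scalar = rest % epi_vf;
    return s;
}
```

Under these assumptions, split_trip_count (47, 8, 4) gives 5 main vector
iterations, 1 epilogue vector iteration and 3 scalar iterations
(47 = 5*8 + 1*4 + 3).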

There is a second, similar loop nest in the function:
        for (j = 0; active[j] >= 0; j++) {
#ifdef SPEC_CPU
            considered++;
#endif
            c = active[j];

            m1 = mgau->mean[c];
            v1 = mgau->var[c];
            dval1 = mgau->lrd[c];

            for (i = 0; i < veclen; i++) {
                diff1 = x[i] - m1[i];
                dval1 -= diff1 * diff1 * v1[i];
            }

            if (dval1 < g->distfloor)
                dval1 = g->distfloor;

            score = logs3_add (score, (int32)(f * dval1) + mgau->mixw[c]);
        }
which is executed 10% of the time and is also vectorized twice.

We then believe that the inner loop iterates 5 times (I would expect 47/4
times).

In the cunroll pass we then see:
   Loop 4 iterates at most 2147483647 times. 
   Loop 4 likely iterates at most 2147483647 times.
   Not unrolling loop 4 (--param max-completely-peel-times limit reached).

This is the outer loop

   Loop 7 iterates at most 2 times.
   Loop 7 likely iterates at most 2 times.
  Loop size: 22
  Estimated size after unrolling: 42
  cont_mgau.c:604:20: optimized: loop with 2 iterations completely unrolled (header execution count 1065258)

This is the scalar epilogue loop.

   Loop 6 iterates at most 0 times.
   Loop 6 likely iterates at most 0 times.
   cont_mgau.c:575:7: optimized: loop turned into non-loop; it never loops

This is the vectorized epilogue loop (really non-loop).

So this looks OK, but it introduced one mismatch in the profile. Before the
pass we had:
;;   basic block 14, loop depth 2, count 171249098 (guessed, freq 23.9461), maybe hot
;;    prev block 51, next block 66, flags: (NEW, VISITED)
;;    pred:       24 [always]  count:142707582 (guessed, freq 19.9550) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                51 [always]  count:28541516 (guessed, freq 3.9910) (FALLTHRU,EXECUTABLE)

and now we get:
;;   basic block 14, loop depth 2, count 13764235 (guessed, freq 1.9247), maybe hot
;;   Invalid sum of incoming counts 25234431 (guessed, freq 3.5286), should be 13764235 (guessed, freq 1.9247)
;;    prev block 83, next block 66, flags: (NEW, VISITED)
;;    pred:       24 [always]  count:11470196 (guessed, freq 1.6039) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                83 [always]  count:13764235 (guessed, freq 1.9247) (FALLTHRU,EXECUTABLE)

This does look wrong, since the loop was not unrolled, yet its profile was
reduced significantly.
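
The "Invalid sum of incoming counts" diagnostic boils down to comparing a
block's count with the sum of the counts on its incoming edges. The following
is only a sketch of that arithmetic using the numbers from the two dumps
above; the function names are hypothetical and this is not the checker GCC
itself uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum the counts carried by the incoming edges. */
uint64_t sum_incoming(const uint64_t *pred_counts, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pred_counts[i];
    return sum;
}

/* Returns 1 when the incoming edge counts add up to the block count. */
int incoming_consistent(uint64_t bb_count, const uint64_t *pred_counts,
                        size_t n)
{
    return sum_incoming(pred_counts, n) == bb_count;
}
```

Before the pass the predecessors of bb 14 carry 142707582 + 28541516 =
171249098, which matches the block count. Afterwards they carry
11470196 + 13764235 = 25234431 while the block count is 13764235, which is
the reported mismatch.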

I also noticed that in another (not hot) function we get the following BB
with nonsensical exit edges:

;;   basic block 74, loop depth 3, count 258660 (guessed, freq 258660.0000), maybe hot
;;   Invalid sum of outgoing probabilities 120.0%
;;    prev block 155, next block 175, flags: (NEW, REACHABLE, VISITED)
;;    pred:       97 [always]  count:215550 (guessed, freq 215550.0000) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                155 [always]  count:43110 (guessed, freq 43110.0000) (FALLTHRU,EXECUTABLE)
  # i_212 = PHI <i_232(97), 0(155)>
  # n_94 = PHI <_453(97), n_244(155)>
  # vect_n_94.158_583 = PHI <vect__453.169_601(97), { 0, 0, 0, 0, 0, 0, 0, 0 }(155)>
  # vectp.159_584 = PHI <vectp.159_585(97), _222(155)>
  # vectp.165_594 = PHI <vectp.165_595(97), _222(155)>
  # ivtmp_612 = PHI <ivtmp_613(97), 0(155)>
  # DEBUG BEGIN_STMT
  _224 = (long unsigned int) i_212;
  _225 = _224 * 4;
  _226 = _222 + _225;
  vect__227.161_586 = MEM <vector(8) float> [(float32 *)vectp.159_584];
  _227 = *_226;
  vect__228.162_587 = [vec_unpack_lo_expr] vect__227.161_586;
  vect__228.162_588 = [vec_unpack_hi_expr] vect__227.161_586;
  _228 = (double) _227;
  mask__470.163_590 = vect_cst__589 > vect__228.162_587;
  mask__470.163_591 = vect_cst__589 > vect__228.162_588;
  _470 = varfloor_23(D) > _228;
  # DEBUG BEGIN_STMT
  mask_patt_538.164_592 = VEC_PACK_TRUNC_EXPR <mask__470.163_590, mask__470.163_591>;
  if (mask_patt_538.164_592 == { 0, 0, 0, 0, 0, 0, 0, 0 })
    goto <bb 174>; [100.00%]
  else
    goto <bb 175>; [20.00%]

The edge to 174 seems just wrong:

;;   basic block 174, loop depth 3, count 258660 (guessed, freq 258660.0000), maybe hot
;;   Invalid sum of incoming counts 310392 (guessed, freq 310392.0000), should be 258660 (guessed, freq 258660.0000)
;;    prev block 175, next block 97, flags: (NEW, VISITED)
;;    pred:       74 [always]  count:258660 (guessed, freq 258660.0000) (TRUE_VALUE,EXECUTABLE)
;;                175 [always]  count:51732 (guessed, freq 51732.0000) (FALLTHRU,EXECUTABLE)
  # DEBUG BEGIN_STMT
  #

So if the probability of the edge to bb 174 were 80% rather than 100%, it
would be almost right.
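
The numbers can be checked with a short sketch. It assumes that the count on
an edge is the source block's count scaled by the edge probability, expressed
here in whole percent; the helper name is hypothetical and GCC's own
profile_probability arithmetic is more involved.

```c
#include <stdint.h>

/* Hypothetical helper: count carried by an edge leaving a block with
   count bb_count, taken with probability prob_percent (0..100). */
uint64_t edge_count(uint64_t bb_count, unsigned prob_percent)
{
    return bb_count * prob_percent / 100;
}
```

With the 100% edge from bb 74, bb 174 receives 258660 + 51732 = 310392, which
is the reported invalid sum. With 80% it would receive 206928 + 51732 =
258660, the count bb 174 actually has, and the outgoing probabilities of
bb 74 would sum to 100% instead of 120%.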

This problem repeats twice.
