> Hi Both,
>
> Thanks for all the reviews/patches so far 😊
>
> > > Looks good, but I wonder what we can do to at least make the
> > > multiple exit case behave reasonably?  The vectorizer keeps track
> > > of a "canonical" exit, would it be possible to pass in the main exit
> > > edge and use that instead of single_exit (), would other exits then
> > > behave somewhat reasonably or would we totally screw things up here?
> > > That is, the "canonical" exit would be the counting exit while the
> > > other exits are on data driven conditions and thus wouldn't change
> > > probability when we reduce the number of iterations(?)
> >
> > I can add a canonical_exit parameter and make the function direct the
> > flow to it if possible.  However, overall I think the fixup depends on
> > what transformation led to the change.
> >
> > Assuming that the vectorizer did no prologues and epilogues and we
> > vectorized with factor N, then I think the update could be done more
> > specifically as follows.
> >
>
> If it helps: the way this patch series addresses multiple exits is by
> forcing a scalar epilogue; all non-canonical exits would have been
> redirected to this scalar epilogue, so the remaining scalar iteration
> count will be at most VF.
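To make the quoted suggestion concrete, here is a minimal sketch of the
update for the simplest case (no prolog/epilog, vectorization factor N),
reusing the existing scale_loop_profile helper; the wrapper name and its
parameters below are only illustrative, not actual vectorizer variables:

/* Sketch only: when LOOP has been vectorized with factor VF and no prolog
   or epilog was generated, each vector iteration now covers VF scalar
   iterations, so the loop body profile shrinks by VF and the expected
   iteration bound becomes roughly the original bound divided by VF.  */

static void
sketch_scale_vect_loop_profile (class loop *loop, unsigned int vf,
				gcov_type orig_niter_bound)
{
  /* Scale all counts inside the loop body by 1/VF.  */
  profile_probability scale
    = profile_probability::always ().apply_scale (1, vf);
  /* Cap the expected iteration count accordingly (-1 means no bound,
     following the convention of the existing calls).  */
  gcov_type new_bound = orig_niter_bound >= 0 ? orig_niter_bound / vf : -1;
  scale_loop_profile (loop, scale, new_bound);
}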
It looks like the profile update after vectorization needs quite some TLC.
My student Ondrej Kubanek also implemented loop histogram profiling which
gives a better idea of how commonly prologues/epilogues are needed, and it
would also be nice to handle that.

> > ;; basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
> > ;;  prev block 9, next block 13, flags: (NEW, VISITED)
> > ;;  pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> > ;;  succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)
> >
> > ;; basic block 13, loop depth 1, count 1063004409 (estimated locally), maybe hot
> > ;;  prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> > ;;  pred:       14 [always]  count:1052266996 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
> > ;;              12 [always]  count:10737416 (estimated locally) (FALLTHRU)
> >   # i_30 = PHI <i_36(14), 98(12)>
> >   # ivtmp_32 = PHI <ivtmp_37(14), 1(12)>
> >   _33 = a[i_30];
> >   _34 = _33 + 1;
> >   a[i_30] = _34;
> >   i_36 = i_30 + 1;
> >   ivtmp_37 = ivtmp_32 - 1;
> >   if (ivtmp_37 != 0)
> >     goto <bb 14>; [98.99%]
> >   else
> >     goto <bb 4>; [1.01%]

Actually it seems that the scalar epilogue loop keeps the original profile
(predicted to iterate 99 times), which is quite wrong.

Looking at the statistics for yesterday's patch, on tramp3d we got an 86%
reduction in cumulative profile mismatches after the whole optimization
pipeline.  More interestingly, however, the overall time estimate dropped
by 18%, so it seems that the profile adjustments done by cunroll are
affecting the profile a lot.

I think the fact that the iteration counts of epilogues are not capped is
one of the main problems.  We seem to call scale_loop_profile 3 times:

  scale_loop_profile (loop, prob_vector, -1);

This seems to account for the probability that control flow is redirected
to the prolog/epilog later.  So it only scales down the profile; it is not
responsible for setting any iteration bound.

  scale_loop_profile (prolog, prob_prolog, bound_prolog - 1);

This handles the prolog and sets its bound.

  scale_loop_profile (epilog, prob_epilog, -1);

This scales the epilog but does not set a bound at all.  I think the
information is available since we update the loop_info data structures.

Honza
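
PS: to make the last point concrete, a minimal sketch of the kind of change
I mean for the epilog call, assuming the vectorization factor is still
available at that point in the peeling code (the helper name and the vf
parameter are illustrative, not the actual variables there):

/* Sketch only: cap the scalar epilogue the same way the prolog is capped.
   After vectorizing with factor VF the epilogue runs at most VF - 1
   iterations (at most VF when a scalar epilogue is forced for multiple
   exits), so pass that bound to scale_loop_profile instead of -1.  */

static void
sketch_scale_epilog_profile (class loop *epilog,
			     profile_probability prob_epilog,
			     unsigned int vf)
{
  gcov_type bound_epilog = (gcov_type) vf - 1;
  scale_loop_profile (epilog, prob_epilog, bound_epilog);
}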