Hi Both, Thanks for all the reviews/patches so far 😊
> > > > Looks good, but I wonder what we can do to at least make the multiple > > exit case behave reasonably? The vectorizer keeps track > > > of a "canonical" exit, would it be possible to pass in the main exit > > edge and use that instead of single_exit (), would other exits then > > behave somewhat reasonable or would we totally screw things up here? > > That is, the "canonical" exit would be the counting exit while the > > other exits are on data driven conditions and thus wouldn't change > > probability when we reduce the number of iterations(?) > > I can add canonical_exit parameter and make the function to direct flow to it > if > possible. However overall I think fixup depends on what transformation led to > the change. > > Assuming that vectorizer did no prologues and apilogues and we vectorized > with factor N, then I think the update could be done more specifically as > follows. > If it helps, how this patch series addresses multiple exits by forcing a scalar epilogue, all non canonical_exits would have been redirected to this scalar epilogue, so the remaining scalar iteration count will be at most VF. Regards, Tamar > We know that header block count dropped by 4. So we can start from that > and each time we reach basic block with exit edge, we know the original count > of the edge. This count is unchanged, so one can rescale probabilities out of > that BB accordingly. If loop has no inner loops, we can just walk the body in > RPO and propagate scales downwards and we sould arrive to right result > > I originally added the bound parameter to handle prologues/epilogues which > gets new artificial bound. In prologue I think you are right that the flow > will be > probably directed to the conditional counting iterations. > > In epilogue we add no artificial iteration cap, so maybe it is more realistic > to > simply scale up probability of all exits? > > To see what is going on I tried following testcase: > > int a[99]; > test() > { > for (int i = 0; i < 99; i++) > a[i]++; > } > > What surprises me is that vectorizer at -O2 does nothing and we end up > unrolling the loop: > > L2: > addl $1, (%rax) > addl $1, 4(%rax) > addl $1, 8(%rax) > addq $12, %rax > cmpq $a+396, %rax > > Which seems sily thing to do. Vectorized loop with epilogue doing 2 and > 1 addition would be better. > > With -O3 we vectorize it: > > > .L2: > movdqa (%rax), %xmm0 > addq $16, %rax > paddd %xmm1, %xmm0 > movaps %xmm0, -16(%rax) > cmpq %rax, %rdx > jne .L2 > movq a+384(%rip), %xmm0 > addl $1, a+392(%rip) > movq .LC1(%rip), %xmm1 > paddd %xmm1, %xmm0 > movq %xmm0, a+384(%rip) > > > and correctly drop vectorized loop body to 24 iterations. However the > epilogue has loop for vector size 2 predicted to iterate once (it won't) > > ;; basic block 7, loop depth 0, count 10737416 (estimated locally), maybe > hot > ;; prev block 5, next block 8, flags: (NEW, VISITED) > ;; pred: 3 [4.0% (adjusted)] count:10737416 (estimated locally) > (FALSE_VALUE,EXECUTABLE) > ;; succ: 8 [always] count:10737416 (estimated locally) > (FALLTHRU,EXECUTABLE) > > ;; basic block 8, loop depth 1, count 21474835 (estimated locally), maybe > hot > ;; prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED) > ;; pred: 9 [always] count:10737417 (estimated locally) > (FALLTHRU,DFS_BACK,EXECUTABLE) > ;; 7 [always] count:10737416 (estimated locally) > (FALLTHRU,EXECUTABLE) > # i_9 = PHI <i_17(9), 96(7)> > # ivtmp_13 = PHI <ivtmp_18(9), 3(7)> > # vectp_a.14_40 = PHI <vectp_a.14_41(9), &MEM <int[99]> [(void *)&a + > 384B](7)> > # vectp_a.18_46 = PHI <vectp_a.18_47(9), &MEM <int[99]> [(void *)&a + > 384B](7)> > # ivtmp_49 = PHI <ivtmp_50(9), 0(7)> > vect__14.16_42 = MEM <vector(2) int> [(int *)vectp_a.14_40]; > _14 = a[i_9]; > vect__15.17_44 = vect__14.16_42 + { 1, 1 }; > _15 = _14 + 1; > MEM <vector(2) int> [(int *)vectp_a.18_46] = vect__15.17_44; > i_17 = i_9 + 1; > ivtmp_18 = ivtmp_13 - 1; > vectp_a.14_41 = vectp_a.14_40 + 8; > vectp_a.18_47 = vectp_a.18_46 + 8; > ivtmp_50 = ivtmp_49 + 1; > if (ivtmp_50 < 1) > goto <bb 9>; [50.00%] > else > goto <bb 12>; [50.00%] > > and finally the scalar copy > > ;; basic block 12, loop depth 0, count 10737416 (estimated locally), maybe > hot > ;; prev block 9, next block 13, flags: (NEW, VISITED) > ;; pred: 8 [50.0% (adjusted)] count:10737418 (estimated locally) > (FALSE_VALUE,EXECUTABLE) > ;; succ: 13 [always] count:10737416 (estimated locally) (FALLTHRU) > > ;; basic block 13, loop depth 1, count 1063004409 (estimated locally), > maybe hot > ;; prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED) > ;; pred: 14 [always] count:1052266996 (estimated locally) > (FALLTHRU,DFS_BACK,EXECUTABLE) > ;; 12 [always] count:10737416 (estimated locally) (FALLTHRU) > # i_30 = PHI <i_36(14), 98(12)> > # ivtmp_32 = PHI <ivtmp_37(14), 1(12)> > _33 = a[i_30]; > _34 = _33 + 1; > a[i_30] = _34; > i_36 = i_30 + 1; > ivtmp_37 = ivtmp_32 - 1; > if (ivtmp_37 != 0) > goto <bb 14>; [98.99%] > else > goto <bb 4>; [1.01%] > > With also small but non-zero iteration probability. This is papered > over by my yesterday patch. But it seems to me that it would be a lot better > if > vectorizer understood that the epilogue will be loopless and accounted it to > the cost model that would probably make it easy to enable it at cheap costs > too. > > Clang 16 at -O2 is much more aggressive by both vectorizing and unroling: > > test: # @test > .cfi_startproc > # %bb.0: > movdqa a(%rip), %xmm1 > pcmpeqd %xmm0, %xmm0 > psubd %xmm0, %xmm1 > movdqa %xmm1, a(%rip) > movdqa a+16(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+16(%rip) > movdqa a+32(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+32(%rip) > movdqa a+48(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+48(%rip) > movdqa a+64(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+64(%rip) > movdqa a+80(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+80(%rip) > movdqa a+96(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+96(%rip) > movdqa a+112(%rip), %xmm1 > psubd %xmm0, %xmm1 > .... > movdqa %xmm1, a+240(%rip) > movdqa a+256(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+256(%rip) > movdqa a+272(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+272(%rip) > movdqa a+288(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+288(%rip) > movdqa a+304(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+304(%rip) > movdqa a+320(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+320(%rip) > movdqa a+336(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+336(%rip) > movdqa a+352(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+352(%rip) > movdqa a+368(%rip), %xmm1 > psubd %xmm0, %xmm1 > movdqa %xmm1, a+368(%rip) > addl $1, a+384(%rip) > addl $1, a+388(%rip) > addl $1, a+392(%rip) > retq > > Honza