https://llvm.org/bugs/show_bug.cgi?id=29065
Bug ID: 29065
Summary: [SCEV] scev expansion generates redundent code in memcheck during vectorization
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: normal
Priority: P
Component: Scalar Optimizations
Assignee: unassignedb...@nondot.org
Reporter: w...@google.com
CC: llvm-bugs@lists.llvm.org
Classification: Unclassified

Testcase 1.c:
-----------------------------------
#define COMBP(d, s, m) ( ((d) & ~(m)) | ((s) & (m)) )

void foo(unsigned *dd, int dw, int dh, unsigned *ds, int dp, int sp,
         int sx, int sy, int dx, int dy, unsigned lwmask) {
  int i, j;
  unsigned *ls, *ld;
  int nw = dw >> 5;
  int lwbits = dw & 31;
  unsigned *psd = ds + sp * sy + (sx >> 5);
  unsigned *pdd = dd + dp * dy + (dx >> 5);

  for (i = 0; i < dh; i++) {
    ls = psd + i * sp;
    ld = pdd + i * dp;
    for (j = 0; j < nw; j++) {
      *ld = (*ls & *ld);
      ld++;
      ls++;
    }
  }
}
------------------------------------

For the memcheck of the inner-loop vectorization, we expect code like this:

  ls = psd + i * sp;
  ld = pdd + i * dp;
  if (ls < ld + nw && ld < ls + nw)
    goto conflict;

However, some of the computations defining psd and pdd are regenerated inside
the preheader of the inner loop, and these redundant computations are not
cleaned up by later passes.

for.body:                                   ; preds = %for.inc21, %for.body.lr.ph
  %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.inc21 ]
  %10 = mul i64 %0, %indvars.iv
  %11 = add i64 %5, %10
  %scevgep = getelementptr i32, i32* %dd, i64 %11
      ==> The add involving %dd was already computed outside the loop when
          defining pdd. Why not use pdd directly here?
  %scevgep51 = bitcast i32* %scevgep to i8*
  %12 = add i64 %7, %10
  %scevgep52 = getelementptr i32, i32* %dd, i64 %12
      ==> The add involving %dd was already computed outside the loop when
          defining pdd. Why not use pdd directly here?
  %scevgep5253 = bitcast i32* %scevgep52 to i8*
  %13 = mul i64 %1, %indvars.iv
  %14 = add i64 %8, %13
  %scevgep54 = getelementptr i32, i32* %ds, i64 %14
      ==> The add involving %ds was already computed outside the loop when
          defining psd. Why not use psd directly here?
  %scevgep5455 = bitcast i32* %scevgep54 to i8*
  %15 = add i64 %9, %13
  %scevgep56 = getelementptr i32, i32* %ds, i64 %15
      ==> The add involving %ds was already computed outside the loop when
          defining psd. Why not use psd directly here?
  %scevgep5657 = bitcast i32* %scevgep56 to i8*
  br i1 %cmp1742, label %for.body18.preheader, label %for.inc21
...
vector.memcheck:                            ; preds = %min.iters.checked
  %bound0 = icmp ule i8* %scevgep51, %scevgep5657
  %bound1 = icmp ule i8* %scevgep5455, %scevgep5253
  %found.conflict = and i1 %bound0, %bound1
  %memcheck.conflict = and i1 %found.conflict, true

Final assembly for the memcheck:

# BB#5:                                 # %vector.memcheck
                                        #   in Loop: Header=BB0_2 Depth=1
        movq    %r9, %r10
        movq    %r13, %r9
        movq    %rbp, %r13
        movq    -48(%rsp), %rdx         # 8-byte Reload
        leaq    (%rdx,%rbx), %rbp
        leaq    (%r13,%rbp,4), %r15
        movq    -56(%rsp), %rdx         # 8-byte Reload
        leaq    (%rdx,%rax), %rdx
        movq    -104(%rsp), %rbp        # 8-byte Reload
        leaq    (%rbp,%rdx,4), %rdx
        cmpq    %rdx, %r15
        movq    -104(%rsp), %rbp        # 8-byte Reload
        ja      .LBB0_8
# BB#6:                                 # %vector.memcheck
                                        #   in Loop: Header=BB0_2 Depth=1
        addq    -72(%rsp), %rbx         # 8-byte Folded Reload
        leaq    (%r13,%rbx,4), %rdx
        addq    -64(%rsp), %rax         # 8-byte Folded Reload
        leaq    (%rbp,%rax,4), %rax
        cmpq    %rdx, %rax
        ja      .LBB0_8

When the trip count of the inner loop is small, the preparation code for
vectorization/unrolling in the inner loop's preheader matters. Such preparation
code, which computes the loop trip count or generates the runtime check, is
often produced by SCEV expansion. Here the redundancy exists because SCEV
expansion doesn't reuse the existing values psd and pdd.

* D12090 and D21313 relieved the reuse problem of SCEV expansion somewhat, so
  why are they defeated here?
  A new SCEV expr is created by decomposing and recombining the original
  SCEVs, so the original SCEVs no longer appear inside the new SCEV expr. For
  example, an old SCEVAddExpr S, which is {a + b}, has an existing value
  v = a + b associated with it. When we generate the SCEV for the expr S + c,
  a new SCEVAddExpr with three operands, {a + b + c}, is created, and S cannot
  be found inside it. When we expand the new SCEV, it only generates
  a + b + c instead of v + c.

* Why can't such redundancies be cleaned up by later passes?

  SCEV transformation can make the expanded expr very different from the expr
  that should be reused. For example, SCEV expansion wants to turn things like
  ptrtoint+arithmetic+inttoptr into a GEP, so the expr may be aggressively
  reassociated. After that, enabling extra-vectorizer-passes and even
  NaryReassociate cannot clean up the redundancies.

To solve the problem, we could either add more facilities to enhance reuse
during SCEV expansion or enhance the cleanup passes. I am wondering which way
is better to pursue.
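
The {a + b} / {a + b + c} example above can be modelled with a small toy
sketch. This is not LLVM's SCEV implementation: AddExpr, make_add,
add_operand, same_expr and contains_operands are hypothetical names invented
for illustration, and operands are just integer ids. It only shows why an
exact-match lookup for S fails on the flattened expression, and what an
operand-subset query (one conceivable form of extra reuse support, assumed
here) would find instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OPS 8

/* Toy stand-in for a flattened n-ary add: operand ids kept sorted. */
typedef struct {
    int ops[MAX_OPS];
    size_t n;
} AddExpr;

/* Build an add expression from operand ids (insertion sort keeps order). */
AddExpr make_add(const int *ids, size_t n) {
    AddExpr e = { {0}, 0 };
    assert(n <= MAX_OPS);
    for (size_t i = 0; i < n; i++) {
        size_t j = e.n++;
        while (j > 0 && e.ops[j - 1] > ids[i]) {
            e.ops[j] = e.ops[j - 1];
            j--;
        }
        e.ops[j] = ids[i];
    }
    return e;
}

/* S + c: the operands of S are merged with c into one flat list,
 * so S itself does not survive as a sub-node of the result. */
AddExpr add_operand(AddExpr s, int id) {
    int ids[MAX_OPS];
    assert(s.n + 1 <= MAX_OPS);
    for (size_t i = 0; i < s.n; i++)
        ids[i] = s.ops[i];
    ids[s.n] = id;
    return make_add(ids, s.n + 1);
}

/* Exact-match lookup: true only if both expressions are identical. */
bool same_expr(AddExpr a, AddExpr b) {
    if (a.n != b.n)
        return false;
    for (size_t i = 0; i < a.n; i++)
        if (a.ops[i] != b.ops[i])
            return false;
    return true;
}

/* Hypothetical reuse query: are all operands of `sub` present in `e`?
 * Both operand lists are sorted, so a single merge-style scan suffices. */
bool contains_operands(AddExpr e, AddExpr sub) {
    size_t i = 0, j = 0;
    while (i < e.n && j < sub.n) {
        if (e.ops[i] == sub.ops[j]) {
            i++;
            j++;
        } else if (e.ops[i] < sub.ops[j]) {
            i++;
        } else {
            return false;
        }
    }
    return j == sub.n;
}
```

With a = 1, b = 2, c = 3, building S from {1, 2} and then calling
add_operand(S, 3) yields a three-operand expression for which
same_expr(result, S) is false while contains_operands(result, S) is true,
mirroring the a + b + c versus v + c situation described above.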