On Mon, Jun 23, 2014 at 11:49 AM, Bin Cheng <bin.ch...@arm.com> wrote: > Hi, > For below simplified case: > > #define LEN (32000) > __attribute__((aligned(16))) float a[LEN],b[LEN]; > > int foo (int M) > { > for (int i = 0; i < M; i++) > a[i+M] = a[i] + b[i]; > } > > Compiling it with command like: > $ aarch64-elf-gcc -O3 -S foo.c -o foo.S -std=c99 > > The assembly code of vectorized loop is in below form: > mov x1, 0 > mov w2, 0 > .L4: > ldr q0, [x1, x3] > add w2, w2, 1 > ldr q1, [x1, x4] > cmp w2, w5 > fadd v0.4s, v0.4s, v1.4s > str q0, [x6, x1] > add x1, x1, 16 > bcc .L4 > > Induction variable w2 is unnecessary and can be eliminated with x1. This is > safe because X1 will never overflow during all iterations of loop. The > optimal assembly should be like: > mov x1, 0 > .L4: > ldr q0, [x1, x2] > ldr q1, [x1, x4] > fadd v0.4s, v0.4s, v1.4s > str q0, [x5, x1] > add x1, x1, 16 > cmp x1, x3 > bcc .L4 > > This case can be handled if we do more complex overflow check on unsigned > type in function iv_elimination_compare_lt. > > Also there is another blocker for the transformation, function > number_of_iterations_lt calls fold_build2 to build folded form of > may_be_zero, while iv_elimination_compare_lt only handles simple form tree > expressions. It's possible to have iv_elimination_compare_lt to do some > undo transformation on may_be_zero, but I found it's difficult for cases > involving signed/unsigned conversion like case loop-41.c. Since I think > there is no obvious benefit to fold may_be_zero here (somehow because the > two operands are already in folded forms), this patch just calls build2_loc > instead.
But it may fold to true/false, no? > This optimization is picked up by patch B, but patch A is necessary since > the check in iv_elimination_compare_lt of two aff_trees isn't enough when > two different types (one in signed type, the other in unsigned) involved. I > have to use tree comparison here instead. Considering below simple case: > > Analyzing # of iterations of loop 5 > exit condition [1, + , 1](no_overflow) <= i_1 > bounds on difference of bases: -3 ... 1 > result: > zero if i_1 + 1 < 1 > # of iterations (unsigned int) i_1, bounded by 2 > number of iterations (unsigned int) i_1; zero if i_1 + 1 < 1 > > use 0 > compare > in statement if (S.7_9 > i_1) > > at position > type integer(kind=4) > base 1 > step 1 > is a biv > related candidates > > candidate 0 (important) > var_before ivtmp.28 > var_after ivtmp.28 > incremented at end > type unsigned int > base 0 > step 1 > > When GCC trying to eliminate use 0 with cand 0, the miscellaneous trees in > iv_elimination_compare_lt are like below with i_1 of signed type: > B: i_1 + 1 > A: 0 > niter->niter: (unsigned int)i_1 > > Apparently, (B-A-1) is "i_1", which doesn't equal to "(unsigned int)i_1". > Without this patch, it is considered equal to each other. just looking at this part. Do you have a testcase that exhibits a difference when just applying patch A? So I can have a look here? >From the code in iv_elimination_compare_lt I can't figure why we'd end up with i_1 instead of (unsigned int)i_1 as we convert to a common type. I suppose the issue may be that at tree_to_aff_combination time we strip all nops with STRIP_NOPS but when operating on ->rest via convert/scale or add we do not strip them again. But then 'nit' should be i_1, not (unsigned int)i_1. So the analysis above really doesn't look correct. Just to make sure we don't paper over an issue in tree-affine.c. Thus - testcase? On x86 we don't run into this place in iv_elimination_compare_lt (on an unpatched tree). CCing Zdenek for the meat of patch B. Thanks, Richard. > Note that the induction variable IS necessary on 32 bit systems since > otherwise there is type overflow. > > These two patch fix the mentioned problem. > They pass bootstrap and regression test on x86_64/x86/aarch64/arm, so any > comments? > > Thanks, > bin > > PATCH A) > > 2014-06-23 Bin Cheng <bin.ch...@arm.com> > > * tree-ssa-loop-ivopts.c (iv_elimination_compare_lt): Check number > of iteration using tree comparison. > > PATCH B) > > 2014-06-23 Bin Cheng <bin.ch...@arm.com> > > * tree-ssa-loop-niter.c (number_of_iterations_lt): Build unfolded > form of may_be_zero. > * tree-ssa-loop-ivopts.c (iv_nowrap_period) > (nowrap_cand_for_loop_niter_p): New functions. > (period_greater_niter_exit): New function refactored from > may_eliminate_iv. > (iv_elimination_compare_lt): New parameter. Relax overflow check. > Handle special forms may_be_zero expression. > (may_eliminate_iv): Call period_greater_niter_exit. Pass new > argument for iv_elimination_compare_lt. > > gcc/testsuite/ChangeLog > 2014-06-23 Bin Cheng <bin.ch...@arm.com> > > * gcc.dg/tree-ssa/loop-40.c: New test. > * gcc.dg/tree-ssa/loop-41.c: New test.