On Thu, Nov 18, 2021 at 4:18 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > Hi Richard, > Many thanks for the patch review. > > On Tue, Nov 16, 2021 at 12:38 Richard Biener wrote: > > On Mon, Nov 15, 2021 at 2:04 PM Roger Sayle > > <ro...@nextmovesoftware.com> wrote: > > > > > > This patch tidies up the code that GCC generates for simple loops, by > > > selecting/generating a simpler loop bound expression in ivopts. > > > Previously: > > > > > > void v1 (unsigned long *in, unsigned long *out, unsigned int n) { > > > int i; > > > for (i = 0; i < n; i++) { > > > out[i] = in[i]; > > > } > > > } > > > > > > on x86_64 generated: > > > v1: testl %edx, %edx > > > je .L1 > > > movl %edx, %edx > > > xorl %eax, %eax > > > .L3: movq (%rdi,%rax,8), %rcx > > > movq %rcx, (%rsi,%rax,8) > > > addq $1, %rax > > > cmpq %rax, %rdx > > > jne .L3 > > > .L1: ret > > > > > > and now instead generates: > > > v1: testl %edx, %edx > > > je .L1 > > > movl %edx, %edx > > > xorl %eax, %eax > > > leaq 0(,%rdx,8), %rcx > > > .L3: movq (%rdi,%rax), %rdx > > > movq %rdx, (%rsi,%rax) > > > addq $8, %rax > > > cmpq %rax, %rcx > > > jne .L3 > > > .L1: ret > > > > Is that actually better? IIRC the addressing modes are both complex and we > > now have an extra lea? > > > Technically the induction variable elimination is removing two multiplies by 8 > (or left shifts) from the body of the loop and replacing them with a single > multiply by 8 prior to the loop, which is exactly the induction variable > optimization that ivopts is designed to do. It's true that with x86's complex > addressing modes these "multiplications" are free, but ivopts is run on all > targets including those without indexed addressing, and even on x86 there's > a benefit from shorter instruction encodings. > > > For this case I see we generate > > _15 = n_10(D) + 4294967295; > > _8 = (unsigned long) _15; > > _7 = _8 + 1; > > > > where n is unsigned int so if we know that n is not zero we can simplify the > > addition and conveniently the loop header test provides this guarantee. > > IIRC there were some attempts to enhance match.pd for some cases of such > > expressions. > > Exactly. The loop optimizers are generating the expression (x-1)+1 and > assuming that the middle-end will simplify this to x. Unfortunately, the > change of modes (designed to prevent overflow) actually makes this > impossible to simplify, due to concerns about overflow. > > > + /* If AFTER_ADJUST is required, the code below generates the equivalent > > + * of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression > > + * BASE + (NITER + 1) * STEP, especially when NITER is often of the form > > + * SSA_NAME - 1. Unfortunately, guaranteeing that adding 1 to NITER > > + * doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC > > + * class for common idioms that we know are safe. */ > > > > No '* ' each line. > Doh! Thanks. Sometimes I hate vi. > > > I wonder if the non-overflowing can be captured by > > integer_onep (iv->step) > > && max_stmt_executions (loop, &max) > > Unfortunately, max_stmt_executions is intended to return the wide_int count > of iterations, either the known constant value or a profile-based estimate, > while > the optimizations I'm proposing work with symbolic values from variable > iteration > counts. When the iteration count is a constant, fold-const is already able to > simplify (x-1)+1 to an integer constant even with type conversions. > > > if we then do (niter + 1) * step instead of niter*step + step would that do > > the > > same? > > Yes. I've extended the scope of the patch to now also handle loops of the > form (for i=beg; i<end; i++), where the original niter is (end-beg)-1, and we > now use (end-beg) as the incremented niter, so the invariant expression now > becomes (end-beg)*4 instead of the currently generated: ((end-beg)-1)*4 + 4. > > I'm assuming that the niter*step + step is by design (for the general case), > so I'm only tweaking the (common) corner cases, where it's easy to see that > it's safe to substitute a simpler expression. For more general affine > recurrences > in complex loop nests, niter*step + step may be preferred. > > > That said - what the change does is actually ensure that we CSE niter + 1 > > with the bound of the simplified exit test? > > Not quite, this simply provides a simplified expression for "niter + 1" that > takes advantage of the implicit range information we have. You're right > in theory that ranger/vrp may be able to help simplify (long)((int)x-1)+1L > more generally. In this case, I suspect it's just a historical artifact that > much of the literature on (iv) loop optimizations dates back to Fortran > and Algol where arrays were 1-based rather than 0-based, so decades > later gcc 11 on x86_64 sometimes requires an extra inc/dec instruction > for purely historical reasons. > > > The patch doesn't add any testcase. > > The three new attached tests check that the critical invariants have a > simpler form, and hopefully shouldn't be affected by whether the optimizer > and/or backend costs actually decide to perform this iv substitution or not.
The testcases might depend on lp64 though, did you test them with -m32? IMHO it's fine to require lp64 here. I'm a bit unsure about adding this special-casing in cand_value_at in general - it does seem that we're doing sth wrong elsewhere - either by not simplifying even though enough knowledge is there or by throwing away knowledge earlier (during niter analysis?). Anyway, the patch does look quite safe - can you show some statistics in how many time there's extra simplification this way during say bootstrap? Thanks, Richard. > > Thanks, > > Richard. > > > This revised patch has been tested on x86_64-pc-linux-gnu with a > make bootstrap and make -k check with no new failures. > > 2021-11-18 Roger Sayle <ro...@nextmovesoftware.com> > > gcc/ChangeLog > * tree-ssa-loop-ivopts.c (cand_value_at): Take a class > tree_niter_desc* argument instead of just a tree for NITER. > If we require the iv candidate value at the end of the final > loop iteration, try using the original loop bound as the > NITER for sufficiently simple loops. > (may_eliminate_iv): Update (only) call to cand_value_at. > > gcc/testsuite > * gcc.dg/tree-ssa/ivopts-5.c: New test case. > * gcc.dg/tree-ssa/ivopts-6.c: New test case. > * gcc.dg/tree-ssa/ivopts-7.c: New test case. > * gcc.dg/wrapped-binop-simplify.c: Update expected test result. > > > Many thanks in advance (fingers-crossed this is still > suitable/grandfathered-in > at the start of Stage 3). > > Roger > -- >