On Mon, Mar 19, 2018 at 5:08 PM, Aldy Hernandez <al...@redhat.com> wrote:
> Hi Richard.
>
> As discussed in the PR, the problem here is that we have two different
> iterations of an IV live outside of a loop.  This inhibits us from using
> autoinc/dec addressing on ARM, and causes extra lea's on x86.
>
> An abbreviated example is this:
>
> loop:
>   # p_9 = PHI <p_17(2), p_20(3)>
>   p_20 = p_9 + 18446744073709551615;
>   goto loop
>   p_24 = p_9 + 18446744073709551614;
>   MEM[(char *)p_20 + -1B] = 45;
>
> Here we have both the previous IV (p_9) and the current IV (p_20) used
> outside of the loop.  On Arm this keeps us from using auto-dec addressing,
> because one use is -2 and the other one is -1.
>
> With the attached patch we attempt to rewrite out-of-loop uses of the IV in
> terms of the current/last IV (p_20 in the case above).  With it, we end up
> with:
>
>   p_24 = p_20 + 18446744073709551615;
>   *p_24 = 45;
>
> ...which helps both x86 and Arm.
>
> As you have suggested in comment 38 on the PR, I handle specially
> out-of-loop IV uses of the form IV+CST and propagate those accordingly
> (along with the MEM_REF above).  Otherwise, in less specific cases, we un-do
> the IV increment, and use that value in all out-of-loop uses.  For instance,
> in the attached testcase, we rewrite:
>
>   george (p_9);
>
> into
>
>   _26 = p_20 + 1;
>   ...
>   george (_26);
>
> The attached testcase tests the IV+CST specific case, as well as the more
> generic case with george().
>
> Although the original PR was for ARM, this behavior can be noticed on x86,
> so I tested on x86 with a full bootstrap + tests.  I also ran the specific
> test on an x86 cross ARM build and made sure we had 2 auto-dec with the
> test.  For the original test (slightly different than the testcase in this
> patch), with this patch we are at 104 bytes versus 116 without it.  There is
> still the issue of a division optimization which would further reduce the
> code size.  I will discuss this separately as it is independent from this
> patch.
>
> Oh yeah, we could make this more generic, and maybe handle any multiple of
> the constant, or perhaps *= and /=.  Perhaps something for next stage1...
>
> OK for trunk?

Just FYI, this looks similar to what I did in
https://gcc.gnu.org/ml/gcc-patches/2013-11/msg00535.html
That change was non-trivial and didn't give an obvious improvement back
then.  But I still wonder whether this can be done when rewriting iv_use,
in a low-overhead way.

Thanks,
bin

> Aldy
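
For readers following the quoted GIMPLE, here is a minimal C sketch of the equivalence the patch relies on. It is not the testcase attached to the patch: the function names (emit_before, emit_after, check_rewrite), the digit-emitting loop body, and the generic_use out-parameter standing in for the george() call are all hypothetical. The large constants in the dump are -1 and -2 seen as unsigned 64-bit values, so p_20 = p_9 - 1, and any out-of-loop use of p_9 can be expressed through p_20.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the code before the rewrite: after the loop, both
   the previous IV (prev, like p_9) and the current IV (cur, like p_20)
   are still live. */
char *emit_before(char *p, unsigned n, char **generic_use)
{
    char *prev, *cur = p;
    do {
        prev = cur;                      /* p_9 = PHI <...>          */
        cur = prev - 1;                  /* p_20 = p_9 - 1           */
        *cur = (char)('0' + n % 10);
    } while ((n /= 10) != 0);

    char *res = prev - 2;                /* p_24 = p_9 - 2           */
    *(cur - 1) = '-';                    /* MEM[p_20 + -1B] = 45     */
    *generic_use = prev;                 /* george (p_9)             */
    return res;
}

/* Hypothetical model after the rewrite: every out-of-loop use is
   expressed in terms of the last IV only. */
char *emit_after(char *p, unsigned n, char **generic_use)
{
    char *cur = p;
    do {
        cur = cur - 1;
        *cur = (char)('0' + n % 10);
    } while ((n /= 10) != 0);

    char *res = cur - 1;                 /* p_24 = p_20 - 1          */
    *res = '-';                          /* *p_24 = 45               */
    *generic_use = cur + 1;              /* _26 = p_20 + 1           */
    return res;
}

/* Checks that both shapes write the same bytes and produce the same
   pointers, compared as offsets into their own buffers. */
void check_rewrite(unsigned n)
{
    char a[32] = {0}, b[32] = {0};
    char *ga, *gb;
    char *ra = emit_before(a + 31, n, &ga);
    char *rb = emit_after(b + 31, n, &gb);
    assert(ra - a == rb - b);
    assert(ga - a == gb - b);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

Calling check_rewrite() with a few values exercises both the IV+CST case and the generic "undo the increment" case. Whether a given target then picks auto-dec addressing is up to later passes and is not something this sketch can show.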