On Wed, Mar 28, 2018 at 1:37 PM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Tue, Mar 20, 2018 at 2:36 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Mon, Mar 19, 2018 at 6:08 PM, Aldy Hernandez <al...@redhat.com> wrote:
>>> Hi Richard.
>>>
>>> As discussed in the PR, the problem here is that we have two different
>>> iterations of an IV live outside of a loop.  This inhibits us from using
>>> autoinc/dec addressing on ARM, and causes extra lea's on x86.
>>>
>>> An abbreviated example is this:
>>>
>>> loop:
>>>   # p_9 = PHI <p_17(2), p_20(3)>
>>>   p_20 = p_9 + 18446744073709551615;
>>>   goto loop
>>> p_24 = p_9 + 18446744073709551614;
>>> MEM[(char *)p_20 + -1B] = 45;
>>>
>>> Here we have both the previous IV (p_9) and the current IV (p_20) used
>>> outside of the loop.  On ARM this keeps us from using auto-dec
>>> addressing, because one use is -2 and the other one is -1.
>>>
>>> With the attached patch we attempt to rewrite out-of-loop uses of the IV
>>> in terms of the current/last IV (p_20 in the case above).  With it, we
>>> end up with:
>>>
>>> p_24 = p_20 + 18446744073709551615;
>>> *p_24 = 45;
>>>
>>> ...which helps both x86 and ARM.
>>>
>>> As you suggested in comment 38 on the PR, I handle out-of-loop IV uses
>>> of the form IV+CST specially and propagate those accordingly (along with
>>> the MEM_REF above).  Otherwise, in less specific cases, we undo the IV
>>> increment and use that value in all out-of-loop uses.  For instance, in
>>> the attached testcase, we rewrite:
>>>
>>> george (p_9);
>>>
>>> into
>>>
>>> _26 = p_20 + 1;
>>> ...
>>> george (_26);
>>>
>>> The attached testcase tests the IV+CST specific case, as well as the
>>> more generic case with george().
>>>
>>> Although the original PR was for ARM, this behavior can be noticed on
>>> x86, so I tested on x86 with a full bootstrap + tests.  I also ran the
>>> specific test on an x86-hosted ARM cross build and made sure we get two
>>> auto-dec instructions with it.  For the original test (slightly
>>> different from the testcase in this patch), with this patch we are at
>>> 104 bytes versus 116 without it.  There is still the issue of a division
>>> optimization which would further reduce the code size; I will discuss
>>> that separately, as it is independent from this patch.
>>>
>>> Oh yeah, we could make this more generic and maybe handle any multiple
>>> of the constant, or perhaps *= and /=.  Perhaps something for next
>>> stage1...
>>>
>>> OK for trunk?
>>
>> Sorry for noticing so late, but you use loop properties like ->latch and
>> ->loop_father (via flow_bb_inside_loop_p) that may not be valid at this
>> point, given that expand doesn't do loop_optimizer_init/finalize.
>> Generally you may face loops with multiple latches (->latch == NULL)
>> here, and loops may have multiple exits.  You are probably not hitting
>> any of those problems because is_iv_plus_constant is quite restrictive
>> (it doesn't handle PLUS_EXPR or MINUS_EXPR).
>>
>> Also, the latch doesn't need to be the block with the exit conditional.
>>
>> So first of all you'd need to wrap insert_backedge_copies () with
>>
>>   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
>>   ...
>>   loop_optimizer_finalize ();
>>
>> which is required to fix up a possibly broken loop state.
>> Then you probably should restrict handling to PHIs in BBs with
>> bb->loop_father->header == bb (aka loop header PHIs), use
>> get_loop_exit_edges when the IV increment is of OK form, and then use
>> dominator info to query in which exit block to insert the compensation
>> (or simply refuse to do anything for loops with multiple exits).
>>
>> So if you have more cycles to work on this that would be nice,
>> otherwise I can certainly take over from here...
>
> Ok, so I've finally taken a closer look.  There's a thing that goes
> wrong, unpatched, with the current insert_backedge_copies code.  It does
> detect the coalescing conflict, but instead of solving it in a way that
> allows the backedge copy to be coalesced, it inserts a copy on that edge
> (actually before the exit test).  It could have done better by inserting
> a copy before the IV increment and adjusting out-of-loop uses to use
> that.  There's also another IV the existing code triggers on, unhelpfully
> inserting a copy between two uses (the exit test and the producer of the
> backedge value) instead of sinking the producer after the last use
> (splitting the latch edge); that doesn't help on x86, where we combine
> div and mod later, recreating the conflict.
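To make the conflict concrete, here is a rough GIMPLE-level sketch; the
SSA names and the loop shape are purely illustrative, not taken from the
PR's testcase:

loop:
  # i_1 = PHI <i_0(2), i_3(3)>
  ...
  i_3 = i_1 + 1;
  if (cond) goto loop;
use (i_1);    ;; out-of-loop use of the previous iteration's value
use (i_3);

Because i_1 is still used after the definition of i_3, the two names
conflict and cannot be coalesced into one register.  The unpatched
insert_backedge_copies reacts, roughly, by inserting a copy in the source
block of the backedge, just before the exit test, and feeding the PHI from
it:

  i_3 = i_1 + 1;
  i_4 = i_3;    ;; new copy supplying the backedge PHI argument
  if (cond) goto loop;

With i_1 still live at that point, the new copy tends not to coalesce away
either, and it survives as the extra move (and latch-block jump) visible in
the "before" assembly quoted below.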
So here's the attempt at fixing the existing code.  The basic idea is that
instead of inserting the required backedge copy on the backedge (without
splitting it, so in its source block), we insert it before the definition
of the backedge value and adjust the downstream uses that would cause the
coalescing conflict to use the copy destination.

For the testcase in question, where the previous code worked for neither
of the two IVs, it now works for both, and on x86 we get

.L2:
        movl    %ecx, %eax
        xorl    %edx, %edx
        movq    %rbx, %rbp
        decq    %rbx
        divl    %esi
        addl    $48, %edx
        movb    %dl, (%rbx)
        movl    %ecx, %edx
        movl    %eax, %ecx
        cmpl    $9, %edx
        ja      .L2

Note there's no latch block (and thus no extra jump), but there are extra
copies inside of the loop, compared to before:

.L2:
        movl    %ecx, %eax
        xorl    %edx, %edx
        leaq    -1(%rbx), %rbp
        divl    %esi
        addl    $48, %edx
        movb    %dl, -1(%rbx)
        cmpl    $9, %ecx
        jbe     .L7
        movq    %rbp, %rbx
        movl    %eax, %ecx
        jmp     .L2

where there's the extra jump (and two copies in the new latch block).

Overall the code-size improvement isn't great, just two bytes on x86 and 8
bytes on arm.  On arm we manage to use a post-inc/dec strb inside of the
loop but not outside.  And of course the latch block is gone:

.L2:
        mov     r0, r5
        mov     r1, #10
        bl      __aeabi_uidivmod
        mov     r3, r5
        add     r1, r1, #48
        mov     r6, r4
        strb    r1, [r4, #-1]!
        umull   r0, r1, r5, r8
        cmp     r3, #9
        lsr     r5, r1, #3
        bhi     .L2

compared to

.L2:
        mov     r1, #10
        mov     r0, r6
        bl      __aeabi_uidivmod
        umull   r2, r3, r6, r8
        add     r1, r1, #48
        cmp     r6, #9
        sub     r5, r4, #1
        strb    r1, [r4, #-1]
        lsr     r3, r3, #3
        bhi     .L4
... ouch, end of function ...
.L4:
        mov     r4, r5
        mov     r6, r3
        b       .L2

The new placement of the copy obviously doesn't work for PHI defs nor for
constants or default defs, so that's what I keep the old code for.  The old
code also didn't work for default defs because trivially_conflicts_p is too
trivial.  The SSA_NAME_VAR (arg) != SSA_NAME_VAR (result) case should be
obsolete by now.

With this I'm not sure where to take this from here - certainly avoiding
having the pre-increment value used outside of the loop (or even inside
it!) by undoing the increment for those uses would be an improvement, but
given it's late for GCC 8 and the regression is quite old, I'd rather
postpone that to GCC 9.  I also think fixing up the broken existing
compensation code should be done first.  So, Jeff, any opinion on the above
and the patch?  IIRC you introduced this code.

Thanks,
Richard.
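For concreteness, the new copy placement, applied to the same illustrative
loop sketched earlier (again, the names are made up rather than taken from
the attached patch), would look roughly like:

loop:
  # i_1 = PHI <i_0(2), i_3(3)>
  ...
  i_6 = i_1;    ;; copy of the previous value, inserted before the
  i_3 = i_1 + 1;   ;; definition of the backedge value
  if (cond) goto loop;
use (i_6);    ;; conflicting downstream uses rewritten to the copy
use (i_3);

Now i_1 is no longer live past the increment, so i_1 and i_3 can coalesce
and the backedge needs no move; the cost is the in-loop copy producing i_6,
which corresponds to the extra in-loop copies in the new assembly above
(e.g. movq %rbx, %rbp on x86 and mov r6, r4 on arm).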
fix-pr70359-2