On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>
>
>
> On 2021/6/25 18:02, Richard Biener wrote:
> > On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
> >>
> >>
> >>
> >> On 2021/6/25 16:54, Richard Biener wrote:
> >>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> >>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>
> >>>> From: Xiong Hu Luo <luo...@linux.ibm.com>
> >>>>
> >>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> >>>> on Power.  For example, it generates mismatched address offset after
> >>>> adjust iv update statement position:
> >>>>
> >>>> <bb 32> [local count: 70988443]:
> >>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>> ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>> _34 = ref_180 + 18446744073709551615;
> >>>> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>> if (_84 == _86)
> >>>>     goto <bb 56>; [94.50%]
> >>>>     else
> >>>>     goto <bb 87>; [5.50%]
> >>>>
> >>>> Disabling it produces:
> >>>>
> >>>> <bb 32> [local count: 70988443]:
> >>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> >>>> ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>> if (_84 == _86)
> >>>>     goto <bb 56>; [94.50%]
> >>>>     else
> >>>>     goto <bb 87>; [5.50%]
> >>>>
> >>>> The later loop-unroll pass can then benefit from the same address
> >>>> offset with different base addresses, which reduces register
> >>>> dependencies.  This patch improves performance by 10% for a typical
> >>>> case on Power; no performance change is observed on x86 or AArch64,
> >>>> since these platforms do not unroll such small loops.  Any comments?
> >>>
> >>> The case you quote is special in that if we hoisted the IV update before
> >>> the other MEM _also_ used in the condition it would be fine again.
> >>
> >> Thanks.  I tried hoisting the IV update statement before the first MEM
> >> (Fix 2); it shows even worse performance because the loop is not
> >> unrolled (two more "base - 1" statements are generated in gimple, so
> >> loop->ninsns is 11 and the small-loop unroller skips it).  Changing the
> >> threshold from 10 to 12 in rs6000_loop_unroll_adjust makes it unroll 2
> >> times as well, and then performance is the SAME as with the IV update
> >> statement in the *MIDDLE* (trunk).  From the ASM we can see that the
> >> index register %r4 is used in both iterations, which may be a
> >> bottleneck for hiding instruction latency.
> >>
> >> So it seems reasonable that performance is better when the IV update
> >> statement is kept *LAST* (Fix 1).
> >>
> >> (Fix 2):
> >>    <bb 32> [local count: 70988443]:
> >>    ivtmp.30_415 = ivtmp.30_414 + 1;
> >>    _34 = ip_229 + 18446744073709551615;
> >>    _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>    _33 = ref_180 + 18446744073709551615;
> >>    _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> >>    if (_84 == _86)
> >>      goto <bb 56>; [94.50%]
> >>    else
> >>      goto <bb 87>; [5.50%]
> >>
> >>
> >> .L67:
> >>          lbzx %r12,%r24,%r4
> >>          lbzx %r25,%r7,%r4
> >>          cmpw %cr0,%r12,%r25
> >>          bne %cr0,.L11
> >>          mr %r26,%r4
> >>          addi %r4,%r4,1
> >>          lbzx %r12,%r24,%r4
> >>          lbzx %r25,%r7,%r4
> >>          mr %r6,%r26
> >>          cmpw %cr0,%r12,%r25
> >>          bne %cr0,.L11
> >>          mr %r26,%r4
> >> .L12:
> >>          cmpdi %cr0,%r10,1
> >>          addi %r4,%r26,1
> >>          mr %r6,%r26
> >>          addi %r10,%r10,-1
> >>          bne %cr0,.L67
> >>
> >>>
> >>> Now, adjust_iv_update_pos doesn't seem to check that the
> >>> condition actually uses the IV use stmt def, so it likely applies to
> >>> too many cases.
> >>>
> >>> Unfortunately the introducing rev didn't come with a testcase,
> >>> but still I think fixing up adjust_iv_update_pos is better than
> >>> introducing a way to short-cut it per target decision.
> >>>
> >>> One "fix" might be to add a check that either the condition
> >>> lhs or rhs is the def of the IV use and the other operand
> >>> is invariant.  Or if it's of similar structure hoist across the
> >>> other iv-use as well.  Not that I understand the argument
> >>> about the overlapping life-range.
> >>>
> >>> You also don't provide a complete testcase ...
> >>>
> >>
> >> Attached is the test code; I will also add it to the patch in a future version.
> >> The issue comes from a very small hot loop:
> >>
> >>          do {
> >>            len++;
> >>          } while(len < maxlen && ip[len] == ref[len]);
> >
> > unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> > {
> >    unsigned int len = 2;
> >    do {
> >        len++;
> >    }while(len < maxlen && ip[len] == ref[len]);
> >    return len;
> > }
> >
> > I can see the effect on this loop on x86_64 as well, we end up with
> >
> > .L6:
> >          movzbl  (%rdi,%rax), %ecx
> >          addq    $1, %rax
> >          cmpb    -1(%rsi,%rax), %cl
> >          jne     .L1
> > .L3:
> >          movl    %eax, %r8d
> >          cmpl    %edx, %eax
> >          jb      .L6
> >
> > but without the trick it is
> >
> > .L6:
> >          movzbl  (%rdi,%rax), %r8d
> >          movzbl  (%rsi,%rax), %ecx
> >          addq    $1, %rax
> >          cmpb    %cl, %r8b
> >          jne     .L1
> > .L3:
> >          movl    %eax, %r9d
> >          cmpl    %edx, %eax
> >          jb      .L6
>
> I verified this small piece of code on x86; there is no performance
> change with or without adjust_iv_update_pos (I checked that the ASM is
> exactly the same as yours):
>
> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out  1
>
> real    0m7.003s
> user    0m6.996s
> sys     0m0.000s
> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out  1
>
> real    0m7.070s
> user    0m7.068s
> sys     0m0.000s
>
>
> But for AArch64, current GCC generates similar code with and without
> adjust_iv_update_pos; the runtime is 10.705s in both cases.
>
> .L6:
>         ldrb    w4, [x6, x3]
>         add     x3, x3, 1
>         ldrb    w5, [x1, x3]
>         cmp     w5, w4
>         bne     .L1
> .L3:
>         mov     w0, w3
>         cmp     w2, w3
>         bhi     .L6
>
> No adjust_iv_update_pos:
>
> .L6:
>         ldrb    w5, [x6, x3]
>         ldrb    w4, [x1, x3]
>         add     x3, x3, 1
>         cmp     w5, w4
>         bne     .L1
> .L3:
>         mov     w0, w3
>         cmp     w2, w3
>         bhi     .L6
>
>
> While built with an old GCC (7.4.1 20190424), it generates worse code,
> and the runtime is 11.664s:
>
> .L6:
>         ldrb    w4, [x6, x3]
>         add     x3, x3, 1
>         add     x5, x1, x3
>         ldrb    w5, [x5, -1]
>         cmp     w5, w4
>         bne     .L1
> .L3:
>         cmp     w3, w2
>         mov     w0, w3
>         bcc     .L6
>
>
> In general, only Power shows a performance regression with
> adjust_iv_update_pos; that's why I tried to add a target hook for it.
> Is this reasonable?  Or should we just remove adjust_iv_update_pos,
> since it doesn't help performance on x86 or other targets?

It was Bin who added the functionality - maybe he can remember the cases
he saw improvements for.

I think we should instead try to change the code so that it doesn't apply to
CONDs where both operands are defined by stmts using the IV?
And this time add a testcase ;)
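[Editor's sketch: a regression test along those lines might look like the
following.  The dg directives and the scan pattern are assumptions on my
part -- in particular, the literal "+ 18446744073709551615" string would
need to be checked against an actual -fdump-tree-ivopts dump once the fix
is in; it is not taken from the patch.]

```c
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-ivopts" } */

/* The hot loop from the thread: compare two byte arrays starting at
   index 3 and return the index of the first mismatch (or maxlen).  */
unsigned int
foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
{
  unsigned int len = 2;
  do {
      len++;
  } while (len < maxlen && ip[len] == ref[len]);
  return len;
}

/* After the fix, both loads should use the same IV value, so no
   "base - 1" pointer (printed as "+ 18446744073709551615" in the dump)
   should be materialized for the second memory reference.
   { dg-final { scan-tree-dump-not "\\+ 18446744073709551615" "ivopts" } } */
```

The scan-tree-dump-not check encodes the expectation that ivopts keeps a
single IV for both loads instead of introducing a decremented base for one
of them.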

Richard.

>
>
> test.c
> #include <stdlib.h>
> __attribute__((noinline)) unsigned int
> foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> {
>         unsigned int len = 2;
>         do {
>                 len++;
>         }while(len < maxlen && ip[len] == ref[len]);
>         return len;
> }
>
> int main (int argc, char* argv[])
> {
>   unsigned char string_a [] = "abcdefghijklmnopqrstuvwxyzmnpppppppppppaaaaaaabbbbbbeeee";
>   unsigned char string_b [] = "abcdefghijklmnopqrstuvwxyzmnpppppppppppaaaaaaabbbbbbeene";
>   unsigned long ret = 0;
>
>   for (long i = 0; i < atoi(argv[1]) * 100000000; i++)
>           ret += foo (string_a, string_b, sizeof(string_a));
>   return ret;
> }
>
>
> >
> > so here you can see the missed fusion.  Of course
> > in this case the IV update could have been sunk into
> > the .L3 block and replicated on the exit edge as well.
> >
> > I'm not sure if the motivation for the change introducing this
> > trick was the above kind of combination or not, but I guess
> > so.  The dependence distance of the IV increment to the
> > use is now shorter, so I'm not sure the combined variant is
> > better.
> >
> > Richard.
> >
> >
> >>
> >> --
> >> Thanks,
> >> Xionghu
>
> --
> Thanks,
> Xionghu
