On Mon, Jul 12, 2021 at 11:00 AM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Mon, Jul 12, 2021 at 4:14 PM Xionghu Luo via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> >
> >
> > On 2021/7/7 21:20, Richard Biener wrote:
> > > On Tue, Jun 29, 2021 at 11:19 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2021/6/28 16:25, Richard Biener wrote:
> > >>> On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo <luo...@linux.ibm.com> 
> > >>> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 2021/6/25 18:02, Richard Biener wrote:
> > >>>>> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo <luo...@linux.ibm.com> 
> > >>>>> wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On 2021/6/25 16:54, Richard Biener wrote:
> > >>>>>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> > >>>>>>> <gcc-patches@gcc.gnu.org> wrote:
> > >>>>>>>>
> > >>>>>>>> From: Xiong Hu Luo <luo...@linux.ibm.com>
> > >>>>>>>>
> > >>>>>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> > >>>>>>>> on Power.  For example, it generates mismatched address offsets after
> > >>>>>>>> adjusting the IV update statement position:
> > >>>>>>>>
> > >>>>>>>> <bb 32> [local count: 70988443]:
> > >>>>>>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> > >>>>>>>> ivtmp.30_415 = ivtmp.30_414 + 1;
> > >>>>>>>> _34 = ref_180 + 18446744073709551615;
> > >>>>>>>> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> > >>>>>>>> if (_84 == _86)
> > >>>>>>>>   goto <bb 56>; [94.50%]
> > >>>>>>>> else
> > >>>>>>>>   goto <bb 87>; [5.50%]
> > >>>>>>>>
> > >>>>>>>> Disabling it produces:
> > >>>>>>>>
> > >>>>>>>> <bb 32> [local count: 70988443]:
> > >>>>>>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> > >>>>>>>> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> > >>>>>>>> ivtmp.30_415 = ivtmp.30_414 + 1;
> > >>>>>>>> if (_84 == _86)
> > >>>>>>>>   goto <bb 56>; [94.50%]
> > >>>>>>>> else
> > >>>>>>>>   goto <bb 87>; [5.50%]
> > >>>>>>>>
> > >>>>>>>> Then the later loop unroll pass can benefit from the same address offset
> > >>>>>>>> with different base addresses, which reduces register dependencies.
> > >>>>>>>> This patch improves performance by 10% for a typical case on Power; no
> > >>>>>>>> performance change is observed on X86 or AArch64, since such small loops
> > >>>>>>>> are not unrolled on those platforms.  Any comments?
> > >>>>>>>
> > >>>>>>> The case you quote is special in that if we hoisted the IV update
> > >>>>>>> before the other MEM _also_ used in the condition it would be fine
> > >>>>>>> again.
> > >>>>>>
> > >>>>>> Thanks.  I tried hoisting the IV update statement before the first MEM
> > >>>>>> (Fix 2); it shows even worse performance because the loop is not
> > >>>>>> unrolled (two more "base-1" statements are generated in GIMPLE, so
> > >>>>>> loop->ninsns is 11 and it is no longer treated as a small loop to
> > >>>>>> unroll).  Changing the threshold from 10 to 12 in
> > >>>>>> rs6000_loop_unroll_adjust makes it unroll 2 times as well, but then the
> > >>>>>> performance is the SAME as with the IV update statement in the *MIDDLE*
> > >>>>>> (trunk).  From the ASM, we can see the index register %r4 is used in two
> > >>>>>> iterations, which may be a bottleneck for hiding instruction latency?
> > >>>>>>
> > >>>>>> Then it seems reasonable that performance would be better if we keep the
> > >>>>>> IV update statement *LAST* (Fix 1).
> > >>>>>>
> > >>>>>> (Fix 2):
> > >>>>>>      <bb 32> [local count: 70988443]:
> > >>>>>>      ivtmp.30_415 = ivtmp.30_414 + 1;
> > >>>>>>      _34 = ip_229 + 18446744073709551615;
> > >>>>>>      _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> > >>>>>>      _33 = ref_180 + 18446744073709551615;
> > >>>>>>      _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> > >>>>>>      if (_84 == _86)
> > >>>>>>        goto <bb 56>; [94.50%]
> > >>>>>>      else
> > >>>>>>        goto <bb 87>; [5.50%]
> > >>>>>>
> > >>>>>>
> > >>>>>> .L67:
> > >>>>>>            lbzx %r12,%r24,%r4
> > >>>>>>            lbzx %r25,%r7,%r4
> > >>>>>>            cmpw %cr0,%r12,%r25
> > >>>>>>            bne %cr0,.L11
> > >>>>>>            mr %r26,%r4
> > >>>>>>            addi %r4,%r4,1
> > >>>>>>            lbzx %r12,%r24,%r4
> > >>>>>>            lbzx %r25,%r7,%r4
> > >>>>>>            mr %r6,%r26
> > >>>>>>            cmpw %cr0,%r12,%r25
> > >>>>>>            bne %cr0,.L11
> > >>>>>>            mr %r26,%r4
> > >>>>>> .L12:
> > >>>>>>            cmpdi %cr0,%r10,1
> > >>>>>>            addi %r4,%r26,1
> > >>>>>>            mr %r6,%r26
> > >>>>>>            addi %r10,%r10,-1
> > >>>>>>            bne %cr0,.L67
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Now, adjust_iv_update_pos doesn't seem to check that the
> > >>>>>>> condition actually uses the IV use stmt def, so it likely applies to
> > >>>>>>> too many cases.
> > >>>>>>>
> > >>>>>>> Unfortunately the revision that introduced it didn't come with a
> > >>>>>>> testcase, but I still think fixing up adjust_iv_update_pos is better
> > >>>>>>> than introducing a way to short-circuit it per target decision.
> > >>>>>>>
> > >>>>>>> One "fix" might be to add a check that either the condition
> > >>>>>>> lhs or rhs is the def of the IV use and the other operand
> > >>>>>>> is invariant.  Or if it's of similar structure hoist across the
> > >>>>>>> other iv-use as well.  Not that I understand the argument
> > >>>>>>> about the overlapping life-range.
> > >>>>>>>
> > >>>>>>> You also don't provide a complete testcase ...
> > >>>>>>>
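For reference, a minimal sketch of the check suggested above (condition LHS or
RHS is the def of the IV use and the other operand is loop-invariant), written
against GCC's internal API; the helper name cond_uses_iv_def_only_p is made up
here and the snippet is untested:

/* Return true if the GIMPLE_COND STMT compares the value defined by
   USE_STMT (the IV use being rewritten) against a loop-invariant
   operand, i.e. the case where moving the IV update past the
   condition's other operand looks safe.  */
static bool
cond_uses_iv_def_only_p (class loop *loop, gcond *stmt, gimple *use_stmt)
{
  tree lhs = gimple_cond_lhs (stmt);
  tree rhs = gimple_cond_rhs (stmt);
  tree def = gimple_get_lhs (use_stmt);

  if (lhs == def)
    return expr_invariant_in_loop_p (loop, rhs);
  if (rhs == def)
    return expr_invariant_in_loop_p (loop, lhs);
  return false;
}

adjust_iv_update_pos could then bail out early when this predicate is false.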
> > >>>>>>
> > >>>>>> Attached is the test code; I will also add it to the patch in a future version.
> > >>>>>> The issue comes from a very small hot loop:
> > >>>>>>
> > >>>>>>            do {
> > >>>>>>              len++;
> > >>>>>>            } while(len < maxlen && ip[len] == ref[len]);
> > >>>>>
> > >>>>> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> > >>>>> {
> > >>>>>      unsigned int len = 2;
> > >>>>>      do {
> > >>>>>          len++;
> > >>>>>      }while(len < maxlen && ip[len] == ref[len]);
> > >>>>>      return len;
> > >>>>> }
> > >>>>>
> > >>>>> I can see the effect on this loop on x86_64 as well; we end up with
> > >>>>>
> > >>>>> .L6:
> > >>>>>            movzbl  (%rdi,%rax), %ecx
> > >>>>>            addq    $1, %rax
> > >>>>>            cmpb    -1(%rsi,%rax), %cl
> > >>>>>            jne     .L1
> > >>>>> .L3:
> > >>>>>            movl    %eax, %r8d
> > >>>>>            cmpl    %edx, %eax
> > >>>>>            jb      .L6
> > >>>>>
> > >>>>> but without the trick it is
> > >>>>>
> > >>>>> .L6:
> > >>>>>            movzbl  (%rdi,%rax), %r8d
> > >>>>>            movzbl  (%rsi,%rax), %ecx
> > >>>>>            addq    $1, %rax
> > >>>>>            cmpb    %cl, %r8b
> > >>>>>            jne     .L1
> > >>>>> .L3:
> > >>>>>            movl    %eax, %r9d
> > >>>>>            cmpl    %edx, %eax
> > >>>>>            jb      .L6
> > >>>>
> > >>>> I verified this small piece of code on X86; there is no performance
> > >>>> change with or without adjust_iv_update_pos (I checked that the ASM is
> > >>>> exactly the same as yours):
> > >>>>
> > >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
> > >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out  1
> > >>>>
> > >>>> real    0m7.003s
> > >>>> user    0m6.996s
> > >>>> sys     0m0.000s
> > >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
> > >>>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out  1
> > >>>>
> > >>>> real    0m7.070s
> > >>>> user    0m7.068s
> > >>>> sys     0m0.000s
> > >>>>
> > >>>>
> > >>>> But for AArch64, current GCC generates similar code with or without
> > >>>> adjust_iv_update_pos, and the runtime is 10.705s for both.
> > >>>>
> > >>>> .L6:
> > >>>>           ldrb    w4, [x6, x3]
> > >>>>           add     x3, x3, 1
> > >>>>           ldrb    w5, [x1, x3]
> > >>>>           cmp     w5, w4
> > >>>>           bne     .L1
> > >>>> .L3:
> > >>>>           mov     w0, w3
> > >>>>           cmp     w2, w3
> > >>>>           bhi     .L6
> > >>>>
> > >>>> Without adjust_iv_update_pos:
> > >>>>
> > >>>> .L6:
> > >>>>           ldrb    w5, [x6, x3]
> > >>>>           ldrb    w4, [x1, x3]
> > >>>>           add     x3, x3, 1
> > >>>>           cmp     w5, w4
> > >>>>           bne     .L1
> > >>>> .L3:
> > >>>>           mov     w0, w3
> > >>>>           cmp     w2, w3
> > >>>>           bhi     .L6
> > >>>>
> > >>>>
> > >>>> When built with an old GCC (gcc version 7.4.1 20190424), it generates
> > >>>> worse code and the runtime is 11.664s:
> > >>>>
> > >>>> .L6:
> > >>>>           ldrb    w4, [x6, x3]
> > >>>>           add     x3, x3, 1
> > >>>>           add     x5, x1, x3
> > >>>>           ldrb    w5, [x5, -1]
> > >>>>           cmp     w5, w4
> > >>>>           bne     .L1
> > >>>> .L3:
> > >>>>           cmp     w3, w2
> > >>>>           mov     w0, w3
> > >>>>           bcc     .L6
> > >>>>
> > >>>>
> > >>>> In general, only Power shows a performance regression with
> > >>>> adjust_iv_update_pos; that's why I tried to add a target hook for it.
> > >>>> Is this reasonable?  Or should we just remove adjust_iv_update_pos,
> > >>>> since it doesn't help performance on X86 or other targets?
> > >>>
> > >>> It was Bin who added the functionality - maybe he can remember the cases
> > >>> he saw improvements for.
> > >>>
> > >>> I think we should instead try to change the code so that it doesn't
> > >>> apply to CONDs where both operands are defined by stmts using the IV?
> > >>> And this time add a testcase ;)
> > >>
> > >> Thanks.  I am not sure whether we should check both cond_lhs and cond_rhs,
> > >> or whether checking just cond_lhs is enough if the COND expr's LHS is
> > >> always processed first?
> > >>
> > >> For example, adjust_iv_update_pos will be called twice for this case;
> > >> after the first call, the GIMPLE is updated to:
> > >>
> > >> <bb 4> [local count: 1014686026]:
> > >> _1 = (sizetype) len_8;
> > >> _2 = ip_10(D) + _1;
> > >> _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
> > >> _5 = ref_12(D) + _1;
> > >> _6 = *_5;
> > >> ivtmp.15_16 = ivtmp.15_15 + 1;
> > >> if (_3 == _6)
> > >>    goto <bb 6>; [94.50%]
> > >> else
> > >>    goto <bb 10>; [5.50%]
> > >>
> > >>
> > >> At this moment, use->stmt is "_6 = *_5;", which does not use the IV directly?
> > >
> > > So looking around I think you want to pass down the IV group from
> > > rewrite_groups to rewrite_use_address and then see whether the
> > > definition of the RHS is one of the IV USE stmts that are going to be
> > > expressed using the same candidate.  In fact, I guess (to avoid linearly
> > > searching the group) we likely want to restrict adjust_iv_update_pos to
> > > IV candidates with a single [address] use?  That would be more
> > > restrictive of course (OTOH the list shouldn't be too big - hopefully).
> > > In case the other side of the conditional is a memory access using a
> > > different IV, the replacement should still be OK.
> > >
> > > Richard.
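A rough sketch of the single-[address]-use restriction mentioned above,
assuming the iv_group layout in current tree-ssa-loop-ivopts.c (uses stored in
group->vuses); the predicate name is made up and this is illustrative only,
not a tested patch:

/* Return true if GROUP has a single use, so the other operand of the
   condition cannot be another use that will be rewritten with the same
   candidate.  */
static bool
group_has_single_use_p (struct iv_group *group)
{
  return group->vuses.length () == 1;
}

rewrite_use_address would then call adjust_iv_update_pos only when this holds
(or, for the more precise variant, walk group->vuses and check whether the
definition statement of the condition's other operand is among them).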
> >
> > Thanks, however those methods would still break the X86 optimization and
> > produce worse ASM (one more movzbl), though no performance change is
> > observed.  CCed the X86 maintainers to see which one they prefer, and
> > whether a target hook is a better solution?
> >
> >
> > GCC trunk:
> >
> > .L6:
> >           movzbl  (%rdi,%rax), %ecx
> >           addq    $1, %rax
> >           cmpb    -1(%rsi,%rax), %cl
> >           jne     .L1
> > .L3:
> >           movl    %eax, %r8d
> >           cmpl    %edx, %eax
> >           jb      .L6
> >
> > With adjust_iv_update_pos disabled:
> >
> > .L6:
> >           movzbl  (%rdi,%rax), %r8d
> >           movzbl  (%rsi,%rax), %ecx
> >           addq    $1, %rax
> >           cmpb    %cl, %r8b
> >           jne     .L1
> > .L3:
> >           movl    %eax, %r9d
> >           cmpl    %edx, %eax
> >           jb      .L6
> >
> The latency should be the same:
>   movzbl mem, reg   5
>   cmp reg8, reg8    1
>   total             6
> vs.
>   cmp mem8, reg8    6
>
> Guess they're the same at the uops level.

OTOH the dependence on the add on trunk might tip it to the worse side;
when adjust_iv_update_pos is disabled, the add can execute independently
(OTOH the AGU + add might compete for the same resources).

Richard.

> CC uros, any opinion?
>
> >
> >
> > --
> > Thanks,
> > Xionghu
>
>
>
> --
> BR,
> Hongtao
