On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>
> On 2021/6/25 18:02, Richard Biener wrote:
> > On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
> >>
> >> On 2021/6/25 16:54, Richard Biener wrote:
> >>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
> >>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>
> >>>> From: Xiong Hu Luo <luo...@linux.ibm.com>
> >>>>
> >>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
> >>>> on Power.  For example, it generates mismatched address offsets after
> >>>> adjusting the IV update statement position:
> >>>>
> >>>>   <bb 32> [local count: 70988443]:
> >>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>   _34 = ref_180 + 18446744073709551615;
> >>>>   _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>>>   if (_84 == _86)
> >>>>     goto <bb 56>; [94.50%]
> >>>>   else
> >>>>     goto <bb 87>; [5.50%]
> >>>>
> >>>> Disabling it will produce:
> >>>>
> >>>>   <bb 32> [local count: 70988443]:
> >>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
> >>>>   _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
> >>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>>>   if (_84 == _86)
> >>>>     goto <bb 56>; [94.50%]
> >>>>   else
> >>>>     goto <bb 87>; [5.50%]
> >>>>
> >>>> Then the later loop unrolling pass can benefit from the same address
> >>>> offset with different base addresses, which reduces register
> >>>> dependencies.  This patch could improve performance by 10% for a
> >>>> typical case on Power; no performance change is observed for X86 or
> >>>> AArch64 because such small loops are not unrolled on those platforms.
> >>>> Any comments?
> >>>
> >>> The case you quote is special in that if we hoisted the IV update before
> >>> the other MEM _also_ used in the condition it would be fine again.
> >>
> >> Thanks.  I tried to hoist the IV update statement before the first MEM
> >> (Fix 2); it shows even worse performance because the loop is not
> >> unrolled (two more "base-1" statements are generated in gimple, so
> >> loop->ninsns is 11 and the small-loop unrolling is skipped).  Changing
> >> the threshold from 10 to 12 in rs6000_loop_unroll_adjust would make it
> >> unroll 2 times as well, but the performance is then the SAME as with
> >> the IV update statement in the *MIDDLE* (trunk).  From the ASM, we can
> >> see the index register %r4 is used in two iterations, which may be a
> >> bottleneck for hiding instruction latency?
> >>
> >> Then it seems reasonable that the performance is better if the IV
> >> update statement is kept *LAST* (Fix 1).
> >>
> >> (Fix 2):
> >>   <bb 32> [local count: 70988443]:
> >>   ivtmp.30_415 = ivtmp.30_414 + 1;
> >>   _34 = ip_229 + 18446744073709551615;
> >>   _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
> >>   _33 = ref_180 + 18446744073709551615;
> >>   _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
> >>   if (_84 == _86)
> >>     goto <bb 56>; [94.50%]
> >>   else
> >>     goto <bb 87>; [5.50%]
> >>
> >> .L67:
> >>         lbzx %r12,%r24,%r4
> >>         lbzx %r25,%r7,%r4
> >>         cmpw %cr0,%r12,%r25
> >>         bne %cr0,.L11
> >>         mr %r26,%r4
> >>         addi %r4,%r4,1
> >>         lbzx %r12,%r24,%r4
> >>         lbzx %r25,%r7,%r4
> >>         mr %r6,%r26
> >>         cmpw %cr0,%r12,%r25
> >>         bne %cr0,.L11
> >>         mr %r26,%r4
> >> .L12:
> >>         cmpdi %cr0,%r10,1
> >>         addi %r4,%r26,1
> >>         mr %r6,%r26
> >>         addi %r10,%r10,-1
> >>         bne %cr0,.L67
> >>
> >>> Now, adjust_iv_update_pos doesn't seem to check that the
> >>> condition actually uses the IV use stmt def, so it likely applies to
> >>> too many cases.
> >>>
> >>> Unfortunately the introducing rev didn't come with a testcase,
> >>> but still I think fixing up adjust_iv_update_pos is better than
> >>> introducing a way to short-cut it per target decision.
> >>>
> >>> One "fix" might be to add a check that either the condition
> >>> lhs or rhs is the def of the IV use and the other operand
> >>> is invariant.  Or if it's of similar structure hoist across the
> >>> other iv-use as well.  Not that I understand the argument
> >>> about the overlapping life-range.
> >>>
> >>> You also don't provide a complete testcase ...
> >>
> >> Attached the test code; I will also add it to the patch in a future
> >> version.  The issue comes from a very small hot loop:
> >>
> >>   do {
> >>     len++;
> >>   } while (len < maxlen && ip[len] == ref[len]);
> >
> > unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> > {
> >   unsigned int len = 2;
> >   do {
> >     len++;
> >   } while (len < maxlen && ip[len] == ref[len]);
> >   return len;
> > }
> >
> > I can see the effect on this loop on x86_64 as well, we end up with
> >
> > .L6:
> >         movzbl  (%rdi,%rax), %ecx
> >         addq    $1, %rax
> >         cmpb    -1(%rsi,%rax), %cl
> >         jne     .L1
> > .L3:
> >         movl    %eax, %r8d
> >         cmpl    %edx, %eax
> >         jb      .L6
> >
> > but without the trick it is
> >
> > .L6:
> >         movzbl  (%rdi,%rax), %r8d
> >         movzbl  (%rsi,%rax), %ecx
> >         addq    $1, %rax
> >         cmpb    %cl, %r8b
> >         jne     .L1
> > .L3:
> >         movl    %eax, %r9d
> >         cmpl    %edx, %eax
> >         jb      .L6
>
> Verified this small piece of code on X86: there is no performance change
> with or without adjust_iv_update_pos (I've checked that the ASM is
> exactly the same as yours):
>
> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
>
> real    0m7.003s
> user    0m6.996s
> sys     0m0.000s
> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
>
> real    0m7.070s
> user    0m7.068s
> sys     0m0.000s
>
> But for AArch64, current GCC also generates similar code with or without
> adjust_iv_update_pos, and the runtime is 10.705s for both:
>
> .L6:
>         ldrb    w4, [x6, x3]
>         add     x3, x3, 1
>         ldrb    w5, [x1, x3]
>         cmp     w5, w4
>         bne     .L1
> .L3:
>         mov     w0, w3
>         cmp     w2, w3
>         bhi     .L6
>
> No adjust_iv_update_pos:
>
> .L6:
>         ldrb    w5, [x6, x3]
>         ldrb    w4, [x1, x3]
>         add     x3, x3, 1
>         cmp     w5, w4
>         bne     .L1
> .L3:
>         mov     w0, w3
>         cmp     w2, w3
>         bhi     .L6
>
> While built with an old GCC (gcc version 7.4.1 20190424), it generates
> worse code and the runtime is 11.664s:
>
> .L6:
>         ldrb    w4, [x6, x3]
>         add     x3, x3, 1
>         add     x5, x1, x3
>         ldrb    w5, [x5, -1]
>         cmp     w5, w4
>         bne     .L1
> .L3:
>         cmp     w3, w2
>         mov     w0, w3
>         bcc     .L6
>
> In general, only Power shows a performance regression with
> adjust_iv_update_pos; that's why I tried to add a target hook for it.  Is
> this reasonable?  Or should we just remove adjust_iv_update_pos since it
> doesn't help performance on X86 or other targets?
It was Bin who added the functionality - maybe he can remember the cases
he saw improvements for.  I think we should instead try to change the code
so that it doesn't apply to CONDs where both operands are defined by stmts
using the IV?  And this time add a testcase ;)

Richard.

>
> test.c
>
> #include <stdlib.h>
> __attribute__((noinline)) unsigned int
> foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
> {
>   unsigned int len = 2;
>   do {
>     len++;
>   } while (len < maxlen && ip[len] == ref[len]);
>   return len;
> }
>
> int main (int argc, char* argv[])
> {
>   unsigned char string_a[] =
>     "abcdefghijklmnopqrstuvwxyzmnpppppppppppaaaaaaabbbbbbeeee";
>   unsigned char string_b[] =
>     "abcdefghijklmnopqrstuvwxyzmnpppppppppppaaaaaaabbbbbbeene";
>   unsigned long ret = 0;
>
>   for (long i = 0; i < atoi(argv[1]) * 100000000; i++)
>     ret += foo (string_a, string_b, sizeof(string_a));
>   return ret;
> }
>
> > so here you can see the missed fusion.  Of course
> > in this case the IV update could have been sunk into
> > the .L3 block and replicated on the exit edge as well.
> >
> > I'm not sure if the motivation for the change introducing this
> > trick was the above kind of combination or not, but I guess
> > so.  The dependence distance of the IV increment to the
> > use is now shorter, so I'm not sure the combined variant is
> > better.
> >
> > Richard.
> >
> >> --
> >> Thanks,
> >> Xionghu
>
> --
> Thanks,
> Xionghu
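
To make the suggestion above concrete, here is a rough sketch of the kind of
predicate adjust_iv_update_pos could consult before moving the IV increment.
It is only an illustration, not an actual patch: the helper name and its
exact call site are assumptions, while gimple_cond_lhs/gimple_cond_rhs and
expr_invariant_in_loop_p are existing functions available in the
tree-ssa-loop-ivopts environment.

/* Sketch only (hypothetical helper, not an actual GCC change): return true
   if COND compares USE_DEF, the value defined by the IV use statement,
   against something invariant in LOOP - the situation where repositioning
   the IV increment can pay off.  */

static bool
cond_ok_for_iv_update_adjust_p (class loop *loop, gcond *cond, tree use_def)
{
  tree lhs = gimple_cond_lhs (cond);
  tree rhs = gimple_cond_rhs (cond);

  /* One operand of the condition must be the def of the IV use stmt and
     the other must not vary within the loop.  In the quoted <bb 32> the
     other operand is itself a load based on the same IV, so the invariance
     test fails and the transformation is refused.  */
  if (lhs == use_def)
    return expr_invariant_in_loop_p (loop, rhs);
  if (rhs == use_def)
    return expr_invariant_in_loop_p (loop, lhs);

  /* The condition doesn't use the IV use's def at all; nothing is gained
     by repositioning the increment, so refuse it as well.  */
  return false;
}

Whether the rejected cases should instead hoist the increment across the
other IV use, as also suggested earlier in the thread, is left open in this
sketch.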