On Fri, Sep 16, 2011 at 4:00 AM, Jiangning Liu <jiangning....@arm.com> wrote: > Hi Richard, > > I slightly changed the case to be like below, > > int f(char *t) { > int s=0; > > while (*t && s != 1) { > switch (s) { > case 0: /* path 1 */ > s = 2; > break; > case 2: /* path 2 */ > s = 3; /* changed */ > break; > default: /* path 3 */ > if (*t == '-') > s = 2; > break; > } > t++; > } > > return s; > } > > "-O2" is still worse than "-O2 -fno-tree-pre". > > "-O2 -fno-tree-pre" result is > > f: > pushl %ebp > xorl %eax, %eax > movl %esp, %ebp > movl 8(%ebp), %edx > movzbl (%edx), %ecx > jmp .L14 > .p2align 4,,7 > .p2align 3 > .L5: > movl $2, %eax > .L7: > addl $1, %edx > cmpl $1, %eax > movzbl (%edx), %ecx > je .L3 > .L14: > testb %cl, %cl > je .L3 > testl %eax, %eax > je .L5 > cmpl $2, %eax > .p2align 4,,5 > je .L17 > cmpb $45, %cl > .p2align 4,,5 > je .L5 > addl $1, %edx > cmpl $1, %eax > movzbl (%edx), %ecx > jne .L14 > .p2align 4,,7 > .p2align 3 > .L3: > popl %ebp > .p2align 4,,2 > ret > .p2align 4,,7 > .p2align 3 > .L17: > movb $3, %al > .p2align 4,,3 > jmp .L7 > > While "-O2" result is > > f: > pushl %ebp > xorl %eax, %eax > movl %esp, %ebp > movl 8(%ebp), %edx > pushl %ebx > movzbl (%edx), %ecx > jmp .L14 > .p2align 4,,7 > .p2align 3 > .L5: > movl $1, %ebx > movl $2, %eax > .L7: > addl $1, %edx > testb %bl, %bl > movzbl (%edx), %ecx > je .L3 > .L14: > testb %cl, %cl > je .L3 > testl %eax, %eax > je .L5 > cmpl $2, %eax > .p2align 4,,5 > je .L16 > cmpb $45, %cl > .p2align 4,,5 > je .L5 > cmpl $1, %eax > setne %bl > addl $1, %edx > testb %bl, %bl > movzbl (%edx), %ecx > jne .L14 > .p2align 4,,7 > .p2align 3 > .L3: > popl %ebx > popl %ebp > ret > .p2align 4,,7 > .p2align 3 > .L16: > movl $1, %ebx > movb $3, %al > jmp .L7 > > You may notice that register ebx is introduced, and some more instructions > around ebx are generated as well. i.e. > > setne %bl > testb %bl, %bl > > I agree with you that in theory PRE does the right thing to minimize the > computation cost on gimple level. However, the problem is the cost of > converting comparison result to a bool value is not considered, so it > actually makes binary code worse. For this case, as I summarized below, to > complete the same functionality "With PRE" is worse than "Without PRE" for > all three paths, > > * Without PRE, > > Path1: > movl $2, %eax > cmpl $1, %eax > je .L3 > > Path2: > movb $3, %al > cmpl $1, %eax > je .L3 > > Path3: > cmpl $1, %eax > jne .L14 > > * With PRE, > > Path1: > movl $1, %ebx > movl $2, %eax > testb %bl, %bl > je .L3 > > Path2: > movl $1, %ebx > movb $3, %al > testb %bl, %bl > je .L3 > > Path3: > cmpl $1, %eax > setne %bl > testb %bl, %bl > jne .L14 > > Do you have any more thoughts?
It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile-time considering the preceeding movl $1, %ebx. Am I missing something? Richard. > Thanks, > -Jiangning > >> -----Original Message----- >> From: Richard Guenther [mailto:richard.guent...@gmail.com] >> Sent: Tuesday, August 02, 2011 5:23 PM >> To: Jiangning Liu >> Cc: gcc@gcc.gnu.org >> Subject: Re: A case that PRE optimization hurts performance >> >> On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu <jiangning....@arm.com> >> wrote: >> > Hi, >> > >> > For the following simple test case, PRE optimization hoists >> computation >> > (s!=1) into the default branch of the switch statement, and finally >> causes >> > very poor code generation. This problem occurs in both X86 and ARM, >> and I >> > believe it is also a problem for other targets. >> > >> > int f(char *t) { >> > int s=0; >> > >> > while (*t && s != 1) { >> > switch (s) { >> > case 0: >> > s = 2; >> > break; >> > case 2: >> > s = 1; >> > break; >> > default: >> > if (*t == '-') >> > s = 1; >> > break; >> > } >> > t++; >> > } >> > >> > return s; >> > } >> > >> > Taking X86 as an example, with option "-O2" you may find 52 >> instructions >> > generated like below, >> > >> > 00000000 <f>: >> > 0: 55 push %ebp >> > 1: 31 c0 xor %eax,%eax >> > 3: 89 e5 mov %esp,%ebp >> > 5: 57 push %edi >> > 6: 56 push %esi >> > 7: 53 push %ebx >> > 8: 8b 55 08 mov 0x8(%ebp),%edx >> > b: 0f b6 0a movzbl (%edx),%ecx >> > e: 84 c9 test %cl,%cl >> > 10: 74 50 je 62 <f+0x62> >> > 12: 83 c2 01 add $0x1,%edx >> > 15: 85 c0 test %eax,%eax >> > 17: 75 23 jne 3c <f+0x3c> >> > 19: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi >> > 20: 0f b6 0a movzbl (%edx),%ecx >> > 23: 84 c9 test %cl,%cl >> > 25: 0f 95 c0 setne %al >> > 28: 89 c7 mov %eax,%edi >> > 2a: b8 02 00 00 00 mov $0x2,%eax >> > 2f: 89 fb mov %edi,%ebx >> > 31: 83 c2 01 add $0x1,%edx >> > 34: 84 db test %bl,%bl >> > 36: 74 2a je 62 <f+0x62> >> > 38: 85 c0 test %eax,%eax >> > 3a: 74 e4 je 20 <f+0x20> >> > 3c: 83 f8 02 cmp $0x2,%eax >> > 3f: 74 1f je 60 <f+0x60> >> > 41: 80 f9 2d cmp $0x2d,%cl >> > 44: 74 22 je 68 <f+0x68> >> > 46: 0f b6 0a movzbl (%edx),%ecx >> > 49: 83 f8 01 cmp $0x1,%eax >> > 4c: 0f 95 c3 setne %bl >> > 4f: 89 df mov %ebx,%edi >> > 51: 84 c9 test %cl,%cl >> > 53: 0f 95 c3 setne %bl >> > 56: 89 de mov %ebx,%esi >> > 58: 21 f7 and %esi,%edi >> > 5a: eb d3 jmp 2f <f+0x2f> >> > 5c: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi >> > 60: b0 01 mov $0x1,%al >> > 62: 5b pop %ebx >> > 63: 5e pop %esi >> > 64: 5f pop %edi >> > 65: 5d pop %ebp >> > 66: c3 ret >> > 67: 90 nop >> > 68: b8 01 00 00 00 mov $0x1,%eax >> > 6d: 5b pop %ebx >> > 6e: 5e pop %esi >> > 6f: 5f pop %edi >> > 70: 5d pop %ebp >> > 71: c3 ret >> > >> > But with command line option "-O2 -fno-tree-pre", there are only 12 >> > instructions generated, and the code would be very clean like below, >> > >> > 00000000 <f>: >> > 0: 55 push %ebp >> > 1: 31 c0 xor %eax,%eax >> > 3: 89 e5 mov %esp,%ebp >> > 5: 8b 55 08 mov 0x8(%ebp),%edx >> > 8: 80 3a 00 cmpb $0x0,(%edx) >> > b: 74 0e je 1b <f+0x1b> >> > d: 80 7a 01 00 cmpb $0x0,0x1(%edx) >> > 11: b0 02 mov $0x2,%al >> > 13: ba 01 00 00 00 mov $0x1,%edx >> > 18: 0f 45 c2 cmovne %edx,%eax >> > 1b: 5d pop %ebp >> > 1c: c3 ret >> > >> > Do you have any idea about this? >> >> If you look at what PRE does it is clear that it's a profitable >> transformation >> as it reduces the number of instructions computed on the paths where >> s is set to a constant. If you ask for size optimization you won't get >> this transformation. >> >> Now - on the tree level we don't see the if-conversion opportunity >> (we don't have a conditional move instruction), but still we could >> improve phiopt to handle switch statements - not sure what we >> could do with the case in question though which is >> >> <bb 3>: >> switch (s_3) <default: <L3>, case 0: <L11>, case 2: <L10>> >> >> <L11>: >> goto <bb 7> (<L10>); >> >> <L3>: >> if (D.2703_6 == 45) >> goto <bb 6>; >> else >> goto <bb 7> (<L10>); >> >> <bb 6>: >> >> # s_2 = PHI <2(4), 1(3), 1(6), s_3(5)> >> <L10>: >> >> as the condition in at L3 complicates things. >> >> The code is also really really special (and written in an awful >> way ...). >> >> Richard. >> >> > Thanks, >> > -Jiangning >> > >> > >> > >> > > > > > >