Re: A case that PRE optimization hurts performance

Richard Guenther Tue, 02 Aug 2011 02:23:06 -0700

On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu <[email protected]> wrote:
> Hi,
>
> For the following simple test case, PRE optimization hoists computation
> (s!=1) into the default branch of the switch statement, and finally causes
> very poor code generation. This problem occurs in both X86 and ARM, and I
> believe it is also a problem for other targets.
>
> int f(char *t) {
>    int s=0;
>
>    while (*t && s != 1) {
>        switch (s) {
>        case 0:
>            s = 2;
>            break;
>        case 2:
>            s = 1;
>            break;
>        default:
>            if (*t == '-')
>                s = 1;
>            break;
>        }
>        t++;
>    }
>
>    return s;
> }
>
> Taking X86 as an example, with option "-O2" you may find 52 instructions
> generated like below,
>
> 00000000 <f>:
>   0:   55                      push   %ebp
>   1:   31 c0                   xor    %eax,%eax
>   3:   89 e5                   mov    %esp,%ebp
>   5:   57                      push   %edi
>   6:   56                      push   %esi
>   7:   53                      push   %ebx
>   8:   8b 55 08                mov    0x8(%ebp),%edx
>   b:   0f b6 0a                movzbl (%edx),%ecx
>   e:   84 c9                   test   %cl,%cl
>  10:   74 50                   je     62 <f+0x62>
>  12:   83 c2 01                add    $0x1,%edx
>  15:   85 c0                   test   %eax,%eax
>  17:   75 23                   jne    3c <f+0x3c>
>  19:   8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
>  20:   0f b6 0a                movzbl (%edx),%ecx
>  23:   84 c9                   test   %cl,%cl
>  25:   0f 95 c0                setne  %al
>  28:   89 c7                   mov    %eax,%edi
>  2a:   b8 02 00 00 00          mov    $0x2,%eax
>  2f:   89 fb                   mov    %edi,%ebx
>  31:   83 c2 01                add    $0x1,%edx
>  34:   84 db                   test   %bl,%bl
>  36:   74 2a                   je     62 <f+0x62>
>  38:   85 c0                   test   %eax,%eax
>  3a:   74 e4                   je     20 <f+0x20>
>  3c:   83 f8 02                cmp    $0x2,%eax
>  3f:   74 1f                   je     60 <f+0x60>
>  41:   80 f9 2d                cmp    $0x2d,%cl
>  44:   74 22                   je     68 <f+0x68>
>  46:   0f b6 0a                movzbl (%edx),%ecx
>  49:   83 f8 01                cmp    $0x1,%eax
>  4c:   0f 95 c3                setne  %bl
>  4f:   89 df                   mov    %ebx,%edi
>  51:   84 c9                   test   %cl,%cl
>  53:   0f 95 c3                setne  %bl
>  56:   89 de                   mov    %ebx,%esi
>  58:   21 f7                   and    %esi,%edi
>  5a:   eb d3                   jmp    2f <f+0x2f>
>  5c:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
>  60:   b0 01                   mov    $0x1,%al
>  62:   5b                      pop    %ebx
>  63:   5e                      pop    %esi
>  64:   5f                      pop    %edi
>  65:   5d                      pop    %ebp
>  66:   c3                      ret
>  67:   90                      nop
>  68:   b8 01 00 00 00          mov    $0x1,%eax
>  6d:   5b                      pop    %ebx
>  6e:   5e                      pop    %esi
>  6f:   5f                      pop    %edi
>  70:   5d                      pop    %ebp
>  71:   c3                      ret
>
> But with command line option "-O2 -fno-tree-pre", there are only 12
> instructions generated, and the code would be very clean like below,
>
> 00000000 <f>:
>   0:   55                      push   %ebp
>   1:   31 c0                   xor    %eax,%eax
>   3:   89 e5                   mov    %esp,%ebp
>   5:   8b 55 08                mov    0x8(%ebp),%edx
>   8:   80 3a 00                cmpb   $0x0,(%edx)
>   b:   74 0e                   je     1b <f+0x1b>
>   d:   80 7a 01 00             cmpb   $0x0,0x1(%edx)
>  11:   b0 02                   mov    $0x2,%al
>  13:   ba 01 00 00 00          mov    $0x1,%edx
>  18:   0f 45 c2                cmovne %edx,%eax
>  1b:   5d                      pop    %ebp
>  1c:   c3                      ret
>
> Do you have any idea about this?


If you look at what PRE does it is clear that it's a profitable transformation
as it reduces the number of instructions computed on the paths where
s is set to a constant.  If you ask for size optimization you won't get
this transformation.

Now - on the tree level we don't see the if-conversion opportunity
(we don't have a conditional move instruction), but still we could
improve phiopt to handle switch statements - not sure what we
could do with the case in question though which is

<bb 3>:
  switch (s_3) <default: <L3>, case 0: <L11>, case 2: <L10>>

<L11>:
  goto <bb 7> (<L10>);

<L3>:
  if (D.2703_6 == 45)
    goto <bb 6>;
  else
    goto <bb 7> (<L10>);

<bb 6>:

  # s_2 = PHI <2(4), 1(3), 1(6), s_3(5)>
<L10>:

as the condition in at L3 complicates things.

The code is also really really special (and written in an awful way ...).

Richard.

> Thanks,
> -Jiangning
>
>
>
>

Re: A case that PRE optimization hurts performance

Reply via email to