[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment

rguenth at gcc dot gnu dot org Sun, 25 Apr 2010 13:03:46 -0700


------- Comment #3 from rguenth at gcc dot gnu dot org  2010-04-25 20:03 -------
Well, the innermost loop with current trunk is


.L3:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        movl    %eax, (%esp)
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L3

which is pretty much optimal.  The Intel compiler doesn't detect the
tail recursion (huh) but has multiple entry points into the function
and uses a register-passing convention for the recursive calls.
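
For readers without the testcase at hand, here is a sketch of what the loop
above corresponds to at the source level.  The testcase is not quoted in this
comment, so fib() below is assumed to be the usual doubly recursive definition,
and fib_loop() is a hypothetical name for a hand-written version of the loop
GCC generates: the fib(n - 2) call is turned into an iteration with an
accumulator, and only fib(n - 1) remains a real call.

```c
/* Assumed source: the usual doubly recursive definition
   (the actual testcase is not shown in this comment). */
int fib(int n)
{
    if (n <= 2)
        return 1;
    return fib(n - 1) + fib(n - 2);
}

/* Hand-written equivalent of the generated loop.  The register names
   in the comments refer to the assembly listings in this comment. */
int fib_loop(int n)
{
    int acc = 0;                  /* xorl %esi, %esi */

    if (n <= 2)
        return 1;
    do {
        acc += fib_loop(n - 1);   /* leal -1(%ebx), %eax ; call fib ; addl %eax, %esi */
        n -= 2;                   /* subl $2, %ebx */
    } while (n > 2);              /* cmpl $2, %ebx ; jg */
    return acc + 1;               /* leal 1(%esi), %eax */
}
```

Both functions return the same values; the second just makes explicit that
one of the two recursive calls has become a loop iteration.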

With -fwhole-program (or with fib declared static) GCC does the same, and we
then end up with a program faster than what ICC produces (16s).
A 4.3-compiled version is indeed a bit faster (as fast as 4.4 on i?86, 15.4s).
A 4.1-compiled version is even faster (14.1s); the 3.4 baseline is 21.5s.

Those timings are on i?86-linux, all at -O2.

4.1 assembly, fib is not inlined:

fib:
        pushl   %esi
        pushl   %ebx
        movl    %eax, %ebx
        cmpl    $2, %ebx
        movl    $1, %eax
        jle     .L5
        xorl    %esi, %esi
        .p2align 4,,7
.L6:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L6
        leal    1(%esi), %eax
.L5:
        popl    %ebx
        popl    %esi
        ret

trunk assembly:

fib:
        pushl   %esi
        pushl   %ebx
        movl    %eax, %ebx
        subl    $4, %esp
        cmpl    $2, %ebx
        movl    $1, %eax
        jle     .L2
        xorl    %esi, %esi
        .p2align 4,,7
        .p2align 3
.L3:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L3
        leal    1(%esi), %eax
.L2:
        addl    $4, %esp
        popl    %ebx
        popl    %esi
        ret

where the only differences are the loop alignment and keeping the
stack 16-byte aligned.  Indeed we get the same speed as 4.1 when
building with -mpreferred-stack-boundary=2.  Why do we bother to
keep the stack aligned for leaf functions?
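
To spell out where the extra "subl $4, %esp" comes from, here is a small
sketch of the arithmetic.  It assumes the stack is aligned to the preferred
boundary at each call site, so that on entry to fib the return address (4
bytes) plus the two pushed registers (8 bytes) occupy 12 bytes.  pad_to() and
FRAME_BYTES are names made up for this illustration.

```c
/* Padding needed to round `used` bytes up to a multiple of `align`. */
unsigned pad_to(unsigned used, unsigned align)
{
    return (align - used % align) % align;
}

/* Bytes already on the stack in fib before any padding (i386):
   return address + pushl %esi + pushl %ebx. */
enum { FRAME_BYTES = 4 + 4 + 4 };
```

With the default -mpreferred-stack-boundary=4 (2^4 = 16 bytes),
pad_to(FRAME_BYTES, 16) is 4, which matches the "subl $4, %esp" in the trunk
listing.  With -mpreferred-stack-boundary=2 (2^2 = 4 bytes),
pad_to(FRAME_BYTES, 4) is 0, which matches the 4.1 listing where no
adjustment is emitted.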


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl at gcc dot gnu dot org,
                   |                            |hubicka at gcc dot gnu dot
                   |                            |org
          Component|c++                         |target
 GCC target triplet|                            |i?86-*-*
           Keywords|                            |missed-optimization
      Known to work|                            |4.1.3
            Summary|[4.4/4.5 Regression]        |[4.4/4.5/4.6 Regression]
                   |Performance degradation for |Performance degradation for
                   |simple fibonacci numbers    |simple fibonacci numbers
                   |calculation                 |calculation due to extra
                   |                            |stack alignment
   Target Milestone|---                         |4.4.4


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884

[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
