On Mon, 12 Nov 2012, Bruce Evans wrote:

On Sun, 11 Nov 2012, Dimitry Andric wrote:

It works just fine now with clang.  For the first example, I get:

       pushl   %ebp
       movl    %esp, %ebp
       andl    $-32, %esp

as prolog, and for the second:

       pushl   %ebp
       movl    %esp, %ebp
       andl    $-16, %esp

Good.

The andl executes very fast, although perhaps not as fast as a subl on
%esp, since subl is the normal case and thus more likely to be optimized
(they nominally have the same speeds, but %esp is magic).  Unfortunately,
it seems to be impossible to both align the stack and reserve some space
on it in one instruction -- the andl might not reserve any.

I lost kib's reply to this.  He said something agreeing about %esp
being magic on Intel CPUs starting with the Pentium Pro.

The following quick test shows no problems on a Xeon X5650 (freefall) or
an Athlon64:

asm("						\n\
.globl main					\n\
main:						\n\
	movl	$266681734,%eax			\n\
	# movl	$201017002,%eax			\n\
1:						\n\
	call	foo1				\n\
	decl	%eax				\n\
	jne	1b				\n\
	ret					\n\
						\n\
foo1:						\n\
	pushl	%ebp				\n\
	movl	%esp,%ebp			\n\
	andl	$-16,%esp			\n\
	call	foo2				\n\
	movl	%ebp,%esp			\n\
	popl	%ebp				\n\
	ret					\n\
						\n\
foo2:						\n\
	pushl	%ebp				\n\
	movl	%esp,%ebp			\n\
	andl	$-16,%esp			\n\
	call	foo3				\n\
	movl	%ebp,%esp			\n\
	popl	%ebp				\n\
	ret					\n\
						\n\
foo3:						\n\
	pushl	%ebp				\n\
	movl	%esp,%ebp			\n\
	andl	$-16,%esp			\n\
	call	foo4				\n\
	movl	%ebp,%esp			\n\
	popl	%ebp				\n\
	ret					\n\
						\n\
foo4:						\n\
	pushl	%ebp				\n\
	movl	%esp,%ebp			\n\
	andl	$-16,%esp			\n\
	call	foo5				\n\
	movl	%ebp,%esp			\n\
	popl	%ebp				\n\
	ret					\n\
						\n\
foo5:						\n\
	pushl	%ebp				\n\
	movl	%esp,%ebp			\n\
	andl	$-16,%esp			\n\
	call	foo6				\n\
	movl	%ebp,%esp			\n\
	popl	%ebp				\n\
	ret					\n\
						\n\
foo6:						\n\
	pushl	%ebp				\n\
	movl	%esp,%ebp			\n\
	andl	$-16,%esp			\n\
	call	foo7				\n\
	movl	%ebp,%esp			\n\
	popl	%ebp				\n\
	ret					\n\
						\n\
foo7:						\n\
	pushl	%ebp				\n\
	movl	%esp,%ebp			\n\
	andl	$-16,%esp			\n\
	call	foo8				\n\
	movl	%ebp,%esp			\n\
	popl	%ebp				\n\
	ret					\n\
						\n\
foo8:						\n\
	pushl	%ebp				\n\
	movl	%esp,%ebp			\n\
	andl	$-16,%esp			\n\
	# call	foo9				\n\
	movl	%ebp,%esp			\n\
	popl	%ebp				\n\
	ret					\n\
");

Build this on an i386 system so that it runs in 32-bit mode.
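One way to build and run it (the file name is an assumption; on an amd64
host you would need -m32 and the 32-bit libraries instead):

```shell
# Hypothetical build recipe: the file name aligntest.c is an assumption.
# On a real i386 system plain "cc" already produces 32-bit code; on an
# amd64 host, add -m32 so the %esp-relative i386 code assembles.
cc -O -o aligntest aligntest.c
time ./aligntest
```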

This takes 56-57 cycles/iteration on the Athlon64 and 50-51
cycles/iteration on the X5650.  Changing the andls to subls of 16 doesn't
change this.  Removing all the andls and subls doesn't change this on the
Athlon64, but on the X5650 it is 4-5 cycles faster.  This shows that the
gcc pessimization is largest on the X5650 :-).  Adding "pushl %eax; popl
%eax" before the calls to foo[2-8] adds 35-36 cycles/iteration on the
Athlon64 but only 6-7 on the X5650.  I know some Athlons don't optimize
pushl/popl well (maybe when they are close together or near a stack
pointer change, as here).  Apparently the Athlon64 is one such.

Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head