On Mon, 12 Nov 2012, Bruce Evans wrote:
On Sun, 11 Nov 2012, Dimitry Andric wrote:
It works just fine now with clang. For the first example, I get:
pushl %ebp
movl %esp, %ebp
andl $-32, %esp
as prolog, and for the second:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
Good.
The andl executes very fast. Perhaps not as fast as subl on %esp,
because subl is normal so more likely to be optimized (they nominally
have the same speeds, but %esp is magic). Unfortunately, it seems to
be impossible to both align the stack and reserve some space on it in
1 instruction -- the andl might not reserve any.
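To illustrate the last point, here is a minimal C sketch (not from the
original thread) that models what "andl $-align, %esp" does to a 32-bit
stack pointer value. The function names and the sample addresses are
hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Models "andl $-align, %esp": clear the low bits of a 32-bit stack
 * pointer value. The names here are illustrative, not from the thread.
 */
uint32_t align_down(uint32_t esp, uint32_t align)
{
    /* The alignment must be a nonzero power of two. */
    assert(align != 0 && (align & (align - 1)) == 0);
    /* ~(align - 1) has the same bit pattern as -align in two's complement. */
    return esp & ~(align - 1);
}

/*
 * Bytes that the andl alone happens to move the stack pointer down by:
 * anywhere from 0 to align - 1, depending on the incoming value.
 */
uint32_t bytes_reserved(uint32_t esp, uint32_t align)
{
    return esp - align_down(esp, align);
}
```

With a sample value of 0xbfffe01c and 16-byte alignment the stack pointer
moves down by 12 bytes, but with 0xbfffe010 it does not move at all. So a
function that needs local storage cannot rely on the andl and presumably
still needs a separate subl.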
I lost kib's reply to this. He said something agreeing about %esp
being magic on Intel CPUs starting with PentiumPro.
The following quick test shows no problems on Xeon 5650 (freefall) or
Athlon64:
@ asm(" \n\
@ .globl main \n\
@ main: \n\
@ movl $266681734,%eax \n\
@ # movl $201017002,%eax \n\
@ 1: \n\
@ call foo1 \n\
@ decl %eax \n\
@ jne 1b \n\
@ ret \n\
@ \n\
@ foo1: \n\
@ pushl %ebp \n\
@ movl %esp,%ebp \n\
@ andl $-16,%esp \n\
@ call foo2 \n\
@ movl %ebp,%esp \n\
@ popl %ebp \n\
@ ret \n\
@ \n\
@ foo2: \n\
@ pushl %ebp \n\
@ movl %esp,%ebp \n\
@ andl $-16,%esp \n\
@ call foo3 \n\
@ movl %ebp,%esp \n\
@ popl %ebp \n\
@ ret \n\
@ \n\
@ foo3: \n\
@ pushl %ebp \n\
@ movl %esp,%ebp \n\
@ andl $-16,%esp \n\
@ call foo4 \n\
@ movl %ebp,%esp \n\
@ popl %ebp \n\
@ ret \n\
@ \n\
@ foo4: \n\
@ pushl %ebp \n\
@ movl %esp,%ebp \n\
@ andl $-16,%esp \n\
@ call foo5 \n\
@ movl %ebp,%esp \n\
@ popl %ebp \n\
@ ret \n\
@ \n\
@ foo5: \n\
@ pushl %ebp \n\
@ movl %esp,%ebp \n\
@ andl $-16,%esp \n\
@ call foo6 \n\
@ movl %ebp,%esp \n\
@ popl %ebp \n\
@ ret \n\
@ \n\
@ foo6: \n\
@ pushl %ebp \n\
@ movl %esp,%ebp \n\
@ andl $-16,%esp \n\
@ call foo7 \n\
@ movl %ebp,%esp \n\
@ popl %ebp \n\
@ ret \n\
@ \n\
@ foo7: \n\
@ pushl %ebp \n\
@ movl %esp,%ebp \n\
@ andl $-16,%esp \n\
@ call foo8 \n\
@ movl %ebp,%esp \n\
@ popl %ebp \n\
@ ret \n\
@ \n\
@ foo8: \n\
@ pushl %ebp \n\
@ movl %esp,%ebp \n\
@ andl $-16,%esp \n\
@ # call foo9 \n\
@ movl %ebp,%esp \n\
@ popl %ebp \n\
@ ret \n\
@ ");
Build this on an i386 system so that it runs in 32-bit mode.
This takes 56-57 cycles/iteration on Athlon64 and 50-51 cycles/iteration
on X5650. Changing the andls to subls of 16 doesn't change this.
Removing all the andls and subls doesn't change this on Athlon64, but
on X5650 it is 4-5 cycles faster. This shows that the gcc pessimization
is largest on X5650 :-). Adding "pushl %eax; popl %eax" before the
calls to foo[2-8] adds 35-36 cycles/iteration on Athlon64 but only 6-7
on X5650. I know some Athlons don't optimize pushl/popl well (maybe
when they are close together or near a stack pointer change as here).
Apparently Athlon64 is one such.
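The post does not say how the cycle counts were obtained. One plausible
reading, and it is only an assumption, is that the two loop counts in the
test (266681734 and 201017002) are one tenth of each machine's clock
frequency in Hz, so that the run time in seconds times ten gives cycles
per iteration. A minimal C sketch of that conversion, with a hypothetical
function name:

```c
#include <assert.h>

/*
 * Convert a measured run time into cycles per loop iteration.
 * Assumes the clock frequency is known and constant during the run;
 * the function name and parameters are illustrative only.
 */
double cycles_per_iteration(double seconds, double cpu_hz, double iterations)
{
    assert(iterations > 0.0);
    return seconds * cpu_hz / iterations;
}
```

For example, under the assumed 2666817340 Hz clock, a hypothetical run of
5.05 seconds over 266681734 iterations works out to about 50.5
cycles/iteration.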
Bruce