------- Comment #58 from bonzini at gnu dot org 2009-05-06 09:56 -------
Uhm, it's better to run unpatched 4.5 with -O1 -fforward-propagate to get a
fair comparison. Also, I was counting the loop headers, which are not part of
the hot code.
                  4.2 -O1    4.5 -O1 -ffw-prop    4.5 + patch -O1
LOOP 1              181             201                 180
INNER LOOP 1.1      117             118                 113
LOOP 2               27              27                  26
This shows that you should compare the code (you can use direct.i) as compiled
and run with 4.2/-O1 and with 4.5/-O1 -fforward-propagate. This is very
important; otherwise you're comparing apples to oranges.
fwprop creates too much register pressure by computing offsets like these in
the loop header, each of which then occupies its own register:
        leaq    -8(%r12), %rsi
        leaq    8(%r12), %r10
        leaq    -16(%r12), %r9
        leaq    -24(%r12), %rbx
        leaq    -32(%r12), %rbp
        leaq    -40(%r12), %rdi
        leaq    -48(%r12), %r11
        leaq    40(%r12), %rdx
Then, the additional register pressure is causing the bad scheduling we have in
the fast assembly outputs:
        movq    (%rdx), %rax
        movsd   (%rax,%r15,2), %xmm7
        movq    (%rdi), %r15
        movsd   (%rax,%r15,2), %xmm10
        movq    (%rbp), %r15
        movsd   (%rax,%r15,2), %xmm5
        movq    (%rbx), %r15
        movsd   (%rax,%r15,2), %xmm6
        movq    (%r9), %r15
        movsd   (%rax,%r15,2), %xmm15
        movq    (%rsi), %r15
        movsd   (%rax,%r15,2), %xmm11
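
For illustration only, here is a rough source-level analogy of the two
addressing styles. It is a hypothetical sketch: the function names, the data
layout and the three offsets are made up and are not taken from direct.i, and
a compiler is free to generate the same code for both functions. The first
function keeps one pointer per offset live across the loop, similar to the
leaq sequence above. The second keeps a single base pointer and folds each
offset into the load.

```c
#include <stddef.h>

/* Hypothetical analogue of the loop-header pattern: each offset address is
   computed once up front, so three pointers stay live for the whole loop.
   idx must point at least 2 elements into its array, with n readable
   elements after it. */
double sum_hoisted(const long *idx, const double *tab, size_t n)
{
    const long *a = idx - 1;   /* compare: leaq  -8(%r12), %rsi */
    const long *b = idx + 1;   /* compare: leaq   8(%r12), %r10 */
    const long *c = idx - 2;   /* compare: leaq -16(%r12), %r9  */
    double s = 0.0;

    for (size_t i = 0; i < n; i++)
        s += tab[a[i]] + tab[b[i]] + tab[c[i]];
    return s;
}

/* Same computation with a single base pointer; the constant offsets can be
   folded into the addressing mode of each load, e.g. -8(%reg). */
double sum_folded(const long *idx, const double *tab, size_t n)
{
    double s = 0.0;

    for (size_t i = 0; i < n; i++) {
        const long *p = idx + i;
        s += tab[p[-1]] + tab[p[1]] + tab[p[-2]];
    }
    return s;
}
```

Both functions return the same value for the same input; the only intended
difference is how many address registers have to stay live inside the loop.
With eight such offsets, as in the loop header above, the hoisted form leaves
far fewer registers for everything else in the loop.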
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928