------- Comment #22 from xuepeng dot guo at intel dot com  2009-02-11 07:37 -------
(In reply to comment #18)
> Xuepeng, can you test with the loop as produced by my posted patch, that is:
>
> .L11:
>         movaps  (%rsi,%rax), %xmm0
>         addps   %xmm1, %xmm0
>         movaps  %xmm0, (%rdi,%rax)
>         addq    $16, %rax
>         cmpq    %rdx, %rax
>         jne     .L11
>
> I don't have access to new enough chips.
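For reference, the quoted loop is the vectorized body of the benchmark function `_Z7bench_1PfS_fj`, which demangles to `bench_1(float*, float*, float, unsigned int)`. The C below is a hypothetical reconstruction of its source, inferred only from that signature and the loop. Under the x86-64 SysV calling convention the two pointers arrive in %rdi and %rsi. Each iteration therefore loads four floats from the second array, adds the broadcast scalar, and stores four floats to the first array. The parameter names are made up. The real benchmark may differ, for example in how it guarantees the alignment that the aligned `movaps` accesses require.

```c
/* Hypothetical source for bench_1(float*, float*, float, unsigned int),
   reconstructed from the mangled name _Z7bench_1PfS_fj and the loop above.
   Parameter names are illustrative, not taken from the original benchmark. */
void bench_1(float *dst, float *src, float c, unsigned int n)
{
    /* A vectorizer can turn this into a 4-wide loop: movaps load from
       src (%rsi), addps with c broadcast in an xmm register, movaps store
       to dst (%rdi), advancing the byte offset by 16 per iteration. */
    for (unsigned int i = 0; i < n; i++)
        dst[i] = src[i] + c;
}
```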
Your patch improved the performance. My machine is an "Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz". The results are below (three runs of each binary; gcc-44p.out includes your patch):

[xg...@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.990s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.991s
sys     0m0.001s
[xg...@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.989s
sys     0m0.002s
[xg...@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.880s
user    0m1.879s
sys     0m0.001s
[xg...@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.878s
user    0m1.878s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.870s
user    0m1.869s
sys     0m0.002s
[xg...@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.690s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.689s
sys     0m0.002s
[xg...@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.690s
sys     0m0.000s

The only difference between the generated assembly files (44.s vs. 44p.s) is in the inner loop:

--- 44.s        2009-02-11 15:34:57.000000000 +0800
+++ 44p.s       2009-02-11 15:34:49.000000000 +0800
@@ -102,8 +102,8 @@ _Z7bench_1PfS_fj:
        .p2align 4,,10
        .p2align 3
 .L11:
-       movaps  %xmm0, %xmm1
-       addps   (%rsi,%rax), %xmm1
+       movaps  (%rsi,%rax), %xmm1
+       addps   %xmm0, %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
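To put a number on "improved the performance", the sketch below averages the three `real` times reported for each binary and divides the means. The array and function names are illustrative only. The values are copied unchanged from the runs in this comment.

```c
/* Wall-clock ("real") seconds from the three runs of each binary above. */
static const double real_gcc42[3]  = {1.991, 1.991, 1.991};
static const double real_gcc44[3]  = {1.880, 1.878, 1.870};
static const double real_gcc44p[3] = {1.690, 1.690, 1.690};

/* Arithmetic mean of three runs. */
static double mean3(const double t[3])
{
    return (t[0] + t[1] + t[2]) / 3.0;
}

/* Speedup of `fast` relative to `slow`: ratio of mean run times
   (a value above 1 means `fast` finished sooner). */
static double speedup(const double slow[3], const double fast[3])
{
    return mean3(slow) / mean3(fast);
}
```

With these data, `mean3(real_gcc44)` is about 1.876 s and `mean3(real_gcc44p)` is 1.690 s. That makes `speedup(real_gcc44, real_gcc44p)` about 1.11, meaning the patched loop cuts run time by roughly 10% relative to unpatched 4.4. `speedup(real_gcc42, real_gcc44)` is about 1.06.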