------- Comment #1 from lessen42+gcc at gmail dot com  2009-07-28 22:27 -------
More specifically, on x86_64, gcc-4.4 -O3 -march=core2 -S generates the
following:
_dct2x2dc_dconly:
        movswl  2(%rdi),%edx
        pushq   %rbp
        addw    (%rdi), %dx
        movswl  6(%rdi),%eax
        movq    %rsp, %rbp
        addw    4(%rdi), %ax
        leal    (%rax,%rdx), %ecx
        subw    %ax, %dx
        movw    %cx, (%rdi)
        movw    %dx, 2(%rdi)
        leave
        ret

So it seems the optimizer realizes that no intermediate value needs more than
16 bits. On x86 that lets it use 16-bit memory operands directly, which is
optimal for this case. Other architectures, however, follow the narrowing too
literally and waste instructions truncating intermediate results to 16 bits.
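
To make the narrowing concrete, here is a minimal C sketch of a function that
would plausibly compile to the listing above. It is reconstructed from the
assembly, not copied from the report, and the names dct2x2dc_dconly_wide and
dct2x2dc_dconly_narrow are made up for illustration. The second variant
truncates every intermediate to 16 bits by hand. Both store the same values,
which is why the narrowing is legal, and also why explicit truncation of the
intermediates is redundant on targets with only full-width registers.

```c
#include <stdint.h>

/* Assumed original form: intermediates held in int, truncated only on store. */
static void dct2x2dc_dconly_wide(int16_t d[4])
{
    int d0 = d[0] + d[1];
    int d1 = d[2] + d[3];
    /* Out-of-range conversion to int16_t is implementation-defined in C;
       GCC wraps modulo 2^16. */
    d[0] = (int16_t)(d0 + d1);
    d[1] = (int16_t)(d0 - d1);
}

/* Same computation with every intermediate explicitly truncated to 16 bits. */
static void dct2x2dc_dconly_narrow(int16_t d[4])
{
    uint16_t d0 = (uint16_t)((uint16_t)d[0] + (uint16_t)d[1]);
    uint16_t d1 = (uint16_t)((uint16_t)d[2] + (uint16_t)d[3]);
    d[0] = (int16_t)(uint16_t)(d0 + d1);
    d[1] = (int16_t)(uint16_t)(d0 - d1);
}
```

Only the low 16 bits of each sum and difference reach memory, and the low 16
bits of an add or subtract depend only on the low 16 bits of its inputs, so
the per-step truncations in the second variant cannot change the result.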


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40893
