------- Comment #1 from lessen42+gcc at gmail dot com  2009-07-28 22:27 -------
More specifically, on x86_64 the following is generated with
gcc-4.4 -O3 -march=core2 -S

_dct2x2dc_dconly:
        movswl  2(%rdi),%edx
        pushq   %rbp
        addw    (%rdi), %dx
        movswl  6(%rdi),%eax
        movq    %rsp, %rbp
        addw    4(%rdi), %ax
        leal    (%rax,%rdx), %ecx
        subw    %ax, %dx
        movw    %cx, (%rdi)
        movw    %dx, 2(%rdi)
        leave
        ret

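For readers who want the listing in C terms, here is a rough sketch of what the assembly appears to compute. It is reconstructed from the instructions above and is not copied from the bug report's test case: the flat int16_t d[4] parameter and the _narrow variant are assumptions made for illustration. The second function narrows every intermediate to 16 bits, which is harmless when the final stores keep only the low 16 bits anyway.

```c
#include <stdint.h>

/* Hypothetical reconstruction from the x86_64 listing; d is assumed to
 * point at four contiguous int16_t values. Intermediates are plain int. */
void dct2x2dc_dconly(int16_t d[4])
{
    int d0 = d[0] + d[1];          /* movswl 2(%rdi) ; addw (%rdi)  */
    int d1 = d[2] + d[3];          /* movswl 6(%rdi) ; addw 4(%rdi) */

    /* Only the low 16 bits are stored. Out-of-range conversion to
     * int16_t is implementation-defined in C; GCC reduces modulo 2^16. */
    d[0] = (int16_t)(d0 + d1);     /* leal ; movw %cx, (%rdi)  */
    d[1] = (int16_t)(d0 - d1);     /* subw ; movw %dx, 2(%rdi) */
}

/* Illustrative variant (an assumption, not from the report): each
 * intermediate is explicitly narrowed to 16 bits before further use. */
void dct2x2dc_dconly_narrow(int16_t d[4])
{
    int16_t d0 = (int16_t)(d[0] + d[1]);
    int16_t d1 = (int16_t)(d[2] + d[3]);

    d[0] = (int16_t)(d0 + d1);
    d[1] = (int16_t)(d0 - d1);
}
```

With a compiler that wraps on narrowing, as GCC does, both functions store the same values, so any extra truncation of the intermediates is redundant work.
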
So it seems that the optimizer realizes that you don't need registers larger
than 16 bits, which allows 16-bit operations with memory operands on x86; that
is optimal for this case. However, on other architectures GCC follows this too
literally, wasting instructions to truncate intermediate results to 16 bits.

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40893