For the following code:
------------------------------------------------
#include <stdint.h>

uint8_t data[16];

static __attribute__((noinline)) void test(unsigned i)
{
    unsigned j;
    for (j = 0; j < 16; j++)
        data[j] = ((i + j) & 0xFF00) >> 8;
}
------------------------------------------------
the generated asm looks like this (using -fno-tree-vectorize because of pr40771):

# ./gcc tst2b.c -o tst2.o -O3 -march=k8 -fno-tree-vectorize
------------------------------------------------
test:
.LFB11:
        .cfi_startproc
        movq    %rdi, %rdx
        movzbl  %dh, %eax
        movb    %al, data(%rip)
        leal    1(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+1(%rip)
        leal    2(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+2(%rip)
        leal    3(%rdi), %eax
        movzbl  %ah, %eax
        movb    %al, data+3(%rip)
        .....
------------------------------------------------
When "movzbl %ah, %eax ; movb %al, data+1(%rip)" is replaced by "movb %ah, data+1(%rip)", the code is faster.
(Another issue may be the use of lea even for -march=pentium4, which would probably prefer add eax,1, but I can't verify that.)

# ./gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --enable-languages=c,c++ --prefix=/mnt/svn/gcc-trunk/build/
Thread model: posix
gcc version 4.5.0 20090714 (experimental) (GCC)

CPU is AMD Phenom (4 cores, Barcelona) running at a fixed 1400MHz.
gcc's generated code runs in 19 ticks on average; the code with "movzbl ; mov al" replaced by "mov ah" runs in 16 ticks.

Attached is the whole test code.

-- 
           Summary: generating redundant moves from second byte of 32b/64b register
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: zsojka at seznam dot cz
 GCC host triplet: x86_64-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40772
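
As a side note on why the direct "movb %ah" store should be safe: the expression in test() keeps only bits 8..15 of (i + j), which is exactly the byte a store from %ah writes. The following is a minimal sketch that checks this at the C level. It is not the attached test code; the helper names second_byte_masked, second_byte_truncated, count_mismatches and check_examples are made up for illustration, and it says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* Byte extraction exactly as written in the reported test() loop. */
static uint8_t second_byte_masked(unsigned i, unsigned j)
{
    return ((i + j) & 0xFF00) >> 8;
}

/* Hypothetical equivalent form: shift right by 8 and truncate to one byte.
 * This mirrors what a single byte store of bits 8..15 would write. */
static uint8_t second_byte_truncated(unsigned i, unsigned j)
{
    return (uint8_t)((i + j) >> 8);
}

/* Compare both forms over two sample ranges of i (one low, one that
 * wraps around the top of the unsigned range) and all 16 values of j.
 * Returns the number of mismatches found. */
static unsigned count_mismatches(void)
{
    static const unsigned bases[] = { 0u, 0xFFFF0000u };
    unsigned mismatches = 0;

    for (unsigned b = 0; b < 2; b++)
        for (unsigned k = 0; k < 0x20000u; k++)
            for (unsigned j = 0; j < 16; j++) {
                unsigned i = bases[b] + k; /* unsigned wraparound is well defined */
                if (second_byte_masked(i, j) != second_byte_truncated(i, j))
                    mismatches++;
            }
    return mismatches;
}

/* A few hand-checked values. */
static void check_examples(void)
{
    assert(second_byte_masked(0x1234u, 0) == 0x12);
    assert(second_byte_masked(0x12FFu, 1) == 0x13); /* carry into second byte */
    assert(second_byte_masked(0xFFFFFFFFu, 1) == 0x00); /* sum wraps to 0 */
}
```

Calling check_examples() and checking that count_mismatches() returns 0 from a small main() exercises both forms. Only the sampled ranges are covered, so this is a sanity check rather than a proof.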