https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with 2 bytes we get

.L3:
        movzwl  (%rax), %edx
        addq    $3, %rax
        movw    %dx, 8(%rsp)
        movq    8(%rsp), %rdx
        imulq   %rcx, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        cmpq    %rdi, %rax
        jne     .L3

while with 3 bytes we see

.L3:
        movzwl  (%rax), %edx
        addq    $3, %rax
        movw    %dx, 8(%rsp)
        movzbl  -1(%rax), %edx
        movb    %dl, 10(%rsp)
        movq    8(%rsp), %rdx
        imulq   %rcx, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        cmpq    %rdi, %rax
        jne     .L3

while clang outputs

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        movzwl  (%r14,%rcx), %edx
        movzbl  2(%r14,%rcx), %edi
        shlq    $16, %rdi
        orq     %rdx, %rdi
        andq    $-16777216, %rbx        # imm = 0xFFFFFFFFFF000000
        orq     %rdi, %rbx
        movq    %rbx, %rdx
        imulq   %rax, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        addq    $3, %rcx
        cmpq    $999999992, %rcx        # imm = 0x3B9AC9F8
        jb      .LBB0_3

That _looks_ slower.  Are you sure performance isn't dominated by the first
init loop (both GCC and clang vectorize it)?

I notice we spill in the above loop for the bitfield insert where clang uses
register operations.  We refuse to inline the memcpy at the GIMPLE level and
further refuse to optimize it to a BIT_INSERT_EXPR, which would be a
possibility.
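For reference, a minimal sketch of the kind of inner loop under discussion;
the function name, multiplier and bounds are invented here, not taken from
the reporter's testcase.  The 3-byte memcpy into a wider integer is the
partial-word load that GCC routes through the stack while clang keeps it in
registers:

#include <stdint.h>
#include <string.h>

uint64_t sum3(const unsigned char *buf, size_t len, uint64_t mul)
{
    uint64_t sum = 0;
    uint64_t v = 0;                 /* upper bytes stay zero */
    for (size_t i = 0; i + 3 <= len; i += 3) {
        memcpy(&v, buf + i, 3);     /* 3-byte load; GCC spills via 8(%rsp) */
        sum += (v * mul) >> 48;
    }
    return sum;
}

Note that in the GCC version above a word and a byte are stored and then a
qword is immediately reloaded from the same stack slot, a pattern that
usually defeats store-to-load forwarding.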
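What clang emits for the copy is effectively a masked insert into the low
24 bits of the destination register, which is exactly the BIT_INSERT_EXPR
form.  A hypothetical helper showing the equivalent source-level operation
(little-endian assumed):

#include <stdint.h>

static inline uint64_t insert3(uint64_t v, const unsigned char *p)
{
    uint64_t lo = (uint64_t)p[0]
                | (uint64_t)p[1] << 8
                | (uint64_t)p[2] << 16;     /* movzwl + movzbl + shlq/orq */
    return (v & ~(uint64_t)0xffffff) | lo;  /* andq $-16777216, then orq  */
}

If the memcpy were turned into a BIT_INSERT_EXPR on GIMPLE, RTL expansion
could pick this register sequence instead of the stack round trip.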