https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275
--- Comment #8 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Peter Cordes from comment #0)
> I wrote it in that ugly way initially because I was basically porting my ASM
> code to intrinsics.  BTW, the results were terrible.  gcc generates
> ridiculously bad code for getting the src bytes into zero-extended 64bit
> regs, for use as scaled offsets in an address, compared to
>
>     movzx %dl, %eax
>     movzx %dh, %ebx
>     shr $16, %rdx
>     use rax/rbx
>     movzx %dl, %eax
>     movzx %dh, %ebx
>     shr $16, %rdx
>     use rax/rbx
>     ...
>
> gcc never just shifts the reg holding the src data.  Instead it copies, and
> shifts the copy by $16, $32, or $48.
>
> gcc's code is about 35% slower than the hand-written version, even letting
> it use avx so it doesn't emit useless movdqa instructions when it doesn't
> realize that an old value is no longer needed.  Just un-comment the
> mostly-commented loop body in the testcase (attachment version).
>
> Anyway, slow code is off-topic, this bug is about wrong code!

Please open a new PR for this bug.
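
For reference, here is a minimal C sketch of the kind of source pattern the quoted comment is describing (this is not the attached testcase; the table name, function name, and loop are illustrative only): successive bytes of a 64-bit value are used as zero-extended indices into a lookup table, which the hand-written asm implements by shifting the source register in place.

    #include <stdint.h>

    /* Illustrative only.  The hand-written asm quoted above keeps the source
       in %rdx and does
           movzx %dl, %eax ; movzx %dh, %ebx ; shr $16, %rdx
       i.e. it shifts the source register in place after each pair of bytes.
       The complaint is that GCC instead copies the source and shifts each
       copy by a constant ($16, $32, or $48) for every extraction. */
    uint32_t lookup_bytes(const uint32_t table[256], uint64_t src)
    {
        uint32_t sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += table[src & 0xff];  /* zero-extended byte used as an index
                                          (a scaled offset at the asm level) */
            src >>= 8;                 /* advance to the next byte in place */
        }
        return sum;
    }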