https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275

--- Comment #8 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Peter Cordes from comment #0)

> I wrote it in that ugly way initially because I was basically porting my ASM
> code to intrinsics.  BTW, the results were terrible.  gcc generates
> ridiculously bad code for getting the src bytes into zero-extended 64-bit
> regs, for use as scaled offsets in an address, compared to
> 
> movzx %dl, %eax
> movzx %dh, %ebx
> shr   $16, %rdx
> use rax/rbx
> movzx %dl, %eax
> movzx %dh, %ebx
> shr   $16, %rdx
> use rax/rbx
>  ...
> 
>  gcc never just shifts the reg holding the src data.  Instead it copies, and
> shifts the copy by $16, $32, or $48.
> 
>  gcc's code is about 35% slower than the hand-written version, even when
> letting it use AVX so it doesn't emit useless movdqa instructions for values
> it doesn't realize are no longer needed.  Just un-comment the
> mostly-commented loop body in the testcase (attached version).
> 
>  Anyway, slow code is off-topic, this bug is about wrong code!
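
For reference, the access pattern described above looks roughly like the
following C.  This is only a minimal sketch: the table shape, the names, and
the XOR accumulation are placeholders, not the attached testcase.  The point
is that src itself is shifted in place, which the hand-written asm does (two
bytes per shr $16 via %dl/%dh), while gcc instead makes a fresh copy of src
and shifts that copy for every byte it extracts:

#include <stdint.h>

static const uint64_t table[8][256];   /* placeholder: zero-filled lookup table */

uint64_t lookup_all_bytes(uint64_t src)
{
    uint64_t acc = 0;
    for (int i = 0; i < 8; i++) {
        acc ^= table[i][src & 0xff];   /* zero-extended byte as scaled offset */
        src >>= 8;                     /* shift src in place, as the asm does */
    }
    return acc;
}

C can't express the %dl/%dh trick directly, so the sketch peels one byte per
shift; the codegen complaint is only about copying src instead of shifting it.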

Please open a new PR for this bug.
