https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with 2 bytes we get

.L3:
        movzwl  (%rax), %edx
        addq    $3, %rax
        movw    %dx, 8(%rsp)
        movq    8(%rsp), %rdx
        imulq   %rcx, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        cmpq    %rdi, %rax
        jne     .L3

while with 3 bytes we see

.L3:
        movzwl  (%rax), %edx
        addq    $3, %rax
        movw    %dx, 8(%rsp)
        movzbl  -1(%rax), %edx
        movb    %dl, 10(%rsp)
        movq    8(%rsp), %rdx
        imulq   %rcx, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        cmpq    %rdi, %rax
        jne     .L3

while clang outputs

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        movzwl  (%r14,%rcx), %edx
        movzbl  2(%r14,%rcx), %edi
        shlq    $16, %rdi
        orq     %rdx, %rdi
        andq    $-16777216, %rbx        # imm = 0xFFFFFFFFFF000000
        orq     %rdi, %rbx
        movq    %rbx, %rdx
        imulq   %rax, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        addq    $3, %rcx
        cmpq    $999999992, %rcx        # imm = 0x3B9AC9F8
        jb      .LBB0_3

That _looks_ slower.  Are you sure performance isn't dominated by the first
init loop (both GCC and clang vectorize it)?

I notice we spill in the above loop for the bitfield insert where clang uses
register operations.  We refuse to inline the memcpy at the GIMPLE level and
further refuse to optimize it to a BIT_INSERT_EXPR, which would be a
possibility.
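For reference, a minimal sketch of the kind of inner loop under discussion;
the function name, multiplier and bounds are invented here, not taken from
the reporter's testcase.  The 3-byte memcpy into a wider integer is the
partial-word load that GCC routes through the stack while clang keeps it in
registers:

#include <stdint.h>
#include <string.h>

uint64_t sum3(const unsigned char *buf, size_t len, uint64_t mul)
{
    uint64_t sum = 0;
    uint64_t v = 0;                 /* upper bytes stay zero */
    for (size_t i = 0; i + 3 <= len; i += 3) {
        memcpy(&v, buf + i, 3);     /* 3-byte load; GCC spills via 8(%rsp) */
        sum += (v * mul) >> 48;
    }
    return sum;
}

Note that in the GCC version above a word and a byte are stored and then a
qword is immediately reloaded from the same stack slot, a pattern that
usually defeats store-to-load forwarding.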
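What clang emits for the copy is effectively a masked insert into the low
24 bits of the destination register, which is exactly the BIT_INSERT_EXPR
form.  A hypothetical helper showing the equivalent source-level operation
(little-endian assumed):

#include <stdint.h>

static inline uint64_t insert3(uint64_t v, const unsigned char *p)
{
    uint64_t lo = (uint64_t)p[0]
                | (uint64_t)p[1] << 8
                | (uint64_t)p[2] << 16;     /* movzwl + movzbl + shlq/orq */
    return (v & ~(uint64_t)0xffffff) | lo;  /* andq $-16777216, then orq  */
}

If the memcpy were turned into a BIT_INSERT_EXPR on GIMPLE, RTL expansion
could pick this register sequence instead of the stack round trip.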