https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #7 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Please ignore my previous comment: if we zero the destination register
before each popcnt (and lzcnt), performance is restored:

original test results:

unsigned        83886630000     0.848533 sec    24.715 GB/s
uint64_t        83886630000     1.37436 sec     15.2592 GB/s

fixed popcnt:

unsigned        90440370000     0.853753 sec    24.5639 GB/s
uint64_t        83886630000     0.694458 sec    30.1984 GB/s

Here is assembly for 2nd loop:

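.L16:
    xorq    %rax, %rax              # zero destination before popcnt
    popcntq    -8(%rdx), %rax
    xorq    %rcx, %rcx              # zero destination before popcnt
    popcntq    (%rdx), %rcx
    addq    %rax, %rcx
    xorq    %rax, %rax              # zero destination before popcnt
    popcntq    8(%rdx), %rax
    addq    %rcx, %rax
    addq    $32, %rdx               # advance by four 64-bit words
    xorq    %rcx, %rcx              # zero destination before popcnt
    popcntq    -16(%rdx), %rcx      # fourth word (offset is relative to the updated %rdx)
    addq    %rax, %rcx
    addq    %rcx, %r13              # add the four counts to %r13
    cmpq    %rsi, %rdx
    jne    .L16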
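
For anyone who wants to try the same idea from C without patching the compiler, here is a minimal sketch. It is not the proposed GCC change. The names popcnt_zeroed and popcnt_sum are hypothetical, and the inline-asm path assumes GCC or Clang on x86-64 with a CPU that supports popcnt; other targets fall back to __builtin_popcountll. It only shows the pattern of clearing the destination register right before popcnt, as in the loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count set bits, clearing the destination first. */
static inline uint64_t popcnt_zeroed(uint64_t x)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t r;
    __asm__("xorl %k0, %k0\n\t"   /* 32-bit xor clears the whole 64-bit register */
            "popcntq %1, %0"      /* r = number of set bits in x */
            : "=&r"(r)            /* early clobber: r is written before x is read */
            : "rm"(x)
            : "cc");
    return r;
#else
    /* Assumed fallback for other targets: let the compiler pick the code. */
    return (uint64_t)__builtin_popcountll(x);
#endif
}

/* Hypothetical driver: sum the bit counts of n 64-bit words. */
static uint64_t popcnt_sum(const uint64_t *words, size_t n)
{
    assert(words != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += popcnt_zeroed(words[i]);
    return total;
}
```

The early-clobber constraint matters here: without it the compiler could assign the input and the output to the same register, and the xor would destroy the input before popcnt reads it.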
