https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #7 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Please ignore my previous comment - if we insert a zeroing of the destination
register before each popcnt (and lzcnt), performance is restored.

Original test results:
  unsigned  83886630000  0.848533 sec  24.715  GB/s
  uint64_t  83886630000  1.37436  sec  15.2592 GB/s

Fixed popcnt:
  unsigned  90440370000  0.853753 sec  24.5639 GB/s
  uint64_t  83886630000  0.694458 sec  30.1984 GB/s

Here is the assembly for the 2nd loop:

.L16:
        xorq    %rax, %rax
        popcntq -8(%rdx), %rax
        xorq    %rcx, %rcx
        popcntq (%rdx), %rcx
        addq    %rax, %rcx
        xorq    %rax, %rax
        popcntq 8(%rdx), %rax
        addq    %rcx, %rax
        addq    $32, %rdx
        xorq    %rcx, %rcx
        popcntq -16(%rdx), %rcx
        addq    %rax, %rcx
        addq    %rcx, %r13
        cmpq    %rsi, %rdx
        jne     .L16
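
For illustration only (not part of the report): the same workaround can be expressed at
the source level with GNU C inline asm that xor-zeroes the destination before popcnt,
which breaks the false output dependency described above. The wrapper name below is
hypothetical; this is a minimal sketch, not the proposed compiler fix.

/* Hypothetical wrapper: zero the destination register before popcnt so the
   instruction does not carry a false dependency on the register's old value. */
#include <stdint.h>

static inline uint64_t popcnt64_nodep(uint64_t x)
{
    uint64_t r;
    __asm__ ("xorq %0, %0\n\t"      /* zero destination, breaking the dependency */
             "popcntq %1, %0"       /* population count into the zeroed register */
             : "=&r" (r)            /* early-clobber: written before input is read */
             : "rm" (x));
    return r;
}

Used in the inner benchmark loop, this produces the same xor+popcnt pairs as the
assembly shown above.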
--- Comment #7 from Yuri Rumyantsev <ysrumyan at gmail dot com> --- Please ignore my previous comment - if we insert nullifying of destination register before each popcnt (and lzcnt) performance will restore: original test results: unsigned 83886630000 0.848533 sec 24.715 GB/s uint64_t 83886630000 1.37436 sec 15.2592 GB/s fixed popcnt: unsigned 90440370000 0.853753 sec 24.5639 GB/s uint64_t 83886630000 0.694458 sec 30.1984 GB/s Here is assembly for 2nd loop: .L16: xorq %rax, %rax popcntq -8(%rdx), %rax xorq %rcx, %rcx popcntq (%rdx), %rcx addq %rax, %rcx xorq %rax, %rax popcntq 8(%rdx), %rax addq %rcx, %rax addq $32, %rdx xorq %rcx, %rcx popcntq -16(%rdx), %rcx addq %rax, %rcx addq %rcx, %r13 cmpq %rsi, %rdx jne .L16