On 5/27/23 18:52, Stefan Kanthak wrote:
"Andrew Pinski" <pins...@gmail.com> wrote:
On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote:
Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
--- dontcare.c ---
int ispowerof2(unsigned __int128 argument) {
    return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
}
--- EOF ---
GCC 13.3 gcc -march=haswell -O3
https://gcc.godbolt.org/z/PPzYsPzMc
ispowerof2(unsigned __int128):
popcnt rdi, rdi
popcnt rsi, rsi
add esi, edi
xor eax, eax
cmp esi, 1
sete al
ret
OOPS: what about Intel's CPU errata regarding the false dependency on POPCNT's
output?
Because the popcount is going to the same register, there is no false
dependency ....
The false dependency erratum only applies if the result of the popcnt
is going to a different register: the processor thinks it depends on
the result in that register from a previous instruction, but it does
not (which is why it is called a false dependency). In this case it
actually does depend on the previous result, since the output register
is the same as the input register.
OUCH, my fault; sorry for the confusion and the wrong accusation.
Nevertheless GCC fails to optimise code properly:
--- .c ---
int ispowerof2(unsigned long long argument) {
return __builtin_popcountll(argument) == 1;
}
--- EOF ---
GCC 13.3 gcc -m32 -mpopcnt -O3
https://godbolt.org/z/fT7a7jP4e
ispowerof2(unsigned long long):
xor eax, eax
xor edx, edx
popcnt eax, [esp+4]
popcnt edx, [esp+8]
add eax, edx # eax is less than 64!
Less than or equal to 64 (consider the case when input is
(unsigned long long)-1)
cmp eax, 1 -> dec eax # 2 bytes shorter
sete al
movzx eax, al # superfluous
Not when dec is used. If you use dec and omit this instruction, you may
get a result value of 0xffffff00 (consider the case when input is
(unsigned long long)0).
ret
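
For anyone who wants to check the dec suggestion and the movzx remark
above without reading flags in a debugger, here is a rough C model of the
sequence. It is only a sketch: the function name model_ispowerof2 and its
two switches are made up for illustration, eax is modelled as a plain
uint32_t, and __builtin_popcount stands in for the two 32-bit popcnt
instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the 32-bit code sequence quoted in this thread.
 * use_dec:    model "dec eax" instead of "cmp eax, 1"
 * keep_movzx: model the trailing "movzx eax, al"
 * Returns the value left in eax at "ret". */
static uint32_t model_ispowerof2(uint64_t argument, int use_dec, int keep_movzx)
{
    /* popcnt eax, [esp+4]; popcnt edx, [esp+8]; add eax, edx */
    uint32_t eax = (uint32_t)__builtin_popcount((uint32_t)argument)
                 + (uint32_t)__builtin_popcount((uint32_t)(argument >> 32));
    uint32_t zf;

    assert(eax <= 64);  /* 64 is reached for (unsigned long long)-1 */

    if (use_dec) {
        eax -= 1;          /* dec eax writes eax; ZF is set if the result is 0 */
        zf = (eax == 0);
    } else {
        zf = (eax == 1);   /* cmp eax, 1 only sets flags, eax is unchanged */
    }

    eax = (eax & 0xffffff00u) | zf;  /* sete al replaces only the low byte */

    if (keep_movzx)
        eax &= 0xffu;                /* movzx eax, al clears bits 8..31 */

    return eax;
}
```

In this model, model_ispowerof2(0, 1, 0) leaves 0xffffff00 in eax, while
the cmp variant without movzx still returns 0 or 1, because the sum
never exceeds 64 and the upper 24 bits are therefore already zero.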
5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
kilo- or even megabytes, and they can extend code to cross a cache line
or a 16-byte alignment boundary.
JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
JFTR: GCC is notorious for generating superfluous MOVZX instructions
where its optimiser SHOULD be able to see that the value is already
less than 256!
Stefan