On 5/27/23 18:52, Stefan Kanthak wrote:
"Andrew Pinski" <pins...@gmail.com> wrote:

On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote:
Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:

--- dontcare.c ---
int ispowerof2(unsigned __int128 argument) {
     return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
}
--- EOF ---

GCC 13.3    gcc -march=haswell -O3

https://gcc.godbolt.org/z/PPzYsPzMc
ispowerof2(unsigned __int128):
         popcnt  rdi, rdi
         popcnt  rsi, rsi
         add     esi, edi
         xor     eax, eax
         cmp     esi, 1
         sete    al
         ret

OOPS: what about Intel's CPU erratum regarding the false dependency on
POPCNT's output?
Because the popcount is going to the same register, there is no false
dependency ....
The false dependency erratum only applies if the result of the popcnt
is going to a different register: the processor thinks it depends on
the result in that register from a previous instruction, but it does
not (which is why it is called a false dependency). In this case it
actually does depend on the previous result, since the output register
is the same as the input register.
OUCH, my fault; sorry for the confusion and the wrong accusation.

Nevertheless GCC fails to optimise code properly:

--- .c ---
int ispowerof2(unsigned long long argument) {
     return __builtin_popcountll(argument) == 1;
}
--- EOF ---

GCC 13.3    gcc -m32 -mpopcnt -O3

https://godbolt.org/z/fT7a7jP4e
ispowerof2(unsigned long long):
         xor     eax, eax
         xor     edx, edx
         popcnt  eax, [esp+4]
         popcnt  edx, [esp+8]
         add     eax, edx                 # eax is less than 64!
Less than or equal to 64 (consider the case when input is (unsigned long long)-1)
         cmp     eax, 1    ->    dec eax  # 2 bytes shorter
         sete    al
         movzx   eax, al                  # superfluous
Not when dec is used. If you use dec and omit this instruction, you get a result value of 0xffffff00 when the input is (unsigned long long)0.
         ret
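
To make the interaction between the two proposed changes concrete, here is
a minimal sketch in C that models only the contents of eax at the end of the
listing. It is not compiler output; the function names are hypothetical, and
it assumes the usual x86 semantics that sete writes only the low byte (al)
and leaves bits 8..31 of eax unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "cmp eax, 1; sete al" with no movzx afterwards.
   cmp leaves eax untouched; sete overwrites only the low byte. */
static uint32_t tail_cmp_sete(uint32_t eax)
{
    uint32_t al = (eax == 1u);           /* ZF from cmp eax, 1 */
    return (eax & 0xffffff00u) | al;     /* bits 8..31 are kept */
}

/* Model of "dec eax; sete al" with no movzx afterwards.
   dec writes eax, so bits 8..31 now come from eax - 1. */
static uint32_t tail_dec_sete(uint32_t eax)
{
    eax -= 1u;                           /* wraps to 0xffffffff for 0 */
    uint32_t al = (eax == 0u);           /* ZF from dec */
    return (eax & 0xffffff00u) | al;
}

/* Model of "dec eax; sete al; movzx eax, al". */
static uint32_t tail_dec_sete_movzx(uint32_t eax)
{
    return tail_dec_sete(eax) & 0xffu;   /* movzx clears bits 8..31 */
}

/* Checks every possible popcount sum of a 64-bit value (0..64). */
static void check_all_counts(void)
{
    for (uint32_t n = 0; n <= 64; n++) {
        uint32_t want = (n == 1u);
        assert(tail_cmp_sete(n) == want);        /* movzx not needed */
        assert(tail_dec_sete_movzx(n) == want);  /* movzx needed */
    }
    assert(tail_dec_sete(0) == 0xffffff00u);     /* the failing case */
}
```

In this model the cmp variant is correct without movzx for every sum from 0
to 64, while the dec variant is wrong without movzx only for a sum of 0, so
the two savings cannot both be taken as written.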

5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
kilo- or even megabytes, and they can extend code to cross a cache line
or a 16-byte alignment boundary.

JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
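
For reference, the 32-bit case mentioned above would presumably look like
the following sketch. The function name is hypothetical; only the expression
"__builtin_popcount(argument) == 1" comes from the message, and the builtin
is assumed to be available (GCC or Clang).

```c
/* Hypothetical 32-bit counterpart of ispowerof2(): returns 1 when
   exactly one bit is set in the argument, 0 otherwise. */
int ispowerof2_32(unsigned int argument)
{
    return __builtin_popcount(argument) == 1;
}
```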

JFTR: GCC is notorious for generating superfluous MOVZX instructions
       where its optimiser SHOULD be able to see that the value is already
       less than 256!

Stefan
