On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote: > > "Andrew Pinski" <pins...@gmail.com> wrote: > > > On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kant...@nexgo.de> > > wrote: > >> > >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers > >> are: > >> > >> --- dontcare.c --- > >> int ispowerof2(unsigned __int128 argument) { > >> return __builtin_popcountll(argument) + __builtin_popcountll(argument > >> >> 64) == 1; > >> } > >> --- EOF --- > >> > >> GCC 13.3 gcc -march=haswell -O3 > >> > >> https://gcc.godbolt.org/z/PPzYsPzMc > >> ispowerof2(unsigned __int128): > >> popcnt rdi, rdi > >> popcnt rsi, rsi > >> add esi, edi > >> xor eax, eax > >> cmp esi, 1 > >> sete al > >> ret > >> > >> OOPS: what about Intel's CPU errata regarding the false dependency on > >> POPCNTs output? > > > > Because the popcount is going to the same register, there is no false > > dependency .... > > The false dependency errata only applies if the result of the popcnt > > is going to a different register, the processor thinks it depends on > > the result in that register from a previous instruction but it does > > not (which is why it is called a false dependency). In this case it > > actually does depend on the previous result since the input is the > > same as the input. > > OUCH, my fault; sorry for the confusion and the wrong accusation. > > Nevertheless GCC fails to optimise code properly: > > --- .c --- > int ispowerof2(unsigned long long argument) { > return __builtin_popcountll(argument) == 1; > } > --- EOF --- > > GCC 13.3 gcc -m32 -mpopcnt -O3 > > https://godbolt.org/z/fT7a7jP4e > ispowerof2(unsigned long long): > xor eax, eax > xor edx, edx > popcnt eax, [esp+4] > popcnt edx, [esp+8] > add eax, edx # eax is less than 64! > cmp eax, 1 -> dec eax # 2 bytes shorter
dec eax is done for -Os already. -O2 means performance, it does not mean decrease size. dec can be slower as it can create a false dependency and it requires eax register to be not alive at the end of the statement. and IIRC for x86 decode, it could cause 2 (not 1) micro-ops. > sete al > movzx eax, al # superfluous No it is not superfluous, well ok it is because of the context of eax (besides the lower 8 bits) are already zero'd but keeping that track is a hard problem and is turning problem really. And I suspect it would cause another false dependency later on too. For -Os -march=skylake (and -Oz instead of -Os) we get: popcnt rdi, rdi popcnt rsi, rsi add esi, edi xor eax, eax dec esi sete al Which is exactly what you want right? Thanks, Andrew > ret > > 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to > kilo- or even megabytes, and they can extend code to cross a cache line > or a 16-byte alignment boundary. > > JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument > > JFTR: GCC is notorious for generating superfluous MOVZX instructions > where its optimiser SHOULD be able see that the value is already > less than 256! > > Stefan