"Andrew Pinski" <pins...@gmail.com> wrote:

> On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kant...@nexgo.de>
> wrote:
[...]

>> Nevertheless GCC fails to optimise code properly:
>>
>> --- .c ---
>> int ispowerof2(unsigned long long argument) {
>>     return __builtin_popcountll(argument) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3    gcc -m32 -mpopcnt -O3
>>
>> https://godbolt.org/z/fT7a7jP4e
>> ispowerof2(unsigned long long):
>>         xor     eax, eax
>>         xor     edx, edx
>>         popcnt  eax, [esp+4]
>>         popcnt  edx, [esp+8]
>>         add     eax, edx                 # eax is less than 64!
>>         cmp     eax, 1    ->    dec eax  # 2 bytes shorter
>
> dec eax is done for -Os already. -O2 means performance, it does not
> mean decrease size. dec can be slower as it can create a false
> dependency and it requires eax register to be not alive at the end of
> the statement. and IIRC for x86 decode, it could cause 2 (not 1)
> micro-ops.

It CAN, it COULD, but it does NOT NEED to: it all depends on the target
processor. Shall I add an example with -march=<not affected processor>?

>>         sete    al

Depending on the target processor the partial register can also harm
the performance.
Did you forget to mention that too?

>>         movzx   eax, al                  # superfluous
>
> No it is not superfluous, well ok it is because of the context of eax
> (besides the lower 8 bits) are already zero'd

Correct.
The same holds for example for PMOVMSKB when the high(er) lane(s) of the
source [XYZ]MM register are (known to be) 0, for example after MOVQ;
that's what GCC also fails to track.

> but keeping that track is a hard problem and is turning problem really.

Aren't such problems just there to be solved?

> And I suspect it would cause another false dependency later on too.

All these quirks can be avoided with the following 6-byte code sequence
(same size as SETcc plus MOVZX) I used in one of my previous posts to
fold any non-zero value to 1:

        neg     eax
        sbb     eax, eax
        neg     eax

No partial register writes, no false dependencies, no INC/DEC subtleties.

JFTR: AMD documents that SBB with same destination and source is handled
      in the register renamer; I suspect Intel processors do it too,
      albeit not documented.
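
As a rough illustration of why the NEG/SBB/NEG sequence folds any non-zero value to 1, the following C sketch models the three instructions step by step. The function names fold_nonzero_to_one and check_fold are hypothetical and not taken from the thread; the carry flag is modelled by hand as an ordinary variable, on the assumption that NEG sets CF exactly when its operand is non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model (not from the thread) of
 *         neg     eax
 *         sbb     eax, eax
 *         neg     eax
 * on a 32-bit value, with the carry flag tracked by hand. */
uint32_t fold_nonzero_to_one(uint32_t eax)
{
    uint32_t cf = (eax != 0);   /* NEG sets CF iff its operand was non-zero */
    eax = 0u - eax;             /* neg eax                                  */
    eax = eax - eax - cf;       /* sbb eax, eax: yields 0 or 0xFFFFFFFF     */
    eax = 0u - eax;             /* neg eax: yields 0 or 1                   */
    return eax;
}

/* Compare the model against the plain C expression (x != 0). */
void check_fold(void)
{
    const uint32_t samples[] = { 0u, 1u, 2u, 63u, 64u, 0x80000000u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(fold_nonzero_to_one(samples[i]) == (uint32_t)(samples[i] != 0));
}
```

The model only covers the arithmetic result; it says nothing about latency, register renaming or micro-op counts, which depend on the target processor.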
> For -Os -march=skylake (and -Oz instead of -Os) we get:
>         popcnt  rdi, rdi
>         popcnt  rsi, rsi
>         add     esi, edi
>         xor     eax, eax
>         dec     esi
>         sete    al
>
> Which is exactly what you want right?

Yes.
For -m32 -Os/-Oz, AND if CDQ breaks the dependency, it should be

        xor     eax, eax
        xor     edx, edx    ->    cdq      # 1 byte shorter
        popcnt  eax, [esp+4]
        popcnt  edx, [esp+8]
        add     eax, edx                   # eax is less than 64!
        cmp     eax, 1      ->    dec eax  # 2 bytes shorter

On AMD64 DEC <r32> is a 2-byte instruction; the following alternative
code avoids its potential false dependency as well as other possible
quirks, and also suits -Ot, -O2 and -O3 on processors where the register
renamer handles the XOR:

        popcnt  rdi, rdi
        popcnt  rsi, rsi
        xor     eax, eax
        not     edi         # edi = -(edi + 1)
        sub     edi, esi    # edi = -(edi + 1 + esi)
        setz    al

For processors where the register renamer doesn't "execute" XOR, but MOV,
the following code is an alternative for -Ot, -O2 and -O3:

        popcnt  rdi, rdi
        popcnt  rsi, rsi
        mov     eax, edi
        add     eax, esi
        cmp     eax, 1
        setz    al

Stefan
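
For reference, a minimal C sketch of the computation that the -m32 sequences perform: one population count per 32-bit half, an addition, and a comparison against 1. The names ispowerof2_halves, ispowerof2_reference and check_ispowerof2 are hypothetical and not from the thread; the sketch assumes GCC or Clang (for __builtin_popcount) and a 32-bit unsigned int, as on x86.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mirrors the -m32 shape (two POPCNTs, ADD, compare). */
int ispowerof2_halves(uint64_t argument)
{
    uint32_t lo = (uint32_t)argument;           /* [esp+4] */
    uint32_t hi = (uint32_t)(argument >> 32);   /* [esp+8] */
    /* Each count is at most 32, so the sum cannot overflow. */
    unsigned sum = (unsigned)__builtin_popcount(lo)
                 + (unsigned)__builtin_popcount(hi);
    return sum == 1;
}

/* Classic bit trick, used here only as an independent reference. */
int ispowerof2_reference(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Cross-check both versions on every single-bit value and some neighbours. */
void check_ispowerof2(void)
{
    assert(ispowerof2_halves(0) == ispowerof2_reference(0));
    for (unsigned bit = 0; bit < 64; ++bit) {
        uint64_t x = (uint64_t)1 << bit;
        assert(ispowerof2_halves(x) == 1);
        assert(ispowerof2_halves(x | 1) == ispowerof2_reference(x | 1));
        assert(ispowerof2_halves(x - 1) == ispowerof2_reference(x - 1));
    }
}
```

Because the sum stays small, testing whether it equals 1 and testing whether the sum minus 1 is zero are interchangeable, which is the equivalence behind the CMP versus DEC discussion.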