Hi, compile the following function on a system with a Core2 processor (released January 2008) for the 32-bit execution environment:

--- demo.c ---
int ispowerof2(unsigned long long argument) {
    return (argument & argument - 1) == 0;
}
--- EOF ---

GCC 13.3: gcc -m32 -O3 demo.c

NOTE: -mtune=native is the default!

# https://godbolt.org/z/b43cjGdY9
ispowerof2(unsigned long long):
        movq    xmm1, [esp+4]
        pcmpeqd xmm0, xmm0
        paddq   xmm0, xmm1
        pand    xmm0, xmm1
        movd    edx, xmm0           #       pxor     xmm1, xmm1
        psrlq   xmm0, 32            #       pcmpeqb  xmm0, xmm1
        movd    eax, xmm0           #       pmovmskb eax, xmm0
        or      edx, eax            #       cmp      al, 255
        sete    al                  #       sete     al
        movzx   eax, al             #
        ret

11 instructions in 40 bytes         # 10 instructions in 36 bytes

OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
      here instead of the native SSE4.1 alias "Penryn New Instruction Set"
      of the Core2 (and all later processors)?

OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on
      the right side?

Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
alias "Penryn New Instruction Set" of the Core2 processor:

GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c

# https://godbolt.org/z/svhEoYT11
ispowerof2(unsigned long long):
                                    #       xor      eax, eax
        movq    xmm1, [esp+4]       #       movq     xmm1, [esp+4]
        pcmpeqd xmm0, xmm0          #       pcmpeqq  xmm0, xmm0
        paddq   xmm0, xmm1          #       paddq    xmm0, xmm1
        pand    xmm0, xmm1          #       ptest    xmm0, xmm1
        movd    edx, xmm0           #
        psrlq   xmm0, 32            #
        movd    eax, xmm0           #
        or      edx, eax            #
        sete    al                  #       sete     al
        movzx   eax, al             #
        ret                         #       ret

11 instructions in 40 bytes         # 7 instructions in 26 bytes

OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Last compile with -mtune=i386 for the i386 processor:

GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c

# https://godbolt.org/z/e76W6dsMj
ispowerof2(unsigned long long):
        push    ebx                 #
        mov     ecx, [esp+8]        #       mov     eax, [esp+4]
        mov     ebx, [esp+12]       #       mov     edx, [esp+8]
        mov     eax, ecx            #
        mov     edx, ebx            #
        add     eax, -1             #       add     eax, -1
        adc     edx, -1             #       adc     edx, -1
        and     eax, ecx            #       and     eax, [esp+4]
        and     edx, ebx            #       and     edx, [esp+8]
        or      eax, edx            #       or      eax, edx
        sete    al                  #       neg     eax
        movzx   eax, al             #       sbb     eax, eax
        pop     ebx                 #       inc     eax
        ret                         #       ret

14 instructions in 33 bytes         # 11 instructions in 32 bytes

OUCH: why does GCC abuse EBX (and ECX too) and perform a superfluous
      memory write?

Stefan Kanthak
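
PS: a minimal sketch, not part of demo.c, for checking the shorter sequence from the right-hand column of the -mtune=i386 listing against the C function. It models that sequence in C on two 32-bit halves, one statement per instruction. The names ispowerof2_model() and count_mismatches() are hypothetical and exist only for this illustration; the carry flag handling assumes the usual i386 semantics of ADD, ADC, NEG and SBB.

```c
#include <stdint.h>

/* The function from demo.c, unchanged. */
int ispowerof2(unsigned long long argument) {
    return (argument & argument - 1) == 0;
}

/* Hypothetical helper (an assumption for illustration, not from demo.c):
   C model of the right-hand column of the -mtune=i386 listing. */
int ispowerof2_model(unsigned long long argument) {
    uint32_t lo = (uint32_t)argument;           /* low  half, [esp+4] */
    uint32_t hi = (uint32_t)(argument >> 32);   /* high half, [esp+8] */
    uint32_t eax, edx, cf;

    eax = lo;                     /* mov eax, [esp+4] */
    edx = hi;                     /* mov edx, [esp+8] */
    cf  = (eax != 0);             /* eax + 0xFFFFFFFF carries unless eax == 0 */
    eax += 0xFFFFFFFFu;           /* add eax, -1 */
    edx += 0xFFFFFFFFu + cf;      /* adc edx, -1 */
    eax &= lo;                    /* and eax, [esp+4] */
    edx &= hi;                    /* and edx, [esp+8] */
    eax |= edx;                   /* or  eax, edx */
    cf  = (eax != 0);             /* neg eax sets CF unless eax == 0 */
    eax = 0u - eax;               /* neg eax */
    eax = 0u - cf;                /* sbb eax, eax gives eax = -CF */
    eax += 1;                     /* inc eax gives 1 - CF */
    return (int)eax;              /* ret */
}

/* Hypothetical helper: compares both functions for 2^n - 1, 2^n and
   2^n + 1 (n = 0..63) and for the all-ones value; returns the number
   of disagreements. */
unsigned count_mismatches(void) {
    unsigned mismatches = 0;
    for (int bit = 0; bit < 64; ++bit) {
        unsigned long long p = 1ULL << bit;
        unsigned long long samples[3] = { p - 1, p, p + 1 };
        for (int i = 0; i < 3; ++i)
            if (ispowerof2(samples[i]) != ispowerof2_model(samples[i]))
                ++mismatches;
    }
    if (ispowerof2(~0ULL) != ispowerof2_model(~0ULL))
        ++mismatches;
    return mismatches;
}
```

Calling count_mismatches() from a small main() and printing the result is enough to run the check. Note that, like the function from demo.c, the model returns 1 for an argument of 0.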