On Thu, May 25, 2023 at 11:56 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote:
>
> Hi,
>
> compile the following function on a system with Core2 processor
> (released January 2008) for the 32-bit execution environment:
>
> --- demo.c ---
> int ispowerof2(unsigned long long argument)
> {
>     return (argument & argument - 1) == 0;
> }
> --- EOF ---
>
> GCC 13.3: gcc -m32 -O3 demo.c
>
> NOTE: -mtune=native is the default!

You need to use -march=native, not -mtune=native, to turn on the
architecture features: -mtune tunes the generated code for a CPU but
never changes the set of instructions GCC is allowed to use.

Thanks,
Andrew

>
> # https://godbolt.org/z/b43cjGdY9
> ispowerof2(unsigned long long):
>         movq    xmm1, [esp+4]
>         pcmpeqd xmm0, xmm0
>         paddq   xmm0, xmm1
>         pand    xmm0, xmm1
>         movd    edx, xmm0       #   pxor    xmm1, xmm1
>         psrlq   xmm0, 32        #   pcmpeqb xmm0, xmm1
>         movd    eax, xmm0       #   pmovmskb eax, xmm0
>         or      edx, eax        #   cmp     al, 255
>         sete    al              #   sete    al
>         movzx   eax, al         #
>         ret
>
> 11 instructions in 40 bytes     # 10 instructions in 36 bytes
>
> OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
>       here instead of the native SSE4.1 alias "Penryn New Instruction Set"
>       of the Core2 (and all later processors)?
>
> OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
>       right side?
>
>
> Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
> alias "Penryn New Instruction Set" of the Core2 processor:
>
> GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c
>
> # https://godbolt.org/z/svhEoYT11
> ispowerof2(unsigned long long):
>                                 #   xor     eax, eax
>         movq    xmm1, [esp+4]   #   movq    xmm1, [esp+4]
>         pcmpeqd xmm0, xmm0      #   pcmpeqq xmm0, xmm0
>         paddq   xmm0, xmm1      #   paddq   xmm0, xmm1
>         pand    xmm0, xmm1      #   ptest   xmm0, xmm1
>         movd    edx, xmm0       #
>         psrlq   xmm0, 32        #
>         movd    eax, xmm0       #
>         or      edx, eax        #
>         sete    al              #   sete    al
>         movzx   eax, al         #
>         ret                     #   ret
>
> 11 instructions in 40 bytes     # 7 instructions in 26 bytes
>
> OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
>       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Last compile with -mtune=i386 for the i386 processor:
>
> GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c
>
> # https://godbolt.org/z/e76W6dsMj
> ispowerof2(unsigned long long):
>         push    ebx             #
>         mov     ecx, [esp+8]    #   mov     eax, [esp+4]
>         mov     ebx, [esp+12]   #   mov     edx, [esp+8]
>         mov     eax, ecx        #
>         mov     edx, ebx        #
>         add     eax, -1         #   add     eax, -1
>         adc     edx, -1         #   adc     edx, -1
>         and     eax, ecx        #   and     eax, [esp+4]
>         and     edx, ebx        #   and     edx, [esp+8]
>         or      eax, edx        #   or      eax, edx
>         sete    al              #   neg     eax
>         movzx   eax, al         #   sbb     eax, eax
>         pop     ebx             #   inc     eax
>         ret                     #   ret
>
> 14 instructions in 33 bytes     # 11 instructions in 32 bytes
>
> OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous
>       memory write?
>
>
> Stefan Kanthak