On Thu, May 25, 2023 at 11:56 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote:
>
> Hi,
>
> compile the following function on a system with Core2 processor
> (released January 2008) for the 32-bit execution environment:
>
> --- demo.c ---
> int ispowerof2(unsigned long long argument)
> {
>     return (argument & argument - 1) == 0;
> }
> --- EOF ---
>
> GCC 13.3: gcc -m32 -O3 demo.c
>
> NOTE: -mtune=native is the default!

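You need to use -march=native, not -mtune=native, to turn on the
architecture features. -mtune only selects the cost model and scheduling
that GCC optimizes for; it does not change which instructions GCC is
allowed to emit. Also, unless your GCC was configured otherwise, the
default tuning for x86 is -mtune=generic, not -mtune=native.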
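
A quick way to see which instruction sets a given command line really
enables is to look at the predefined macros. The following is only a
minimal sketch: the helper names sse2_enabled and sse41_enabled are made
up for illustration, while __SSE2__ and __SSE4_1__ are the macros GCC
predefines when the corresponding extension is enabled.

```c
/* Report at compile time which SSE levels the compiler may emit.
 * The function names are hypothetical; only the __SSE2__ and
 * __SSE4_1__ macros come from GCC itself. */

/* Returns 1 if this translation unit was compiled with SSE2 enabled. */
int sse2_enabled(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if this translation unit was compiled with SSE4.1 enabled. */
int sse41_enabled(void)
{
#ifdef __SSE4_1__
    return 1;
#else
    return 0;
#endif
}
```

Printing both values from main() and building once with only a -mtune=
option and once with -msse4.1 (or -march=native on a machine that has
SSE4.1) should show that only the latter changes the second result.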

Thanks,
Andrew

>
> # https://godbolt.org/z/b43cjGdY9
> ispowerof2(unsigned long long):
>         movq    xmm1, [esp+4]
>         pcmpeqd xmm0, xmm0
>         paddq   xmm0, xmm1
>         pand    xmm0, xmm1
>         movd    edx, xmm0      #    pxor    xmm1, xmm1
>         psrlq   xmm0, 32       #    pcmpeqb xmm0, xmm1
>         movd    eax, xmm0      #    pmovmskb eax, xmm0
>         or      edx, eax       #    cmp     al, 255
>         sete    al             #    sete    al
>         movzx   eax, al        #
>         ret
>
> 11 instructions in 40 bytes    # 10 instructions in 36 bytes
>
> OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
>       here instead of the native SSE4.1 alias "Penryn New Instruction Set"
>       of the Core2 (and all later processors)?
>
> OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
>       right side?
>
>
> Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
> alias "Penryn New Instruction Set" of the Core2 processor:
>
> GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c
>
> # https://godbolt.org/z/svhEoYT11
> ispowerof2(unsigned long long):
>                                #    xor      eax, eax
>         movq    xmm1, [esp+4]  #    movq     xmm1, [esp+4]
>         pcmpeqd xmm0, xmm0     #    pcmpeqq  xmm0, xmm0
>         paddq   xmm0, xmm1     #    paddq    xmm0, xmm1
>         pand    xmm0, xmm1     #    ptest    xmm0, xmm1
>         movd    edx, xmm0      #
>         psrlq   xmm0, 32       #
>         movd    eax, xmm0      #
>         or      edx, eax       #
>         sete    al             #    sete     al
>         movzx   eax, al        #
>         ret                    #    ret
>
> 11 instructions in 40 bytes    # 7 instructions in 26 bytes
>
> OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
>       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
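
For comparison, the ptest-based sequence from the right-hand column can
be requested explicitly with intrinsics. This is a rough sketch, not a
claim about what GCC should generate on its own: the name
ispowerof2_ptest is invented here, the target attribute enables SSE4.1
for this one function regardless of -march, and running it assumes a CPU
that actually supports SSE4.1. On non-x86 targets the sketch falls back
to the plain C expression from demo.c.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <smmintrin.h>  /* SSE4.1 intrinsics (_mm_testz_si128) */

/* Hypothetical intrinsic version of ispowerof2() from demo.c. */
__attribute__((target("sse4.1")))
int ispowerof2_ptest(unsigned long long argument)
{
    /* movq: load the 64-bit argument into the low half, upper half = 0 */
    __m128i x = _mm_loadl_epi64((const __m128i *)&argument);
    /* all-ones constant plus paddq: low 64 bits become argument - 1 */
    __m128i xm1 = _mm_add_epi64(x, _mm_set1_epi32(-1));
    /* ptest: yields 1 exactly when (x & xm1) is zero in all 128 bits;
     * the upper half of x is zero, so only the low 64 bits matter */
    return _mm_testz_si128(x, xm1);
}
#else
/* Portable fallback: same expression as in demo.c. */
int ispowerof2_ptest(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}
#endif
```

Like the original function, this returns 1 for an argument of 0 as well
as for true powers of two.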
>
> Last compile with -mtune=i386 for the i386 processor:
>
> GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c
>
> # https://godbolt.org/z/e76W6dsMj
> ispowerof2(unsigned long long):
>         push    ebx            #
>         mov     ecx, [esp+8]   #    mov    eax, [esp+4]
>         mov     ebx, [esp+12]  #    mov    edx, [esp+8]
>         mov     eax, ecx       #
>         mov     edx, ebx       #
>         add     eax, -1        #    add    eax, -1
>         adc     edx, -1        #    adc    edx, -1
>         and     eax, ecx       #    and    eax, [esp+4]
>         and     edx, ebx       #    and    edx, [esp+8]
>         or      eax, edx       #    or     eax, edx
>         sete    al             #    neg    eax
>         movzx   eax, al        #    sbb    eax, eax
>         pop     ebx            #    inc    eax
>         ret                    #    ret
>
> 14 instructions in 33 bytes    # 11 instructions in 32 bytes
>
> OUCH: why does GCC abuse EBX (and ECX too) and perform a superfluous
>       memory write?
>
>
> Stefan Kanthak
