On 5/26/23 02:46, Stefan Kanthak wrote:

Hi,

compile the following function on a system with Core2 processor
(released January 2008) for the 32-bit execution environment:

--- demo.c ---
int ispowerof2(unsigned long long argument)
{
     return (argument & argument - 1) == 0;
}
--- EOF ---

GCC 13.3: gcc -m32 -O3 demo.c

NOTE: -mtune=native is the default!

# https://godbolt.org/z/b43cjGdY9
ispowerof2(unsigned long long):
         movq    xmm1, [esp+4]
         pcmpeqd xmm0, xmm0
         paddq   xmm0, xmm1
         pand    xmm0, xmm1
         movd    edx, xmm0      #    pxor    xmm1, xmm1
         psrlq   xmm0, 32       #    pcmpeqb xmm0, xmm1
         movd    eax, xmm0      #    pmovmskb eax, xmm0
         or      edx, eax       #    cmp     al, 255
         sete    al             #    sete    al
         movzx   eax, al        #
         ret

11 instructions in 40 bytes # 10 instructions in 36 bytes

You cannot delete the 'movzx eax, al' instruction. The expression "(argument & argument - 1) == 0" must evaluate to an int that is exactly 0 or 1. 'sete al' writes only the low 8 bits of eax, so the movzx is required to ensure that the upper 24 bits of the register are zeroed.
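
To illustrate, here is a rough C model of what those last two instructions do to eax. It is only a sketch: model_sete_al and model_movzx_eax_al are names made up for this example, and the starting values assume that pmovmskb left 0xFFFE in eax for argument = 3 and 0xFFFF for argument = 4 in your SSE2 variant.

```c
#include <stdint.h>

/* Hypothetical model of "sete al": only the low 8 bits of the 32-bit
   register change; the upper 24 bits keep whatever was there before. */
uint32_t model_sete_al(uint32_t eax, int zero_flag)
{
    return (eax & 0xFFFFFF00u) | (zero_flag ? 1u : 0u);
}

/* Hypothetical model of "movzx eax, al": zero-extend the low byte
   into the full 32-bit register. */
uint32_t model_movzx_eax_al(uint32_t eax)
{
    return eax & 0xFFu;
}
```

Under those assumptions, model_sete_al(0xFFFE, 0) yields 0xFF00, which a caller would treat as "true" even though 3 is not a power of 2. Applying model_movzx_eax_al to that value gives the 0 that the C expression requires.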


OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
       here instead of the native SSE4.1 alias "Penryn New Instruction Set"
       of the Core2 (and all later processors)?

OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
right side?

After correcting for the above error, your solution is the same size as the solution gcc generated. Therefore, the only remaining question would be "Is your solution faster than the code gcc produced?"

If you claim it is, I'd like to see evidence supporting that claim.
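
Something along these lines would be a starting point for collecting that evidence. It is a minimal sketch, not a rigorous benchmark: time_predicate is a helper invented for this example, it measures CPU time with the standard clock() function only, and it assumes the candidate is reachable through a plain C function pointer.

```c
#include <time.h>

/* Candidate under test; swap in a hand-written assembly version to compare. */
int ispowerof2(unsigned long long argument)
{
    return (argument & argument - 1) == 0;
}

/* Hypothetical helper: calls fn on 0 .. count-1, stores the number of
   calls that returned nonzero in *hits, and returns the CPU time used
   in seconds. */
double time_predicate(int (*fn)(unsigned long long),
                      unsigned long long count,
                      unsigned long long *hits)
{
    unsigned long long n = 0;
    clock_t start = clock();
    for (unsigned long long i = 0; i < count; i++)
        n += (fn(i) != 0);
    clock_t stop = clock();
    *hits = n;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

The hit count doubles as a correctness check, since both versions must report the same number. For a fair comparison the candidate should probably live in its own translation unit so the compiler cannot inline it into the loop, and the run should be repeated several times.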

Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
alias "Penryn New Instruction Set" of the Core2 processor:

GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c

# https://godbolt.org/z/svhEoYT11
ispowerof2(unsigned long long):
                                #    xor      eax, eax
         movq    xmm1, [esp+4]  #    movq     xmm1, [esp+4]
         pcmpeqd xmm0, xmm0     #    pcmpeqq  xmm0, xmm0
         paddq   xmm0, xmm1     #    paddq    xmm0, xmm1
         pand    xmm0, xmm1     #    ptest    xmm0, xmm1
         movd    edx, xmm0      #
         psrlq   xmm0, 32       #
         movd    eax, xmm0      #
         or      edx, eax       #
         sete    al             #    sete     al
         movzx   eax, al        #
         ret                    #    ret

11 instructions in 40 bytes    # 7 instructions in 26 bytes

OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

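As pointed out elsewhere in this thread, you used the wrong flags: -mtune only selects the processor to tune for, it does not enable any additional instruction sets. With flags that actually enable SSE4.1, I get: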

% gcc -march=x86-64 -msse4.1 -m32 -O3 -c ispowerof2.c  && objdump -d ispowerof2.o


ispowerof2.o:     file format elf32-i386


Disassembly of section .text:

00000000 <ispowerof2>:
   0:   f3 0f 7e 4c 24 04       movq   0x4(%esp),%xmm1
   6:   66 0f 76 c0             pcmpeqd %xmm0,%xmm0
   a:   31 c0                   xor    %eax,%eax
   c:   66 0f d4 c1             paddq  %xmm1,%xmm0
  10:   66 0f db c1             pand   %xmm1,%xmm0
  14:   66 0f 6c c0             punpcklqdq %xmm0,%xmm0
  18:   66 0f 38 17 c0          ptest  %xmm0,%xmm0
  1d:   0f 94 c0                sete   %al
  20:   c3                      ret

so with just the SSE4.1 instruction set enabled, the output is 33 bytes long (offsets 0x0 through 0x20).

Last compile with -mtune=i386 for the i386 processor:

GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c

# https://godbolt.org/z/e76W6dsMj
ispowerof2(unsigned long long):
         push    ebx            #
         mov     ecx, [esp+8]   #    mov    eax, [esp+4]
         mov     ebx, [esp+12]  #    mov    edx, [esp+8]
         mov     eax, ecx       #
         mov     edx, ebx       #
         add     eax, -1        #    add    eax, -1
         adc     edx, -1        #    adc    edx, -1
         and     eax, ecx       #    and    eax, [esp+4]
         and     edx, ebx       #    and    edx, [esp+8]
         or      eax, edx       #    or     eax, edx
         sete    al             #    neg    eax
         movzx   eax, al        #    sbb    eax, eax
         pop     ebx            #    inc    eax
         ret                    #    ret

14 instructions in 33 bytes    # 11 instructions in 32 bytes

OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous
       memory write?

At -O1 gcc produces:

% gcc -march=x86-64 -mtune=i386 -m32 -O -c ispowerof2.c  && objdump -Mintel -d ispowerof2.o

ispowerof2.o:     file format elf32-i386


Disassembly of section .text:

00000000 <ispowerof2>:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   8b 54 24 08             mov    edx,DWORD PTR [esp+0x8]
   8:   83 c0 ff                add    eax,0xffffffff
   b:   83 d2 ff                adc    edx,0xffffffff
   e:   23 44 24 04             and    eax,DWORD PTR [esp+0x4]
  12:   23 54 24 08             and    edx,DWORD PTR [esp+0x8]
  16:   09 d0                   or     eax,edx
  18:   0f 94 c0                sete   al
  1b:   0f b6 c0                movzx  eax,al
  1e:   c3                      ret

which is 1 instruction and 1 byte shorter than your proposed solution.

However, at -O2 or -O3 it produces the code you mention above. The reason for that is simple. It's faster to read from registers than it is to read from cache or RAM, and gcc is taking advantage of that fact when optimizing at -O2 or higher.


Stefan Kanthak
