On 5/26/23 02:46, Stefan Kanthak wrote:
Hi,
compile the following function on a system with Core2 processor
(released January 2008) for the 32-bit execution environment:
--- demo.c ---
int ispowerof2(unsigned long long argument)
{
return (argument & argument - 1) == 0;
}
--- EOF ---
GCC 13.3: gcc -m32 -O3 demo.c
NOTE: -mtune=native is the default!
# https://godbolt.org/z/b43cjGdY9
ispowerof2(unsigned long long):
movq xmm1, [esp+4]
pcmpeqd xmm0, xmm0
paddq xmm0, xmm1
pand xmm0, xmm1
movd edx, xmm0 # pxor xmm1, xmm1
psrlq xmm0, 32 # pcmpeqb xmm0, xmm1
movd eax, xmm0 # pmovmskb eax, xmm0
or edx, eax # cmp al, 255
sete al # sete al
movzx eax, al #
ret
11 instructions in 40 bytes # 10 instructions in 36 bytes
You cannot delete the 'movzx eax, al' instruction. The expression
"(argument & argument - 1) == 0" must evaluate to an int that is exactly
0 or 1, and sete writes only the low byte of eax. The movzx is required
to ensure that the upper 24 bits of the eax register are properly zeroed.
OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
here instead of the native SSE4.1 alias "Penryn New Instruction Set"
of the Core2 (and all later processors)?
OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
right side?
After correcting for the above error, your solution is the same size
as the solution gcc generated. Therefore, the only remaining question
would be "Is your solution faster than the code gcc produced?"
If you claim it is, I'd like to see evidence supporting that claim.
Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
alias "Penryn New Instruction Set" of the Core2 processor:
GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c
# https://godbolt.org/z/svhEoYT11
ispowerof2(unsigned long long):
# xor eax, eax
movq xmm1, [esp+4] # movq xmm1, [esp+4]
pcmpeqd xmm0, xmm0 # pcmpeqq xmm0, xmm0
paddq xmm0, xmm1 # paddq xmm0, xmm1
pand xmm0, xmm1 # ptest xmm0, xmm1
movd edx, xmm0 #
psrlq xmm0, 32 #
movd eax, xmm0 #
or edx, eax #
sete al # sete al
movzx eax, al #
ret # ret
11 instructions in 40 bytes # 7 instructions in 26 bytes
OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As pointed out elsewhere in this thread, you used the wrong flags. With
the proper flags, I get
% gcc -march=x86-64 -msse4.1 -m32 -O3 -c ispowerof2.c && objdump -d ispowerof2.o
ispowerof2.o: file format elf32-i386
Disassembly of section .text:
00000000 <ispowerof2>:
0: f3 0f 7e 4c 24 04 movq 0x4(%esp),%xmm1
6: 66 0f 76 c0 pcmpeqd %xmm0,%xmm0
a: 31 c0 xor %eax,%eax
c: 66 0f d4 c1 paddq %xmm1,%xmm0
10: 66 0f db c1 pand %xmm1,%xmm0
14: 66 0f 6c c0 punpcklqdq %xmm0,%xmm0
18: 66 0f 38 17 c0 ptest %xmm0,%xmm0
1d: 0f 94 c0 sete %al
20: c3 ret
so with just the SSE4.1 instruction set the output is 33 bytes long
(offsets 0x00 through 0x20) in 9 instructions.
Last compile with -mtune=i386 for the i386 processor:
GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c
# https://godbolt.org/z/e76W6dsMj
ispowerof2(unsigned long long):
push ebx #
mov ecx, [esp+8] # mov eax, [esp+4]
mov ebx, [esp+12] # mov edx, [esp+8]
mov eax, ecx #
mov edx, ebx #
add eax, -1 # add eax, -1
adc edx, -1 # adc edx, -1
and eax, ecx # and eax, [esp+4]
and edx, ebx # and edx, [esp+8]
or eax, edx # or eax, edx
sete al # neg eax
movzx eax, al # sbb eax, eax
pop ebx # inc eax
ret # ret
14 instructions in 33 bytes # 11 instructions in 32 bytes
OUCH: why does GCC abuse EBX (and ECX too) and perform a superfluous
memory write?
At -O1 gcc produces:
% gcc -march=x86-64 -mtune=i386 -m32 -O -c ispowerof2.c && objdump -Mintel -d ispowerof2.o
ispowerof2.o: file format elf32-i386
Disassembly of section .text:
00000000 <ispowerof2>:
0: 8b 44 24 04 mov eax,DWORD PTR [esp+0x4]
4: 8b 54 24 08 mov edx,DWORD PTR [esp+0x8]
8: 83 c0 ff add eax,0xffffffff
b: 83 d2 ff adc edx,0xffffffff
e: 23 44 24 04 and eax,DWORD PTR [esp+0x4]
12: 23 54 24 08 and edx,DWORD PTR [esp+0x8]
16: 09 d0 or eax,edx
18: 0f 94 c0 sete al
1b: 0f b6 c0 movzx eax,al
1e: c3 ret
which is 1 instruction and 1 byte shorter than your proposed solution.
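However, at -O2 or -O3 it produces the code you mention above. The
reason for that is simple. It's faster to read from registers than it is
to read from cache or RAM, and gcc is taking advantage of that fact when
optimizing at -O2 or higher: it loads each half of the argument from the
stack once, into ecx and ebx, and reuses those registers for the 'and'
instead of reading the stack slots a second time.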
Stefan Kanthak