"Nathan Sidwell" <[email protected]> wrote:
> On 8/14/20 12:43 PM, Stefan Kanthak wrote:
>> Hi @ll,
>>
>> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
>> Matt Godbolt used the function
>>
>> | bool isWhitespace(char c)
>> | {
>> | return c == ' '
>> | || c == '\r'
>> | || c == '\n'
>> | || c == '\t';
>> | }
>>
>> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> processors (see <https://godbolt.org/z/acm19_conds>):
>>
>> | xor eax, eax ; result = false
>> | cmp dil, 32 ; is c > 32
>> | ja .L4 ; if so, exit with false
>> | movabs rax, 4294977024 ; rax = 0x100002600
>> | shrx rax, rax, rdi ; rax >>= c
>> | and eax, 1 ; result = rax & 1
>> |.L4:
>> | ret
>>
>> This code is but not optimal!
>
> What evidence do you have that your alternative sequence performs
> better?
45+ years experience in writing assembly code!
> Have you benchmarked it?
Of course! Did you?
I didn't include the numbers in my initial post since I don't have
a processor which supports BMI2 and thus can't run the original code.
I benchmarked the following equivalent code (input character is in
ECX instead of EDI):
GCC: x64: x86:
xor eax, eax movabs rax, 100002600h mov eax, 2600h
cmp cl, 32 test cl, cl
ja L4 setnz al
movabs rax, 100002600h shr rax, cl shr eax, cl
shr rax, cl xor edx, edx xor edx, edx
and eax, 1 cmp ecx, 33 cmp edx, 33
setb dl setb dl
L4: and eax, edx and eax, edx
ret ret ret
Left column 1 billion sequential characters
for (int i=1000000000; i; --i) ...(i);
right column 1 billion random characters, in cycles per character:
GCC: 2.4 3.4
x64: 3.0 2.5
x86: 4.3 4.5
> (I tried, but your code doesn't assemble)
Wrong: YOUR code doesn't assemble, mine does! And it works well too.
> It is more instructions
Because I dared to show code for the old(er) i386 alias x86 processor,
not for the AMD64 alias x86_64.
I expect people commenting on assembly to understand (just a little bit)
of it and to get the idea of the code I posted first, then eventually be
able to come up with the following x86_64 code (without a fancy SHRX,
i.e. for ALL AMD64 processors, not just those supporting BMI2):
mov ecx, edi
movabs rax, 100002600h ; rax = (1 << ' ') | (1 << '\r') | (1 <<
'\n') | (1 << '\t')
shr rax, cl ; rax >>= c
xor edi, edi
cmp ecx, 33 ; CF = c <= ' '
.if 0
adc edi, edi ; edi = (c <= ' ')
.else
setb dil
.endif
and eax, edi ; result = rax & (c <= ' ')
ret
If someone (or GCC) but really insists to (ab)use SHRX:
xor ecx, ecx
cmp dil, 33 ; is !(c > 32)
setb cl
movabs rax, 4294977024 ; rax = 0x100002600
shrx rax, rax, rdi ; rax >>= c
and eax, ecx ; result = rax & (c <= ' ')
ret
Now start counting: first the instructions, then the cycles lost
due to wrong branch prediction (hint: 0 for my code).
> and cannot speculate past the setnz
OUCH!
Unfortunately, AMD64 and i386 processors are not THAT retarded.
> (As I understand it, x86_64 speculates branch instructions, but
> doesn't speculate cmov -- so perversely branches are faster!)
There's no CMOV in my code.
What are you trying to demonstrate?
JFTR: how much speculative executed instructions are discarded if
the branch prediction is wrong?
>> The following equivalent and branchless code works on i386 too,
>> it needs neither an AMD64 processor nor the SHRX instruction,
>> which is not available on older processors:
>
>>
>> mov ecx, edi
>> mov eax, 2600h ; eax = (1 << '\r') | (1 << '\n') | (1 <<
>> '\t')
>> test cl, cl
>> setnz al ; eax |= (c != '\0')
>> shr eax, cl ; eax >>= (c % ' ')
>
> ^^ operand type mismatch on this instruction
I recommend to fix YOUR typo!
I bet you wrote "shr eax, c" instead of "shr eax, cl" there.
>> xor edx, edx
>> cmp ecx, 33 ; CF = c <= ' '
>> adc edx, edx ; edx = (c <= ' ')
>> and eax, edx
>> ret
Stefan