"Nathan Sidwell" <nat...@acm.org> wrote:

> On 8/14/20 12:43 PM, Stefan Kanthak wrote:
>> Hi @ll,
>> 
>> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
>> Matt Godbolt used the function
>> 
>> | bool isWhitespace(char c)
>> | {
>> |     return c == ' '
>> |       || c == '\r'
>> |       || c == '\n'
>> |       || c == '\t';
>> | }
>> 
>> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> processors (see <https://godbolt.org/z/acm19_conds>):
>> 
>> |    xor    eax, eax              ; result = false
>> |    cmp    dil, 32               ; is c > 32
>> |    ja     .L4                   ; if so, exit with false
>> |    movabs rax, 4294977024       ; rax = 0x100002600
>> |    shrx   rax, rax, rdi         ; rax >>= c
>> |    and    eax, 1                ; result = rax & 1
>> |.L4:
>> |    ret
>> 
>> This code is but not optimal!
> 
> What evidence do you have that your alternative sequence performs
> better?

45+ years experience in writing assembly code!

> Have you benchmarked it?

Of course! Did you?
I didn't include the numbers in my initial post since I don't have
a processor which supports BMI2 and thus can't run the original code.
I benchmarked the following equivalent code (input character is in
ECX instead of EDI):

GCC:                      x64:                       x86:
    xor    eax, eax           movabs rax, 100002600h     mov   eax, 2600h
    cmp    cl, 32                                        test  cl, cl
    ja     L4                                            setnz al
    movabs rax, 100002600h    shr    rax, cl             shr   eax, cl
    shr    rax, cl            xor    edx, edx            xor   edx, edx
    and    eax, 1             cmp    ecx, 33             cmp   edx, 33
                              setb   dl                  setb  dl
L4:                           and    eax, edx            and   eax, edx
    ret                       ret                        ret

Left column 1 billion sequential characters
    for (int i=1000000000; i; --i) ...(i);
right column 1 billion random characters, in cycles per character:

GCC:   2.4    3.4
x64:   3.0    2.5
x86:   4.3    4.5

> (I tried, but your code doesn't assemble)

Wrong: YOUR code doesn't assemble, mine does! And it works well too.

> It is more instructions

Because I dared to show code for the old(er) i386 alias x86 processor,
not for the AMD64 alias x86_64.
I expect people commenting on assembly to understand (just a little bit)
of it and to get the idea of the code I posted first, then eventually be
able to come up with the following x86_64 code (without a fancy SHRX,
i.e. for ALL AMD64 processors, not just those supporting BMI2):

     mov    ecx, edi
     movabs rax, 100002600h       ; rax = (1 << ' ') | (1 << '\r') | (1 << 
'\n') | (1 << '\t')
     shr    rax, cl               ; rax >>= c
     xor    edi, edi
     cmp    ecx, 33               ; CF = c <= ' '
.if 0
     adc    edi, edi              ; edi = (c <= ' ')
.else
     setb   dil
.endif
     and    eax, edi              ; result = rax & (c <= ' ')
     ret


If someone (or GCC) but really insists to (ab)use SHRX:

     xor    ecx, ecx
     cmp    dil, 33               ; is !(c > 32)
     setb   cl
     movabs rax, 4294977024       ; rax = 0x100002600
     shrx   rax, rax, rdi         ; rax >>= c
     and    eax, ecx              ; result = rax & (c <= ' ')
     ret

Now start counting: first the instructions, then the cycles lost
due to wrong branch prediction (hint: 0 for my code).

> and cannot speculate past the setnz

OUCH!
Unfortunately, AMD64 and i386 processors are not THAT retarded.

> (As I understand it, x86_64 speculates branch instructions, but
> doesn't speculate cmov -- so perversely branches are faster!)

There's no CMOV in my code.
What are you trying to demonstrate?

JFTR: how much speculative executed instructions are discarded if
      the branch prediction is wrong?

>> The following equivalent and branchless code works on i386 too,
>> it needs neither an AMD64 processor nor the SHRX instruction,
>> which is not available on older processors:
> 
>> 
>>       mov    ecx, edi
>>       mov    eax, 2600h            ; eax = (1 << '\r') | (1 << '\n') | (1 << 
>> '\t')
>>       test   cl, cl
>>       setnz  al                    ; eax |= (c != '\0')
>>       shr    eax, cl               ; eax >>= (c % ' ')
> 
> ^^ operand type mismatch on this instruction

I recommend to fix YOUR typo!
I bet you wrote "shr eax, c" instead of "shr eax, cl" there.

>>       xor    edx, edx
>>       cmp    ecx, 33               ; CF = c <= ' '
>>       adc    edx, edx              ; edx = (c <= ' ')
>>       and    eax, edx
>>       ret

Stefan

Reply via email to