"Richard Biener" <richard.guent...@gmail.com> wrote:

> On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak <stefan.kant...@nexgo.de> 
> wrote:
>>
>> "Allan Sandfeld Jensen" <li...@carewolf.com> wrote:
>>
>> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
>> >> Hi @ll,
>> >>
>> >> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
>> >> Matt Godbolt used the function
>> >>
>> >> | bool isWhitespace(char c)
>> >> | {
>> >> |
>> >> |     return c == ' '
>> >> |
>> >> |       || c == '\r'
>> >> |       || c == '\n'
>> >> |       || c == '\t';
>> >> |
>> >> | }
>> >>
>> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> >>
>> >> processors (see <https://godbolt.org/z/acm19_conds>):
>> >> |    xor    eax, eax              ; result = false
>> >> |    cmp    dil, 32               ; is c > 32
>> >> |    ja     .L4                   ; if so, exit with false
>> >> |    movabs rax, 4294977024       ; rax = 0x100002600
>> >> |    shrx   rax, rax, rdi         ; rax >>= c
>> >> |    and    eax, 1                ; result = rax & 1
>> >> |
>> >> |.L4:
>> >> |    ret

[...]
 
> Whether or not the conditional branch sequence is faster depends on whether
> the branch is well-predicted which very much depends on the data you
> feed the isWhitespace function with

Correct.

> but I guess since this is the c == ' ' test it _will_ be a well-predicted 
> branch

Also correct, but you miss a point: the typical use case is

    while (isWhitespace(*ptr)) ptr++;

> which means the conditional branch sequence will be usually faster.

And this is wrong!
The (well-predicted) branch is usually NOT taken, so both code variants
usually execute (with one exception the same) 6 or 7 instructions.

> The proposed change turns the control into a data dependence which
> constrains instruction scheduling and retirement.

It doesn't matter: the branch has the same data dependency too!

> Indeed a mispredicted branch will likely be more costly.

And no branch is even better: the branch predictor has a limited capacity,
so every removed branch instruction can help improve its efficiency.

> x86 CPUs do not perform data speculation.

>>          mov     ecx, edi
>>          movabs  rax, 4294977024
>>          shr     rax, cl
>>          xor     edi, edi
>>          cmp     ecx, 33
>>          setb    dil
>>          and     eax, edi

I already presented measured numbers: with random data, the branch-free
code is faster, with ordered data the original code.

Left column 1 billion sequential characters
    for (int i=1000000000; i; --i) ...(i);
right column 1 billion random characters, in cycles per character:

GCC:           2.4    3.4
branch-free:   3.0    2.5

Now perform a linear interpolation and find the break-even point at
p=0.4, with p=0 for ordered data and p=1 for random data, or just use
the average of these numbers: 2.9 cycles vs. 2.75 cycles.
That's small, but measurable!

Stefan

Reply via email to