On Mon, Aug 24, 2020 at 1:22 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote:
>
> "Richard Biener" <richard.guent...@gmail.com> wrote:
>
> > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak <stefan.kant...@nexgo.de> 
> > wrote:
> >>
> >> "Allan Sandfeld Jensen" <li...@carewolf.com> wrote:
> >>
> >> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> >> >> Hi @ll,
> >> >>
> >> >> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
> >> >> Matt Godbolt used the function
> >> >>
> >> >> | bool isWhitespace(char c)
> >> >> | {
> >> >> |
> >> >> |     return c == ' '
> >> >> |
> >> >> |       || c == '\r'
> >> >> |       || c == '\n'
> >> >> |       || c == '\t';
> >> >> |
> >> >> | }
> >> >>
> >> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
> >> >>
> >> >> processors (see <https://godbolt.org/z/acm19_conds>):
> >> >> |    xor    eax, eax              ; result = false
> >> >> |    cmp    dil, 32               ; is c > 32
> >> >> |    ja     .L4                   ; if so, exit with false
> >> >> |    movabs rax, 4294977024       ; rax = 0x100002600
> >> >> |    shrx   rax, rax, rdi         ; rax >>= c
> >> >> |    and    eax, 1                ; result = rax & 1
> >> >> |
> >> >> |.L4:
> >> >> |    ret
>
> [...]
>
> > Whether or not the conditional branch sequence is faster depends on whether
> > the branch is well-predicted which very much depends on the data you
> > feed the isWhitespace function with
>
> Correct.
>
> > but I guess since this is the c == ' ' test it _will_ be a well-predicted 
> > branch
>
> Also correct, but you miss a point: the typical use case is
>
>     while (isWhitespace(*ptr)) ptr++;
>
> > which means the conditional branch sequence will be usually faster.
>
> And this is wrong!
> The (well-predicted) branch is usually NOT taken, so both code variants
> usually execute (with one exception the same) 6 or 7 instructions.
Whether or not the branch is predicted taken does not matter, what
matters is that the continuation is not data dependent on the branch
target computation and thus can execute in parallel to it.

> > The proposed change turns the control into a data dependence which
> > constrains instruction scheduling and retirement.
>
> It doesn't matter: the branch has the same data dependency too!
>
> > Indeed a mispredicted branch will likely be more costly.
>
> And no branch is even better: the branch predictor has a limited capacity,
> so every removed branch instruction can help improve its efficiency.
>
> > x86 CPUs do not perform data speculation.
>
> >>          mov     ecx, edi
> >>          movabs  rax, 4294977024
> >>          shr     rax, cl
> >>          xor     edi, edi
> >>          cmp     ecx, 33
> >>          setb    dil
> >>          and     eax, edi
>
> I already presented measured numbers: with random data, the branch-free
> code is faster, with ordered data the original code.
>
> Left column 1 billion sequential characters
>     for (int i=1000000000; i; --i) ...(i);
> right column 1 billion random characters, in cycles per character:

I guess feeding it Real Text (TM) is the only relevant benchmark,
doing sth like

  for (;;)
     cnt[isWhitespace(*ptr++)]++;

> GCC:           2.4    3.4
> branch-free:   3.0    2.5

I'd call that unconclusive data - you also failed to show your test data
is somehow relevant.  We do know that mispredicted branches are bad.
You show well-predicted branches are good.  By simple statistics
singling out 4 out of 255 values will make the branches well-predicted.

> Now perform a linear interpolation and find the break-even point at
> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
> That's small, but measurable!

Reply via email to