On Mon, Aug 24, 2020 at 1:22 PM Stefan Kanthak <stefan.kant...@nexgo.de> wrote: > > "Richard Biener" <richard.guent...@gmail.com> wrote: > > > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak <stefan.kant...@nexgo.de> > > wrote: > >> > >> "Allan Sandfeld Jensen" <li...@carewolf.com> wrote: > >> > >> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote: > >> >> Hi @ll, > >> >> > >> >> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>, > >> >> Matt Godbolt used the function > >> >> > >> >> | bool isWhitespace(char c) > >> >> | { > >> >> | > >> >> | return c == ' ' > >> >> | > >> >> | || c == '\r' > >> >> | || c == '\n' > >> >> | || c == '\t'; > >> >> | > >> >> | } > >> >> > >> >> as an example, for which GCC 9.1 emits the following assembly for AMD64 > >> >> > >> >> processors (see <https://godbolt.org/z/acm19_conds>): > >> >> | xor eax, eax ; result = false > >> >> | cmp dil, 32 ; is c > 32 > >> >> | ja .L4 ; if so, exit with false > >> >> | movabs rax, 4294977024 ; rax = 0x100002600 > >> >> | shrx rax, rax, rdi ; rax >>= c > >> >> | and eax, 1 ; result = rax & 1 > >> >> | > >> >> |.L4: > >> >> | ret > > [...] > > > Whether or not the conditional branch sequence is faster depends on whether > > the branch is well-predicted which very much depends on the data you > > feed the isWhitespace function with > > Correct. > > > but I guess since this is the c == ' ' test it _will_ be a well-predicted > > branch > > Also correct, but you miss a point: the typical use case is > > while (isWhitespace(*ptr)) ptr++; > > > which means the conditional branch sequence will be usually faster. > > And this is wrong! > The (well-predicted) branch is usually NOT taken, so both code variants > usually execute (with one exception the same) 6 or 7 instructions.
Whether or not the branch is predicted taken does not matter, what matters is that the continuation is not data dependent on the branch target computation and thus can execute in parallel to it. > > The proposed change turns the control into a data dependence which > > constrains instruction scheduling and retirement. > > It doesn't matter: the branch has the same data dependency too! > > > Indeed a mispredicted branch will likely be more costly. > > And no branch is even better: the branch predictor has a limited capacity, > so every removed branch instruction can help improve its efficiency. > > > x86 CPUs do not perform data speculation. > > >> mov ecx, edi > >> movabs rax, 4294977024 > >> shr rax, cl > >> xor edi, edi > >> cmp ecx, 33 > >> setb dil > >> and eax, edi > > I already presented measured numbers: with random data, the branch-free > code is faster, with ordered data the original code. > > Left column 1 billion sequential characters > for (int i=1000000000; i; --i) ...(i); > right column 1 billion random characters, in cycles per character: I guess feeding it Real Text (TM) is the only relevant benchmark, doing sth like for (;;) cnt[isWhitespace(*ptr++)]++; > GCC: 2.4 3.4 > branch-free: 3.0 2.5 I'd call that unconclusive data - you also failed to show your test data is somehow relevant. We do know that mispredicted branches are bad. You show well-predicted branches are good. By simple statistics singling out 4 out of 255 values will make the branches well-predicted. > Now perform a linear interpolation and find the break-even point at > p=0.4, with p=0 for ordered data and p=1 for random data, or just use > the average of these numbers: 2.9 cycles vs. 2.75 cycles. > That's small, but measurable!