"Richard Biener" <richard.guent...@gmail.com> wrote: > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak <stefan.kant...@nexgo.de> > wrote: >> >> "Allan Sandfeld Jensen" <li...@carewolf.com> wrote: >> >> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote: >> >> Hi @ll, >> >> >> >> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>, >> >> Matt Godbolt used the function >> >> >> >> | bool isWhitespace(char c) >> >> | { >> >> | >> >> | return c == ' ' >> >> | >> >> | || c == '\r' >> >> | || c == '\n' >> >> | || c == '\t'; >> >> | >> >> | } >> >> >> >> as an example, for which GCC 9.1 emits the following assembly for AMD64 >> >> >> >> processors (see <https://godbolt.org/z/acm19_conds>): >> >> | xor eax, eax ; result = false >> >> | cmp dil, 32 ; is c > 32 >> >> | ja .L4 ; if so, exit with false >> >> | movabs rax, 4294977024 ; rax = 0x100002600 >> >> | shrx rax, rax, rdi ; rax >>= c >> >> | and eax, 1 ; result = rax & 1 >> >> | >> >> |.L4: >> >> | ret
[...] > Whether or not the conditional branch sequence is faster depends on whether > the branch is well-predicted which very much depends on the data you > feed the isWhitespace function with Correct. > but I guess since this is the c == ' ' test it _will_ be a well-predicted > branch Also correct, but you miss a point: the typical use case is while (isWhitespace(*ptr)) ptr++; > which means the conditional branch sequence will be usually faster. And this is wrong! The (well-predicted) branch is usually NOT taken, so both code variants usually execute (with one exception the same) 6 or 7 instructions. > The proposed change turns the control into a data dependence which > constrains instruction scheduling and retirement. It doesn't matter: the branch has the same data dependency too! > Indeed a mispredicted branch will likely be more costly. And no branch is even better: the branch predictor has a limited capacity, so every removed branch instruction can help improve its efficiency. > x86 CPUs do not perform data speculation. >> mov ecx, edi >> movabs rax, 4294977024 >> shr rax, cl >> xor edi, edi >> cmp ecx, 33 >> setb dil >> and eax, edi I already presented measured numbers: with random data, the branch-free code is faster, with ordered data the original code. Left column 1 billion sequential characters for (int i=1000000000; i; --i) ...(i); right column 1 billion random characters, in cycles per character: GCC: 2.4 3.4 branch-free: 3.0 2.5 Now perform a linear interpolation and find the break-even point at p=0.4, with p=0 for ordered data and p=1 for random data, or just use the average of these numbers: 2.9 cycles vs. 2.75 cycles. That's small, but measurable! Stefan