https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

            Bug ID: 113682
           Summary: Branches in branchless binary search rather than
                    cmov/csel/csinc
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

I've been trying to eliminate unpredictable branches in a hot function where
perf counters show a high mispredict percentage. Unfortunately, I haven't been
able to find an incantation to get gcc to generate branchless code other than
inline asm, which I'd rather avoid. In this case I've even laid out the
critical lines so that they exactly match the behavior of the csinc and csel
instructions on arm64, but they are still not used.


Somewhat minimized repro:

typedef unsigned long size_t;
struct ITEM {char* addr; size_t len;};
int cmp(ITEM* user, ITEM* tree);

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
    auto pos = tree;
    size_t low = 0;
    size_t high = treeSize;
    while (low < high) {
        size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);

        // These should be cmp + csinc + csel on arm
        // and lea + test + cmov + cmov on x86.
        low = res > 0 ? i + 1 : low; // csinc
        high = res < 0 ? i: high; // csel

        if (res == 0) return i;
    }
    return -1;
}


On arm64 that generates a conditional branch on res > 0:
        bl      cmp(ITEM*, ITEM*)
        cmp     w0, 0
        bgt     .L15 // does low = i + 1 then loops
        mov     x20, x19
        bne     .L4 // loop


On x86_64 it does similar:
        call    cmp(ITEM*, ITEM*)
        test    eax, eax
        jg      .L16 
        jne     .L17


Note that clang produces the desired codegen for both:

arm:
        bl      cmp(ITEM*, ITEM*)
        cmp     w0, #0
        csinc   x23, x23, x22, le
        csel    x19, x22, x19, lt
        cbnz    w0, .LBB0_1

x86:
        call    cmp(ITEM*, ITEM*)@PLT
        lea     rcx, [r12 + 1]
        test    eax, eax
        cmovg   r13, rcx
        cmovs   rbx, r12
        jne     .LBB0_1

(full output for all 4 available at https://www.godbolt.org/z/aWrKbYPTG. Code
snippets from trunk, but also repos on 13.2)

While ideally gcc would generate the branchless output for the supplied code,
if there is some (reasonable) incantation that would cause it to produce
branchless output, I'd be happy to have that too.

Reply via email to