https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165
Bug ID: 93165 Summary: avoidable 2x penalty on unpredicted overwrite Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ncm at cantrip dot org Target Milestone: --- === Abstract: In important cases where using a "conditional move" instruction would provide a 2x performance improvement, Gcc fails to generate the instruction. In testing, a random field is sorted by the simplest possible Quicksort algorithm, varying only the method for conditionally swapping values, with widely varying runtimes, many substantially faster than `std::sort`. The 2x result is obtainable only with Clang, which for amd64 generates cmov instructions. A simple testing apparatus may be found at https://github.com/ncm/sortfast/ that depends only on shell tools. === Details: The basic sort is: ``` int* partition(int* begin, int* end) { int pivot = end[-1]; int* left = begin; for (int* right = begin; right < end - 1; ++right) { if (*right <= pivot) { std::swap(*left, *right); ++left; } } int tmp = *left; *left = end[-1], end[-1] = tmp; return left; } void quicksort(int* begin, int* end) { while (end - begin > 1) { int* mid = partition(begin, end); quicksort(begin, mid); begin = mid + 1; } } ``` which runs about as fast as `std::sort`, on truly random input. Replacing the body of the loop in `partition()` above with ``` left += swap_if(*right <= pivot, *left, *right); ``` where `swap_if` is defined ``` inline bool swap_if(bool c, int& a, int& b) { int ta = a, mask = -c; a = (b & mask) | (ta & ~mask); b = (ta & mask) | (b & ~mask); return c; } ``` the sort is substantially faster than `std::sort` compiled with Gcc. Compiled with Clang 8 or later, it runs fully 2x as fast as `std::sort`. Clang recognizes the pattern and substitutes `cmov` instructions. (Clang also unrolls the loop, which helps a little.) Another formulation, ``` inline bool swap_if(bool c, int& a, int& b) { int ta = a, tb = b; a = c ? tb : ta; b = c ? ta : tb; return c; } ``` also results in the same object code, with Clang, but is 2x slower compiled with Gcc, which generates a branch. A third formulation, ``` inline bool swap_if(bool c, int& a, int& b) { int v[2] = { a, b }; b = v[1-c], a = v[c]; return c; } ``` is much faster than `std::sort` compiled by both Gcc and Clang, but detours values through L1 cache, at some cost, so is slower than the `cmov` version. A fourth version, ``` inline bool swap_if(bool c, int& a, int& b) { int v[2] = { a, b }; a = v[c], b = v[!c]; return c; } ``` is about the same speed as the third when built with Clang, but with Gcc is quite a lot slower than `std::sort`. Order matters, somehow, as does the choice of operator. (I don't know how to express a bug report for this last case. Advice welcome.) In ``` inline bool swap_if(bool c, int& a, int& b) { int ta = a, tb = b; a = c ? tb : ta; // b = c ? ta : tb; return c; } ``` (at least when it is not inlined) Gcc seems happy to generate the `cmov` instruction. Apparently the optimization code is very jealous about what else is allowed in the basic block where a `cmov` is considered. Finally, even with ``` inline bool swap_if(bool c, int& a, int& b) { int ta = a, tb = b; a = __builtin_expect_with_probability(c, 0, 0.5) ? tb : ta; b = __builtin_expect_with_probability(c, 0, 0.5) ? ta : tb; return c; } ``` Gcc still will not generate the `cmov` instructions. === Discussion: Replacing a branch with `cmov` may result in slower code, particularly on older CPU targets. However, when the programmer provides direct information that the branch is unpredictable, it seems like the compiler should be willing to act on that expectation. In that light, Clang's conversion of "`(mask & a)|(~mask & b)`" to `cmov` seems to be universally correct. It is a portable formula that gives better results than the branching version even without specific optimization, but may easily be rewritten using `cmov` for even better results. In addition, when `__builtin_expect_with_probability` is used to indicate unpredictability, there seems to be no defensible reason not to rewrite the expression to use `cmov`. Finally, the indexed form of `swap_if` may be recognized and turned into `cmov` instructions without worry that a predictable branch has been replaced, avoiding the unnecessary detour through memory.