https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101973
Bug ID: 101973 Summary: subtraction of clz is not optimized Product: gcc Version: 11.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: sven.koehler at gmail dot com Target Milestone: --- On Intel x86_64, the generated code for __builtin_clz(x) looks something like this: clz(x) = 63 - bsr(x) Since Intel does not seem to have a way to do 63-y in a single instruction, XOR is used instead and the actual assembly code corresponds to clz(x) = 63 ^ bsr(x). Since bsr(x) is in the range 0 to 63, the XOR with 63 is equivalent to 63-y. However, when we actually need the index of the most significant non-zero bit, we have another 63-y, as in this function: int bsr(unsigned long x) { return sizeof(x)*8 - 1 - __builtin_clzl(x); } With -O3, GCC emits the following assembly code: bsr: bsr rdi, rdi mov eax, 63 xor rdi, 63 sub eax, edi ret The XOR with 63 and the subtraction from 63 cancel each other out in this special case. LLVM/clang performs this optimization. One might also consider the arbitrary case of z-clz(x) as a test case. On Intel, this is equivalent to bsr(x)+(z-63).