https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120952
Bug ID: 120952
Summary: GCC fails to optimize division to shift when divisor
is non-constant power of two
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: redi at gcc dot gnu.org
Target Milestone: ---
Given std::bit_ceil (which returns the smallest power of two not less than its
argument), I hoped the following would be optimized to use a shift:
#include <bit>
using std::size_t;
size_t get_block_size(size_t bytes, size_t alignment)
{
alignment = std::__bit_ceil(alignment);
if (!std::__has_single_bit(alignment)) // definitely a power of two now!
__builtin_unreachable();
return ((bytes + alignment - 1) / alignment) * alignment;
}
on x86_64 GCC optimizes this to:
"get_block_size(unsigned long, unsigned long)":
mov rax, rdi
mov edi, 1
cmp rsi, 1
jbe .L2
sub rsi, 1
bsr rsi, rsi
lea ecx, [rsi+1]
sal rdi, cl
.L2:
lea rcx, [rdi-1+rax]
xor edx, edx
mov rax, rcx
div rdi
mov rax, rcx
sub rax, rdx
ret
Whereas clang uses a shift:
get_block_size(unsigned long, unsigned long):
mov eax, 1
cmp rsi, 2
jb .LBB0_2
dec rsi
mov ecx, 127
bsr rcx, rsi
xor ecx, 63
neg cl
shl rax, cl
.LBB0_2:
lea rcx, [rdi + rax]
dec rcx
neg rax
and rax, rcx
ret
If I remove the bit_ceil and just tell the compiler it's already a power of two
we avoid the branch but GCC still uses a shift:
using size_t = decltype(sizeof(0));
size_t get_block_size(size_t bytes, size_t alignment)
{
if (alignment & (alignment - 1))
__builtin_unreachable();
return ((bytes + alignment - 1) / alignment) * alignment;
}
GCC still uses DIV:
"get_block_size(unsigned long, unsigned long)":
lea rax, [rsi-1+rdi]
xor edx, edx
div rsi
lea rax, [rsi-1+rdi]
sub rax, rdx
ret
And Clang still shifts:
get_block_size(unsigned long, unsigned long):
lea rax, [rdi + rsi]
dec rax
rep bsf rcx, rsi
shr rax, cl
imul rax, rsi
ret
On IRC Andrew P said that we don't track pop count on ssa names.