[Bug tree-optimization/120952] New: GCC fails to optimize division to shift when divisor is non-constant power of two

redi at gcc dot gnu.org via Gcc-bugs Fri, 04 Jul 2025 02:23:21 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120952


            Bug ID: 120952
           Summary: GCC fails to optimize division to shift when divisor
                    is non-constant power of two
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redi at gcc dot gnu.org
  Target Milestone: ---

Given std::bit_ceil (which returns the smallest power of two not less than its
argument), I hoped the following would be optimized to use a shift:

#include <bit>
using std::size_t;

size_t get_block_size(size_t bytes, size_t alignment)
{
    alignment = std::__bit_ceil(alignment);
    if (!std::__has_single_bit(alignment)) // definitely a power of two now!
        __builtin_unreachable();
    return ((bytes + alignment - 1) / alignment) * alignment;
}

on x86_64 GCC optimizes this to:

"get_block_size(unsigned long, unsigned long)":
        mov     rax, rdi
        mov     edi, 1
        cmp     rsi, 1
        jbe     .L2
        sub     rsi, 1
        bsr     rsi, rsi
        lea     ecx, [rsi+1]
        sal     rdi, cl
.L2:
        lea     rcx, [rdi-1+rax]
        xor     edx, edx
        mov     rax, rcx
        div     rdi
        mov     rax, rcx
        sub     rax, rdx
        ret

Whereas clang uses a shift:

get_block_size(unsigned long, unsigned long):
        mov     eax, 1
        cmp     rsi, 2
        jb      .LBB0_2
        dec     rsi
        mov     ecx, 127
        bsr     rcx, rsi
        xor     ecx, 63
        neg     cl
        shl     rax, cl
.LBB0_2:
        lea     rcx, [rdi + rax]
        dec     rcx
        neg     rax
        and     rax, rcx
        ret


If I remove the bit_ceil and just tell the compiler it's already a power of two
we avoid the branch but GCC still uses a shift:

using size_t = decltype(sizeof(0));

size_t get_block_size(size_t bytes, size_t alignment)
{
    if (alignment & (alignment - 1))
        __builtin_unreachable();
    return ((bytes + alignment - 1) / alignment) * alignment;
}


GCC still uses DIV:

"get_block_size(unsigned long, unsigned long)":
        lea     rax, [rsi-1+rdi]
        xor     edx, edx
        div     rsi
        lea     rax, [rsi-1+rdi]
        sub     rax, rdx
        ret

And Clang still shifts:

get_block_size(unsigned long, unsigned long):
        lea     rax, [rdi + rsi]
        dec     rax
        rep       bsf rcx, rsi
        shr     rax, cl
        imul    rax, rsi
        ret

On IRC Andrew P said that we don't track pop count on ssa names.

[Bug tree-optimization/120952] New: GCC fails to optimize division to shift when divisor is non-constant power of two

Reply via email to