https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120020

            Bug ID: 120020
           Summary: [Optimization opportunity] Optimize repeated inline
                    assembly operations when semantically equivalent
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: a1343922569 at outlook dot com
  Target Milestone: ---

I would like to propose an optimization enhancement regarding inline assembly
operations in C++, specifically for cases where multiple non-volatile asm
statements perform similar or nearly identical operations.


Description:
Currently, when a function calls multiple other functions containing similar
inline assembly operations (e.g., one retrieving the quotient and another
retrieving the remainder from a division operation), GCC generates code that
repeats the same assembly instruction multiple times, even though a single
execution could provide all needed results.


Environment Information:
GCC version: GCC 16.0.0
System type: x86_64-linux-gnu


Complete Command Line:
g++ -O2 -S test.cpp


Reproduction C++ code (64-bit variable):
#include <cstdint>
using U = uint64_t;
struct A { U quo; U rem; };
U myDiv(U const& a, U const& b)
{
    U result, dummyDividendHigh;
    asm (
     "xor %[dividendHigh], %[dividendHigh]  \n\t"
     "divq %[divisor]  \n\t"
     : [quotient] "=a"(result), [dividendHigh] "=d" (dummyDividendHigh)
     : [dividendLow] "a"(a), [divisor] "rm"(b)
    );
    return result;
}
U myMod(U const& a, U const& b)
{
    U result;
    asm (
     "xor %[dividendHigh], %[dividendHigh]  \n\t"
     "divq %[divisor]  \n\t"
     : [dividendHigh] "=d" (result)
     : [dividendLow] "a"(a), [divisor] "rm"(b)
    );
    return result;
}
A myDivMod1(U const& a, U const& b)
{
    A result;
    result.quo = myDiv(a, b);
    result.rem = myMod(a, b);
    return result;
}
A myDivMod2(U const& a, U const& b)
{
    A result;
    asm (
     "xor %[dividendHighOrRemainder], %[dividendHighOrRemainder]  \n\t"
     "divq %[divisor]  \n\t"
     : [quotient] "=a"(result.quo), [dividendHighOrRemainder] "=d" (result.rem)
     : [dividendLow] "a"(a), [divisor] "rm"(b)
    );
    return result;
}


Current generated assembly (64-bit variable):
myDiv(unsigned long const&, unsigned long const&):
    mov     rax, qword ptr [rdi]
    mov     rcx, qword ptr [rsi]
    mov     qword ptr [rsp - 8], rcx
    xor     rdx, rdx
    div     qword ptr [rsp - 8]
    ret
myMod(unsigned long const&, unsigned long const&):
    mov     rax, qword ptr [rdi]
    mov     rcx, qword ptr [rsi]
    mov     qword ptr [rsp - 8], rcx
    xor     rdx, rdx
    div     qword ptr [rsp - 8]
    mov     rax, rdx
    ret
myDivMod1(unsigned long const&, unsigned long const&):
    mov     rcx, qword ptr [rdi]
    mov     rdi, qword ptr [rsi]
    mov     qword ptr [rsp - 8], rdi
    mov     rax, rcx
    xor     rdx, rdx
    div     qword ptr [rsp - 8]
    mov     rsi, rax
    mov     qword ptr [rsp - 16], rdi
    mov     rax, rcx
    xor     rdx, rdx
    div     qword ptr [rsp - 16]
    mov     rax, rsi
    ret
myDivMod2(unsigned long const&, unsigned long const&):
    mov     rax, qword ptr [rdi]
    mov     rcx, qword ptr [rsi]
    mov     qword ptr [rsp - 8], rcx
    xor     rdx, rdx
    div     qword ptr [rsp - 8]
    ret


Godbolt link (GCC, generate suboptimal assembly code for myDivMod1, which is
not the same as myDivMod2)
64-bit variable: https://gcc.godbolt.org/z/7j9MsKsMr
32-bit variable: https://gcc.godbolt.org/z/EEKzWWx14
16-bit variable: https://gcc.godbolt.org/z/5j9zoxrro
8-bit variable: https://gcc.godbolt.org/z/3sWh9K6r5


Godbolt link (Clang, generate optimal assembly code for myDivMod1, which is the
same as myDivMod2)
64-bit variable: https://gcc.godbolt.org/z/cneMq3eYT
32-bit variable: https://gcc.godbolt.org/z/E1e9x65hj
16-bit variable: https://gcc.godbolt.org/z/3aq9dsa3M
8-bit variable: https://gcc.godbolt.org/z/8EG8TEzGq


Optimization suggestion
I suggest enhancing GCC to recognize situations where multiple non-volatile
inline assembly blocks across function calls share identical or highly similar
operations, and optimize them by merging the operations when semantically safe.
For the example above, the assembly code generated for myDivMod1 should be no
more complex than that of myDivMod2 and contain no more than one div
instruction.


Theoretical basis
According to GCC's inline assembly documentation, non-volatile asm statements
can be reordered and optimized by the compiler. In this case, both myDiv and
myMod are essentially performing the same div operation but extracting
different outputs (quotient vs. remainder). Since the x86 div instruction
naturally produces both results simultaneously, executing it twice is
redundant.
Another strong reason to support this optimization suggestion is that, Clang
generates identical code for both myDivMod1 and myDivMod2 (see the godbolt link
above for details), which proves that Clang has effectively optimized the case
in myDivMod1, but GCC has not.


Broader application
This optimization pattern would benefit not just this specific case but any
scenario where:
1. Multiple non-volatile inline assembly blocks across function calls contain
similar operations.
2. These operations could be merged without changing program semantics.
3. The reuse of intermediate results would reduce instruction count.
Similar to function inlining optimizations, the compiler could analyze the
context of inline assembly calls and merge redundant operations where safe.


Expected benefit
Such optimization would result in more efficient code generation, particularly
important for performance-critical applications using inline assembly for
specialized operations like cryptography, multimedia processing, or low-level
system operations.


I appreciate your consideration of this enhancement request and am happy to
provide any additional information if needed.
Sincerely.

Reply via email to