https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120020
Bug ID: 120020 Summary: [Optimization opportunity] Optimize repeated inline assembly operations when semantically equivalent Product: gcc Version: 16.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: a1343922569 at outlook dot com Target Milestone: --- I would like to propose an optimization enhancement regarding inline assembly operations in C++, specifically for cases where multiple non-volatile asm statements perform similar or nearly identical operations. Description: Currently, when a function calls multiple other functions containing similar inline assembly operations (e.g., one retrieving the quotient and another retrieving the remainder from a division operation), GCC generates code that repeats the same assembly instruction multiple times, even though a single execution could provide all needed results. Environment Information: GCC version: GCC 16.0.0 System type: x86_64-linux-gnu Complete Command Line: g++ -O2 -S test.cpp Reproduction C++ code (64-bit variable): #include <cstdint> using U = uint64_t; struct A { U quo; U rem; }; U myDiv(U const& a, U const& b) { U result, dummyDividendHigh; asm ( "xor %[dividendHigh], %[dividendHigh] \n\t" "divq %[divisor] \n\t" : [quotient] "=a"(result), [dividendHigh] "=d" (dummyDividendHigh) : [dividendLow] "a"(a), [divisor] "rm"(b) ); return result; } U myMod(U const& a, U const& b) { U result; asm ( "xor %[dividendHigh], %[dividendHigh] \n\t" "divq %[divisor] \n\t" : [dividendHigh] "=d" (result) : [dividendLow] "a"(a), [divisor] "rm"(b) ); return result; } A myDivMod1(U const& a, U const& b) { A result; result.quo = myDiv(a, b); result.rem = myMod(a, b); return result; } A myDivMod2(U const& a, U const& b) { A result; asm ( "xor %[dividendHighOrRemainder], %[dividendHighOrRemainder] \n\t" "divq %[divisor] \n\t" : [quotient] "=a"(result.quo), [dividendHighOrRemainder] "=d" (result.rem) : [dividendLow] "a"(a), [divisor] "rm"(b) ); return result; } Current generated assembly (64-bit variable): myDiv(unsigned long const&, unsigned long const&): mov rax, qword ptr [rdi] mov rcx, qword ptr [rsi] mov qword ptr [rsp - 8], rcx xor rdx, rdx div qword ptr [rsp - 8] ret myMod(unsigned long const&, unsigned long const&): mov rax, qword ptr [rdi] mov rcx, qword ptr [rsi] mov qword ptr [rsp - 8], rcx xor rdx, rdx div qword ptr [rsp - 8] mov rax, rdx ret myDivMod1(unsigned long const&, unsigned long const&): mov rcx, qword ptr [rdi] mov rdi, qword ptr [rsi] mov qword ptr [rsp - 8], rdi mov rax, rcx xor rdx, rdx div qword ptr [rsp - 8] mov rsi, rax mov qword ptr [rsp - 16], rdi mov rax, rcx xor rdx, rdx div qword ptr [rsp - 16] mov rax, rsi ret myDivMod2(unsigned long const&, unsigned long const&): mov rax, qword ptr [rdi] mov rcx, qword ptr [rsi] mov qword ptr [rsp - 8], rcx xor rdx, rdx div qword ptr [rsp - 8] ret Godbolt link (GCC, generate suboptimal assembly code for myDivMod1, which is not the same as myDivMod2) 64-bit variable: https://gcc.godbolt.org/z/7j9MsKsMr 32-bit variable: https://gcc.godbolt.org/z/EEKzWWx14 16-bit variable: https://gcc.godbolt.org/z/5j9zoxrro 8-bit variable: https://gcc.godbolt.org/z/3sWh9K6r5 Godbolt link (Clang, generate optimal assembly code for myDivMod1, which is the same as myDivMod2) 64-bit variable: https://gcc.godbolt.org/z/cneMq3eYT 32-bit variable: https://gcc.godbolt.org/z/E1e9x65hj 16-bit variable: https://gcc.godbolt.org/z/3aq9dsa3M 8-bit variable: https://gcc.godbolt.org/z/8EG8TEzGq Optimization suggestion I suggest enhancing GCC to recognize situations where multiple non-volatile inline assembly blocks across function calls share identical or highly similar operations, and optimize them by merging the operations when semantically safe. For the example above, the assembly code generated for myDivMod1 should be no more complex than that of myDivMod2 and contain no more than one div instruction. Theoretical basis According to GCC's inline assembly documentation, non-volatile asm statements can be reordered and optimized by the compiler. In this case, both myDiv and myMod are essentially performing the same div operation but extracting different outputs (quotient vs. remainder). Since the x86 div instruction naturally produces both results simultaneously, executing it twice is redundant. Another strong reason to support this optimization suggestion is that, Clang generates identical code for both myDivMod1 and myDivMod2 (see the godbolt link above for details), which proves that Clang has effectively optimized the case in myDivMod1, but GCC has not. Broader application This optimization pattern would benefit not just this specific case but any scenario where: 1. Multiple non-volatile inline assembly blocks across function calls contain similar operations. 2. These operations could be merged without changing program semantics. 3. The reuse of intermediate results would reduce instruction count. Similar to function inlining optimizations, the compiler could analyze the context of inline assembly calls and merge redundant operations where safe. Expected benefit Such optimization would result in more efficient code generation, particularly important for performance-critical applications using inline assembly for specialized operations like cryptography, multimedia processing, or low-level system operations. I appreciate your consideration of this enhancement request and am happy to provide any additional information if needed. Sincerely.