https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115049
Bug ID: 115049
Summary: Silent severe miscompilation around inline functions
Product: gcc
Version: 14.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: manx-bugzilla at problemloesungsmaschine dot de
Target Milestone: ---

Created attachment 58182
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58182&action=edit
all mentioned attachments

Hello GCC team!

I am seeing severe miscompilation around inline functions with different results across different translation units, ultimately leading to application crashes.

I am running
```
$ x86_64-w64-mingw32-g++ --version
x86_64-w64-mingw32-g++.exe (Rev2, Built by MSYS2 project) 14.1.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
as shipped by MSYS2, and testing with the MINGW64 toolchain targeting amd64. I can also reproduce with UCRT64 amd64, but so far not with MINGW32 targeting x86. This is reproducible locally and on GitHub CI, so no system-specific problem.

To reproduce, we need 2 translation units:

file1.cpp:
```
#include <random>
#include <cstdint>

void Trigger1();

int main( int /*argc*/ , char * /*argv*/ [] ) {
    Trigger1();
    return 0;
}

static void Trigger2() {
    std::ranlux48 prng{1};
    std::uint16_t Data1 = std::uniform_int_distribution<std::uint16_t>{}(prng);
    std::uint64_t Data2 = std::uniform_int_distribution<std::uint64_t>{}(prng);
    static_cast<void>(Data1);
    static_cast<void>(Data2);
}

using my_dummy_function = void (*)();

struct myDummy3 {
    my_dummy_function func{nullptr};
    myDummy3(my_dummy_function f) : func(f) { }
};

myDummy3 dummy3{ &Trigger2 };
```

file2.cpp:
```
#include <algorithm>
#include <memory>
#include <random>
#include <vector>
#include <cstddef>
#include <cstdint>
#include <stdio.h>

template <typename T, std::size_t required_entropy_bits>
inline T my_random(std::ranlux48 & rng) {
    using unsigned_T = T;
    const unsigned int rng_bits = 48;
    unsigned_T result = 0;
    for (std::size_t entropy = 0; entropy < std::min(required_entropy_bits, sizeof(T) * 8); entropy += rng_bits) {
        if constexpr (rng_bits < (sizeof(T) * 8)) {
            unsigned int shift_bits = rng_bits % (sizeof(T) * 8);
            //fflush(stdout); // <--- no crash
            result = (result << shift_bits) ^ static_cast<unsigned_T>(rng());
        } else {
            result = static_cast<unsigned_T>(rng());
        }
    }
    if constexpr (required_entropy_bits >= (sizeof(T) * 8)) {
        return static_cast<T>(result);
    } else {
        return static_cast<T>(result & ((static_cast<unsigned_T>(1) << required_entropy_bits) - static_cast<unsigned_T>(1)));
    }
}

class myDummy1 {
private:
    std::ranlux48 m_PRNG;
    std::uint16_t m_Trigger1a;
public:
    myDummy1(std::unique_ptr<std::mt19937> & rd)
        : m_PRNG(std::uniform_int_distribution<unsigned int>{}(*rd))
        , m_Trigger1a(m_PRNG())
    {
    }
};

void Trigger1();

void Trigger1() {
    {
        std::unique_ptr<std::mt19937> rd = std::make_unique<std::mt19937>(1);
        myDummy1 dummy1(rd);
    }
    {
        std::ranlux48 * prng = new std::ranlux48(1);
        const unsigned int trigger1b_size = 32;
        std::vector<std::uint8_t> trigger1b(trigger1b_size, 0);
        for (unsigned int i = 0; i < trigger1b_size; i++) {
            trigger1b[i] = my_random<std::uint8_t, 8>(*prng);
        }
        delete prng;
    }
    {
        std::ranlux48 prng{1};
        printf("%llu\n", static_cast<unsigned long long>(my_random<std::uint64_t, 53>(prng)));
        fflush(stdout);
        printf("%llu\n", static_cast<unsigned long long>(my_random<std::uint64_t, 53>(prng)));
        fflush(stdout);
        printf("%llu\n", static_cast<unsigned long long>(my_random<std::uint64_t, 53>(prng)));
        fflush(stdout); // <--- crash
        printf("%llu\n", static_cast<unsigned long long>(my_random<std::uint64_t, 53>(prng)));
        fflush(stdout);
    }
}
```

and the compilation script:

compile.sh:
```
#!/usr/bin/env bash
x86_64-w64-mingw32-g++ -std=c++20 -fexceptions -frtti -mthreads -O2 -Wall -Wextra -Wpedantic -DNOMINMAX -c file1.cpp -o file1.o
x86_64-w64-mingw32-g++ -std=c++20 -fexceptions -frtti -mthreads -O2 -Wall -Wextra -Wpedantic -DNOMINMAX -c file2.cpp -o file2.o
x86_64-w64-mingw32-g++ -std=c++20 -fexceptions -frtti -mthreads -O2 -Wall -Wextra -Wpedantic -DNOMINMAX -c file2-working.cpp -o file2-working.o
x86_64-w64-mingw32-g++ -std=c++20 -fexceptions -frtti -mthreads -O2 -Wall -Wextra -Wpedantic -DNOMINMAX file1.o file2.o -latomic -lm -o test-broken.exe
x86_64-w64-mingw32-g++ -std=c++20 -fexceptions -frtti -mthreads -O2 -Wall -Wextra -Wpedantic -DNOMINMAX file1.o file2-working.o -latomic -lm -o test-working.exe
cp file1.o file1-trimmed.o
objcopy --remove-section '.text$_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv' file1-trimmed.o
objcopy --remove-section '.xdata$_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv' file1-trimmed.o
objcopy --remove-section '.pdata$_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv' file1-trimmed.o
x86_64-w64-mingw32-g++ -std=c++20 -fexceptions -frtti -mthreads -O2 -Wall -Wextra -Wpedantic -DNOMINMAX file1-trimmed.o file2.o -latomic -lm -o test-trimmed.exe
objdump -M intel --disassemble=_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv file1.o > file1.asm
objdump -M intel --disassemble=_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv file2.o > file2.asm
objdump -M intel --disassemble=_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv file2-working.o > file2-working.asm
objdump -M intel --no-addresses --disassemble=_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv test-broken.exe > test-broken.asm
objdump -M intel --no-addresses --disassemble=_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv test-trimmed.exe > test-trimmed.asm
objdump -M intel --no-addresses --disassemble=_ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv test-working.exe > test-working.asm
objdump -D test-broken.exe > full-test-broken.asm
objdump -D test-trimmed.exe > full-test-trimmed.asm
objdump -D test-working.exe > full-test-working.asm
```

Now, when I run test-*.exe, I get:
```
$ ./test-broken.exe
3578273826077799
7151956549303778
Segmentation fault
```
```
$ ./test-trimmed.exe
3578273826077799
7151956549303778
8042221890545872
8274253793764351
```
```
$ ./test-working.exe
3578273826077799
7151956549303778
8042221890545872
8274253793764351
```

The crash happens in `std::discard_block_engine<std::subtract_with_carry_engine<unsigned long long, 48ull, 5ull, 12ull>, 389ull, 11ull>::operator()() ()`.

file1.asm, file2.asm, file2-working.asm, test-broken.asm, test-trimmed.asm, test-working.asm, full-test-broken.asm, full-test-trimmed.asm, full-test-working.asm, as well as the files quoted above are all attached.

The code in file1.cpp:Trigger2() is never called. It however causes a template instantiation of _ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv.
When I remove this particular function from the object file before linking (see compile.sh, test-trimmed.exe), the linker will choose the version that got generated in file2.o. The generated function in file2.o and file2-working.o is identical. It is different from the version generated in file1.o, but neither is necessarily broken (it looks like the version in file2.o has a tail call to std::subtract_with_carry_engine<unsigned long long, 48ull, 5ull, 12ull>::operator()() instead of inlining it).

So, when compiling file2.cpp, GCC has chosen a particular implementation of _ZNSt20discard_block_engineISt26subtract_with_carry_engineIyLy48ELy5ELy12EELy389ELy11EEclEv to attach the symbol to, *and* the calling code inside file2.cpp assumes that it would be calling this particular copy, and only works in that case.

Uncommenting the line marked with "no crash" in file2.cpp (see file2-working.cpp) makes the problem go away even though the generated function itself does not change at all.

I have checked for Undefined Behaviour with ubsan and asan in the test case with GCC 14 on godbolt (https://godbolt.org/z/PMExccxbv), and it comes out clean. Reading the source of std::ranlux48 in libstdc++ does not show any obvious problem to me either.

If I had to guess: to me, this looks like GCC has maybe dragged some context of a particular instantiation of a callee out into the caller, which would be an invalid assumption/optimization unless it is guaranteed that this particular instantiation gets called. That would be VERY BAD and could possibly break all C++ code at random.

So far I can only trigger the problem with -O2 and only on MSYS2 MINGW64/UCRT64 amd64; -O1 and -O3 work fine for me in this particular case. Note that I have not been able to reproduce the problem on Linux (yet); however, given the nature of the symptom, I somewhat doubt that other platforms would not be affected. GCC might make different optimization choices here due to ABI differences.
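
For cross-checking the expected output independently of which copy of the function the linker keeps, the same sequence can be computed in a single translation unit. The following is only a minimal sketch: it assumes (untested on the affected toolchain) that a one-file build does not hit the problem, `my_random_u64_53` is `my_random` from file2.cpp written out for the `std::uint64_t`, 53-bit case, and `first_four` is a hypothetical helper name that does not appear in the attachments.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// Same arithmetic as my_random<std::uint64_t, 53> in file2.cpp:
// two 48-bit draws are combined by shift and xor, then masked to 53 bits.
inline std::uint64_t my_random_u64_53(std::ranlux48 & rng) {
    const unsigned int rng_bits = 48;
    const std::size_t required_entropy_bits = 53;
    std::uint64_t result = 0;
    for (std::size_t entropy = 0; entropy < required_entropy_bits; entropy += rng_bits) {
        result = (result << rng_bits) ^ static_cast<std::uint64_t>(rng());
    }
    return result & ((std::uint64_t{1} << required_entropy_bits) - std::uint64_t{1});
}

// Hypothetical helper (not in the attachments): the first four values
// produced by a generator seeded with 1, as in the last block of Trigger1().
inline std::array<std::uint64_t, 4> first_four() {
    std::ranlux48 prng{1};
    std::array<std::uint64_t, 4> out{};
    for (std::uint64_t & value : out) {
        value = my_random_u64_53(prng);
    }
    return out;
}
```

Since std::ranlux48 is fully specified by the standard, the four values returned by `first_four()` should match the four lines printed by test-working.exe and test-trimmed.exe above on any correct build.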

We triggered this problem in real-world code in libopenmpt (https://lib.openmpt.org/, https://github.com/OpenMPT/openmpt/) when our unit tests (about 3000 individual tests) suddenly started crashing on MSYS2 (after their GCC 14 update) at one particular test case. Narrowing it down to a somewhat grokkable test case took about 20 hours and 3 people.

As a countermeasure, we for now had to force GCC 14 down to -O1 (and never use -O2 or -O3) for our next release. It is not clear to us whether this drastic work-around is even sufficient (we do have functions marked with __attribute__((always_inline))). Is there a less drastic and maybe more specific work-around available?

Cheers, and have a nice day debugging this! :)
Jörn