https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102765
            Bug ID: 102765
           Summary: [11 Regression] GDC11 stopped inlining library functions
                    and lambdas used by a binary search one-liner code
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: d
          Assignee: ibuclaw at gdcproject dot org
          Reporter: siarhei.siamashka at gmail dot com
  Target Milestone: ---

The performance of the following simple binary search code regressed a lot
starting from GDC11:

/*******************************************************/
import std.algorithm, std.range, std.stdio, std.stdint;

// calculate integer square root using binary search
int64_t isqrt(int64_t x)
{
  return iota(0, min(x, 3037000499) + 1)
           .map!(v => (v * v > x))
           .assumeSorted.lowerBound(true)
           .length - 1;
}

// print the sum of 20M square roots
void main()
{
  20000000.iota.map!isqrt.sum.writeln;
}
/*******************************************************/

$ gdc-6.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out
59618479180

real    0m1.924s
user    0m1.924s
sys     0m0.000s

$ gdc-9.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out
59618479180

real    0m2.100s
user    0m2.099s
sys     0m0.000s

$ gdc-10.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out
59618479180

real    0m1.776s
user    0m1.776s
sys     0m0.000s

$ gdc-11.2.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out
59618479180

real    0m6.889s
user    0m6.887s
sys     0m0.000s

My expectation is that the compilers should inline everything here and generate
code for a small and efficient binary search loop. But GDC11 stopped doing this,
as can be confirmed by running "perf record ./a.out && perf report":

  27.86%  a.out  a.out  [.] _D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc__T18getTransitionIndexVEQGrQGq12SearchPolicyi3SQHoQHn__TQHkTQHaVQDha5_61203c2062ZQIj3geqTbZQDlMFNaNbNiNfbZm
  15.02%  a.out  a.out  [.] _D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc__T3geqTbTbZQjMFNaNbNiNfbbZb
  10.34%  a.out  a.out  [.] _D3std9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQCm5range__T4iotaTiTlZQkFilZ6ResultZQCv7opIndexMFNaNbNiNfmZb
  10.31%  a.out  a.out  [.] _D3std10functional__T9binaryFunVAyaa5_61203c2062VQra1_61VQza1_62Z__TQBvTbTbZQCdFNaNbNiNfKbKbZb
   3.03%  a.out  a.out  [.] _D3std5range__T4iotaTiTlZQkFilZ6Result7opIndexMNgFNaNbNiNfmZNgl
   2.34%  a.out  a.out  [.] 0x0000000000031a09
   2.28%  a.out  a.out  [.] _D4core6atomic__T7casImplTmTxmTmZQqFNaNbNiNePOmxmmZb
   2.11%  a.out  a.out  [.] _D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc7opSliceMFNaNbNiNfmmZSQGoQGn__TQGkTQGaVQCha5_61203c2062ZQHj
   2.02%  a.out  a.out  [.] _D3std5range__T12assumeSortedVAyaa5_61203c2062TSQBu9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQEfQEe__T4iotaTiTlZQkFilZ6ResultZQCsZQFdFNaNbNiNfQEjZSQGhQGg__T11SortedRangeTQFlVQGga5_61203c2062ZQBj

Using either -fwhole-program or -flto cmdline options resolves the performance
problem and allows all of these functions to be inlined again:

$ gdc-11.2.0 -g -O3 -frelease -fno-bounds-check -flto test.d && time ./a.out
59618479180

real    0m2.085s
user    0m2.085s
sys     0m0.000s

But is this expected? Does GDC now require using -flto option for getting
reasonable performance starting from version 11? Or is this a real performance
regression and something can be done to improve the inlining behaviour?
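
For reference, the "small and efficient binary search loop" mentioned above can be sketched by hand. This is only an illustration of what the one-liner is expected to reduce to once iota, map, assumeSorted and lowerBound are fully inlined. It is not output from any GDC version, and the name isqrtLoop is hypothetical. The sketch assumes a non-negative argument, as in the test program.

```d
import std.algorithm : min;

// Hand-written counterpart of isqrt() from the test case (hypothetical name).
// Assumes x >= 0. Finds the first v in [0, min(x, 3037000499)] with
// v * v > x and returns v - 1, i.e. the integer square root of x.
long isqrtLoop(long x)
{
    long lo = 0;                        // every v below lo satisfies v * v <= x
    long hi = min(x, 3037000499L) + 1;  // one past the last candidate value
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (mid * mid > x)
            hi = mid;       // predicate is true at mid: search the left half
        else
            lo = mid + 1;   // predicate is false at mid: search the right half
    }
    return lo - 1;          // lo is the number of values v with v * v <= x
}
```

Replacing map!isqrt with map!isqrtLoop in main() of the test case gives a baseline that does not depend on the inliner, which may help to separate inlining effects from other code generation differences between GDC versions.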