https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140
--- Comment #7 from Alex Coplan <acoplan at gcc dot gnu.org> ---
So it turns out the reason #pragma GCC unroll doesn't work under LTO is that we
don't propagate the `has_unroll` flag when streaming functions during LTO, so
the RTL loop2_unroll pass ends up not running at all. The following patch
allows us to recover it:

diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc
index 2e592be8082..93877065d86 100644
--- a/gcc/lto-streamer-in.cc
+++ b/gcc/lto-streamer-in.cc
@@ -1136,6 +1136,8 @@ input_cfg (class lto_input_block *ib, class data_in *data_in,
       /* Read OMP SIMD related info.  */
       loop->safelen = streamer_read_hwi (ib);
       loop->unroll = streamer_read_hwi (ib);
+      if (loop->unroll > 1)
+	fn->has_unroll = true;
       loop->owned_clique = streamer_read_hwi (ib);
       loop->dont_vectorize = streamer_read_hwi (ib);
       loop->force_vectorize = streamer_read_hwi (ib);

A more conservative fix might be to explicitly stream has_unroll out and in
again, but the above is simpler and I don't currently see a reason why we can't
infer it like this (comments welcome).

Anyway, this (together with the above C++ patch and adding the #pragma to
std::__find_if) gives us back ~3.9% on Neoverse V1. That recovers about 71% of
the regression, leaving the effective regression (relative to the hand-unrolled
code) at 1.7% instead of 5.8%.

It's possible there are further improvements to be had by tweaking the unrolled
codegen or by making the inlining heuristics take #pragma GCC unroll into
account (assuming they don't currently; I haven't checked). I'll try to do some
more analysis on the remaining difference. In any case, I'll aim to polish and
submit these patches unless there are any objections at this point.