https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #7 from Alex Coplan <acoplan at gcc dot gnu.org> ---
So it turns out the reason #pragma GCC unroll doesn't work under LTO is that we
don't propagate the `has_unroll` flag when streaming functions, so the RTL
loop2_unroll pass ends up not running at all.
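
For reference, a minimal reproducer sketch (hypothetical, not the testcase from
this PR) is just a function with the pragma, built and linked with -flto, e.g.
g++ -O2 -flto unroll-lto.cc:

// unroll-lto.cc: hypothetical reproducer sketch.  Without the fix below,
// the streamed-in function loses has_unroll, so the RTL loop2_unroll pass
// is skipped and the requested unrolling is lost.
__attribute__ ((noinline))
int sum (const int *p, int n)
{
  int s = 0;
#pragma GCC unroll 4
  for (int i = 0; i < n; ++i)
    s += p[i];
  return s;
}

int main ()
{
  int a[16] = { 1, 2, 3 };
  return sum (a, 16);
}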

The following patch allows us to recover it:

diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc
index 2e592be8082..93877065d86 100644
--- a/gcc/lto-streamer-in.cc
+++ b/gcc/lto-streamer-in.cc
@@ -1136,6 +1136,8 @@ input_cfg (class lto_input_block *ib, class data_in *data_in,
       /* Read OMP SIMD related info.  */
       loop->safelen = streamer_read_hwi (ib);
       loop->unroll = streamer_read_hwi (ib);
+      if (loop->unroll > 1)
+       fn->has_unroll = true;
       loop->owned_clique = streamer_read_hwi (ib);
       loop->dont_vectorize = streamer_read_hwi (ib);
       loop->force_vectorize = streamer_read_hwi (ib);

A more conservative fix might be to explicitly stream has_unroll out and read
it back in, but the above is simpler and I don't currently see a reason why we
can't infer it like this (comments welcome).
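
For comparison, the conservative variant would look roughly like this (a sketch
only: it assumes the per-function flag bitpacks in output_struct_function_base
and input_struct_function_base are the right place, and the surrounding fields
are elided):

/* gcc/lto-streamer-out.cc, in output_struct_function_base (sketch):  */
  bp_pack_value (&bp, fn->has_unroll, 1);

/* gcc/lto-streamer-in.cc, in input_struct_function_base (sketch):  */
  fn->has_unroll = bp_unpack_value (&bp, 1);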

Anyway, this (together with the above C++ patch and adding the #pragma to
std::__find_if) gives us back ~3.9% on Neoverse V1.  That recovers about 71% of
the regression, leaving the effective regression (relative to the hand-unrolled
code) at 1.7% instead of 5.8%.
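
For illustration, the __find_if part just amounts to putting the pragma on the
search loop, along these lines (sketch only, not the actual libstdc++ change;
the unroll factor here is a placeholder):

// Hypothetical find_if-style loop with the pragma applied; the real
// std::__find_if in libstdc++ differs in naming and details.
template<typename It, typename Pred>
It find_if_unrolled (It first, It last, Pred pred)
{
#pragma GCC unroll 4
  while (first != last && !pred (*first))
    ++first;
  return first;
}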

It's possible there are further improvements to be had by tweaking the unrolled
codegen or by making the inlining heuristics take #pragma GCC unroll into
account (assuming they don't already; I haven't checked).  I'll try to do some
more analysis on the remaining difference.

In any case, I'll aim to polish and submit these patches unless there are any
objections at this point.
