On Fri, Nov 20, 2020 at 7:11 PM Segher Boessenkool
<seg...@kernel.crashing.org> wrote:
>
> Hi!
>
> On Fri, Nov 20, 2020 at 04:22:47PM +0100, Jan Hubicka wrote:
> > As you know I spend quite some time on inliner heuristics, but even after
> > the years I have no clear idea how the requirements differ from x86-64
> > to ppc, arm and s390.  Clearly, compared to x86-64, prologues may get more
> > expensive on ppc/arm because of more registers (so we should inline less
> > into cold code) and function calls are more expensive (so we should inline
> > more into hot code).  We do have PRs for that in the testsuite, most of
> > which I have looked through.
>
> I made -fshrink-wrap-separate to make prologues less expensive for stuff
> that is only used on the cold paths.  This matters a lot -- and much
> more could be done there, but that requires changing the generated code,
> not just reordering it, so it is harder to do.
>
> Prologues (and epilogues) are only expensive if they are only needed for
> cold code, in a hot function.
>
> > Problem is that each of us has a different methodology - different
> > benchmarks to look at
>
> This is a good thing often as well, it increases our total coverage.
> But if not everything sees all results, that also hurts :-/
>
> > and different opinions on what is good for O2 and
> > O3.
>
> Yeah.  The documentation for -O3 merely says "Optimize yet more.", but
> that is no guidance at all: why would a user ever use -O2 then?
>
> I always understood it as "-O2 is always faster than -O1, but -O3 is not
> always faster than -O2".  Aka "-O2 is always a good choice, and -O3 is
> an even better choice for *some* code, but that needs testing per case".
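
The cold-path prologue point above is easiest to see on a small example;
here is a minimal sketch (invented names, not from any real testcase) of
the kind of shape that shrink-wrapping helps:

  /* Hot function whose only register-clobbering work sits on a cold path.  */
  extern void log_missing_key (int key);   /* cold diagnostic helper */

  int
  hot_lookup (const int *table, int n, int key)
  {
    /* Hot path: a plain loop that needs no saved registers at all.  */
    for (int i = 0; i < n; i++)
      if (table[i] == key)
        return i;

    /* Cold path: the call is what forces saving the link register etc.;
       shrink-wrapping can sink those saves down here instead of paying
       for them at function entry on every call.  */
    log_missing_key (key);
    return -1;
  }

On targets like ppc/arm with more callee-saved state the hot loop can then
run with no prologue at all, which is exactly the "only needed for cold
code, in a hot function" case.
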
So basically -O2 is supposed to be well-balanced in compile time, code size,
performance and debuggability (if there is such a thing with optimized
code...).  -O1 is what you should use for machine-generated code; we kind-of
promise to have no quadratic or worse algorithms in compile time/memory use
there, so you can throw a multi-gigabyte source function at GCC and it
should not blow up.  And -O1 still optimizes.  -Os is when you want small
code size at all cost (compile time, less performance).  -O3 is when you
want performance at all cost (compile time, code size and the ability to
debug).

So I'd always use -O2 unless doing a compute workload, where I'd choose -O3
(maybe selectively for the relevant TUs); see the example invocations at the
end of this mail.  Then there are profile feedback and LTO: enable them if
you can (I'd avoid them for code you need -O1 for).  They really help GCC
make an appropriate decision about what code to optimize more and what less,
within the balanced profile that -O2 aims for.

> In at least that understanding, and also to battle inflation in general,
> we probably should move some things from -O3 to -O2.
>
> > From a long term maintenance POV I am worried about changing a lot of
> > --param defaults in different backends
>
> Me too.  But changing a few key ones is just too important for
> performance :-/
>
> > simply because the meaning of
> > those values keeps changing (as early opts improve; we get better at
> > tracking optimizations during IPA passes; and our focus shifts from C
> > with sane inlines to basic C++ to heavily templatized C++ with many broken
> > inline hints to heavy C++ with LTO).
>
> I don't like it if targets start to differ too much (in what generic passes
> effectively do), no matter what.  It's just not maintainable.
>
> > For this reason I tend to prefer not to tweak things in target-specific
> > ways unless there is very clear evidence to do so, just because I think I
> > will not be able to maintain code quality testing in the future.
>
> Yes, completely agreed.  But that exception is important :-)
>
> > It would be very interesting to set up testing that could let us compare
> > basic arches side by side with different defaults.  Our LNT testing does a
> > good job for x86-64, but we have basically zero coverage publicly
> > available on other targets, and it is very hard to get inliner-relevant
> > benchmarks (where SPEC is not the best choice) done in a comparable way on
> > multiple arches.
>
> We cannot help with that on the cfarm, unless we get dedicated hardware
> for such benchmarking (and I am not holding my breath for that, getting
> good coverage at all is hard enough).  So you probably need to get such
> support for every arch separately, elsewhere :-/
>
>
> Segher
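
P.S.  The guidance above, spelled out as example invocations (the file
names and training input are just placeholders; the options themselves are
standard GCC flags):

  # Balanced default for most code.
  gcc -O2 -c foo.c

  # Compute kernels where more aggressive inlining/vectorization pays off.
  gcc -O3 -c kernel.c

  # Huge machine-generated sources.
  gcc -O1 -c generated.c

  # Small code size at all cost.
  gcc -Os -c tiny.c

  # Profile feedback plus LTO on top of -O2: train, then rebuild.
  gcc -O2 -flto -fprofile-generate foo.c -o foo
  ./foo < training-input
  gcc -O2 -flto -fprofile-use foo.c -o foo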