On Fri, Nov 20, 2020 at 7:11 PM Segher Boessenkool
<seg...@kernel.crashing.org> wrote:
>
> Hi!
>
> On Fri, Nov 20, 2020 at 04:22:47PM +0100, Jan Hubicka wrote:
> > As you know I spend quite some time on inliner heuristics, but even after
> > the years I have no clear idea how the requirements differ from x86-64
> > to ppc, arm and s390.  Clearly, compared to x86-64, prologues may get more
> > expensive on ppc/arm because of more registers (so we should inline less
> > into cold code) and function calls are more expensive (so we should inline
> > more into hot code).  We do have PRs for that in the testsuite, most of
> > which I have looked through.
>
> I made -fshrink-wrap-separate to make prologues less expensive for stuff
> that is only used on the cold paths.  This matters a lot -- and much
> more could be done there, but that requires changing the generated code,
> not just reordering it, so it is harder to do.
>
> Prologues (and epilogues) are only expensive if they are only needed for
> cold code, in a hot function.
>
> > Problem is that each of us has a different methodology - different
> > benchmarks to look at
>
> This is a good thing often as well, it increases our total coverage.
> But if not everything sees all results, that also hurts :-/
>
> > and different opinions on what is good for O2 and
> > O3.
>
> Yeah.  The documentation for -O3 merely says "Optimize yet more.", but
> that is no guidance at all: why would a user ever use -O2 then?
>
> I always understood it as "-O2 is always faster than -O1, but -O3 is not
> always faster than -O2".  Aka "-O2 is always a good choice, and -O3 is
> an even better choice for *some* code, but that needs testing per case".
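
The cold-path prologue point above is easiest to see on a small example;
here is a minimal sketch (invented names, not from any real testcase) of
the kind of shape that shrink-wrapping helps:

  /* Hot function whose only register-clobbering work sits on a cold path.  */
  extern void log_missing_key (int key);   /* cold diagnostic helper */

  int
  hot_lookup (const int *table, int n, int key)
  {
    /* Hot path: a plain loop that needs no saved registers at all.  */
    for (int i = 0; i < n; i++)
      if (table[i] == key)
        return i;

    /* Cold path: the call is what forces saving the link register etc.;
       shrink-wrapping can sink those saves down here instead of paying
       for them at function entry on every call.  */
    log_missing_key (key);
    return -1;
  }

On targets like ppc/arm with more callee-saved state the hot loop can then
run with no prologue at all, which is exactly the "only needed for cold
code, in a hot function" case.
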
So basically -O2 is supposed to be well-balanced in compile time, code size,
performance and debuggability (if there is such a thing with optimized
code...).  -O1 is what you should use for machine-generated code; we kind-of
promise to have no quadratic or worse algorithms in compile time/memory use
there, so you can throw a multi-gigabyte source function at GCC and it
should not blow up.  And -O1 still optimizes.  -Os is when you want small
code size at all cost (compile time, less performance).  -O3 is when you
want performance at all cost (compile time, code size and the ability to
debug).

So I'd always use -O2 unless doing a compute workload, where I'd choose -O3
(maybe selectively for the relevant TUs); see the example invocations at the
end of this mail.  Then there are profile feedback and LTO: enable them if
you can (I'd avoid them for code you need -O1 for).  They really help GCC
make an appropriate decision about what code to optimize more and what less,
within the balanced profile that -O2 aims for.

> In at least that understanding, and also to battle inflation in general,
> we probably should move some things from -O3 to -O2.
>
> > From a long term maintenance POV I am worried about changing a lot of
> > --param defaults in different backends
>
> Me too.  But changing a few key ones is just too important for
> performance :-/
>
> > simply because the meaning of
> > those values keeps changing (as early opts improve; we get better at
> > tracking optimizations during IPA passes; and our focus shifts from C
> > with sane inlines to basic C++ to heavily templatized C++ with many broken
> > inline hints to heavy C++ with LTO).
>
> I don't like it if targets start to differ too much (in what generic passes
> effectively do), no matter what.  It's just not maintainable.
>
> > For this reason I tend to prefer not to tweak things in target-specific
> > ways unless there is very clear evidence to do so, just because I think I
> > will not be able to maintain code quality testing in the future.
>
> Yes, completely agreed.  But that exception is important :-)
>
> > It would be very interesting to set up testing that could let us compare
> > basic arches side by side with different defaults.  Our LNT testing does a
> > good job for x86-64, but we have basically zero coverage publicly
> > available on other targets, and it is very hard to get inliner-relevant
> > benchmarks (where SPEC is not the best choice) done in a comparable way on
> > multiple arches.
>
> We cannot help with that on the cfarm, unless we get dedicated hardware
> for such benchmarking (and I am not holding my breath for that, getting
> good coverage at all is hard enough).  So you probably need to get such
> support for every arch separately, elsewhere :-/
>
>
> Segher
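
P.S.  The guidance above, spelled out as example invocations (the file
names and training input are just placeholders; the options themselves are
standard GCC flags):

  # Balanced default for most code.
  gcc -O2 -c foo.c

  # Compute kernels where more aggressive inlining/vectorization pays off.
  gcc -O3 -c kernel.c

  # Huge machine-generated sources.
  gcc -O1 -c generated.c

  # Small code size at all cost.
  gcc -Os -c tiny.c

  # Profile feedback plus LTO on top of -O2: train, then rebuild.
  gcc -O2 -flto -fprofile-generate foo.c -o foo
  ./foo < training-input
  gcc -O2 -flto -fprofile-use foo.c -o foo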