On Mon, Aug 16, 2021 at 10:10 AM Palmer Dabbelt <pal...@dabbelt.com> wrote: > > On Mon, 16 Aug 2021 09:29:16 PDT (-0700), Kito Cheng wrote: > >> > Could you submit v3 patch which is v1 with overlap_op_by_pieces field, > >> > testcase from v2 and add a few more comments to describe the field? > >> > > >> > And add an -mtune=ultra-size to make it able to test without change > >> > other behavior? > >> > > >> > Hi Palmer: > >> > > >> > Are you OK with that? > >> > >> I'm still not convinced on the performance: like Andrew and I pointed > >> out, this is a difficult case for pipelines of this flavor to handle. > >> Nobody here knows anything about this pipeline deeply enough to say > >> anything difinitive, though, so this is really just a guess. > > > > So with an extra field to indicate should resolve that? > > I believe people should only set overlap_op_by_pieces > > to true only if they are sure it has benefits. > > My only issue there is that we'd have no way to turn it on, but see > below... > > >> As I'm not convinced this is an obvious performance win I'm not going to > >> merge it without a benchmark. If you're convinced and want to merge it > >> that's fine, I don't really care about the performance fo the C906 and > >> if someone complains we can always just revert it later. > > > > I suppose Christoph has tried with their internal processor, and it's > > benefit on performance, > > but it can't be open-source yet, so v2 patch set using C906 to demo > > and test that since that is > > the only processor with slow_unaligned_access=False. > > Well, that's a very different discussion. The C906 tuning model should > be for the C906, not a proxy for some internal-only processor. If the > goal here is to allow this pass to be flipped on by an out-of-tree > pipeline model then we can talk about it. > > > I agree on the C906 part, we never know it's benefit or not, so I propose > > adding one -mtune=ultra-size to make this test-able rather than changing > > C906. > > That's essentially the same conclusion we came to last time this came > up, except that we were calling it "-Oz" (because LLVM does). I guess > we never got around having the broader GCC discussion about "-Oz". IIRC > we had some other "-Oz" candidates we never got around to dealing with, > but that was a while ago so I'm not sure if any of that panned out.
-Oz was a bad idea that Apple came up because GCC decided to start emitting store multiple on PowerPC around 13 years ago. I don't think we should repeat that mistake for GCC and especially for RISCV. If people want to optimize for size, they get the performance issues. Thanks, Andrew Pinski