On Mon, 16 Aug 2021 11:56:05 PDT (-0700), pins...@gmail.com wrote:
On Mon, Aug 16, 2021 at 10:10 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:

On Mon, 16 Aug 2021 09:29:16 PDT (-0700), Kito Cheng wrote:
>> > Could you submit v3 patch which is v1 with overlap_op_by_pieces field,
>> > testcase from v2 and add a few more comments to describe the field?
>> >
>> > And add an -mtune=ultra-size to make it able to test without change
>> > other behavior?
>> >
>> > Hi Palmer:
>> >
>> > Are you OK with that?
>>
>> I'm still not convinced on the performance: like Andrew and I pointed
>> out, this is a difficult case for pipelines of this flavor to handle.
>> Nobody here knows anything about this pipeline deeply enough to say
>> anything difinitive, though, so this is really just a guess.
>
> So with an extra field to indicate should resolve that?
> I believe people should only set overlap_op_by_pieces
> to true only if they are sure it has benefits.

My only issue there is that we'd have no way to turn it on, but see
below...

>> As I'm not convinced this is an obvious performance win I'm not going to
>> merge it without a benchmark.  If you're convinced and want to merge it
>> that's fine, I don't really care about the performance fo the C906 and
>> if someone complains we can always just revert it later.
>
> I suppose Christoph has tried with their internal processor, and it's
> benefit on performance,
> but it can't be open-source yet, so v2 patch set using C906 to demo
> and test that since that is
> the only processor with slow_unaligned_access=False.

Well, that's a very different discussion.  The C906 tuning model should
be for the C906, not a proxy for some internal-only processor.  If the
goal here is to allow this pass to be flipped on by an out-of-tree
pipeline model then we can talk about it.

> I agree on the C906 part, we never know it's benefit or not, so I propose
> adding one -mtune=ultra-size to make this test-able rather than changing C906.

That's essentially the same conclusion we came to last time this came
up, except that we were calling it "-Oz" (because LLVM does).  I guess
we never got around having the broader GCC discussion about "-Oz".  IIRC
we had some other "-Oz" candidates we never got around to dealing with,
but that was a while ago so I'm not sure if any of that panned out.

-Oz was a bad idea that Apple came up because GCC decided to start
emitting store multiple on PowerPC around 13 years ago.
I don't think we should repeat that mistake for GCC and especially for RISCV.
If people want to optimize for size, they get the performance issues.

Makes sense. Probably best to avoid adding the RISC-V specific version of this as well, then, as it's really just two sides of the same coin.

Sounds like we'll likely want to stop implementing -Os via a tuning on RISC-V: that was a convienent way to do it wen we didn't have any conflicts between -O and -mtune, but assuming this will eventually land that won't be valid any more. That's a pretty mechinacial process.

It still leaves us with the question of what to do with this pass, which IMO really just depends on what the actual goal is here: if we're trying to optimize for the C906 then we should just wait for the benchmarks to demorstrate this is worth doing (though again, Kito, if you think this is good enough and want to flip this on I don't really care that much), but if we're trying to optimize for some other pipeline then we should really wait for that to show up.

I'm not going to speculate about what this new pipeline is, but if there's anything concrete announced about it then I'm happy to take a look. Historically we've never been super strict about waiting for hardware before taking a pipeline model, but I do think we should have something as just trying to support any hypothetical future hardware will lead to insanity. IMO we need to be extra explicit that we're willing to work with hardware vendors, as due to the nature of RISC-V that can get lost in translation, but there has to be some balance.

Reply via email to