OK, so if I understand correctly, Palmer and Andrew would prefer
overlap_op_by_pieces to be controlled by its own field in the
riscv_tune_param struct rather than by that struct's
slow_unaligned_access field
(i.e. slow_unaligned_access==false alone is not enough to imply
overlap_op_by_pieces==true).
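
As a rough, purely illustrative sketch of that direction (the names and
the struct layout below are made up by me and not checked against
riscv.c or any actual patch):

  #include <stdbool.h>

  /* Hypothetical dedicated knob next to slow_unaligned_access.  */
  struct riscv_tune_param_sketch
  {
    bool slow_unaligned_access;
    bool overlap_op_by_pieces;   /* new, independent field */
  };

  /* A C906-like tuning would then enable both explicitly.  */
  static const struct riscv_tune_param_sketch thead_c906_tune_sketch = {
    false,  /* slow_unaligned_access */
    true,   /* overlap_op_by_pieces */
  };

  static const struct riscv_tune_param_sketch *tune_param_sketch
    = &thead_c906_tune_sketch;

  /* TARGET_OVERLAP_OP_BY_PIECES_P would read the knob directly instead
     of deriving it from slow_unaligned_access.  */
  static bool
  riscv_overlap_op_by_pieces (void)
  {
    return tune_param_sketch->overlap_op_by_pieces;
  }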

I don't have access to pipeline details that would prove that there are
cases where this patch causes a performance penalty.

So, I leave this here as a summary for someone who has enough information and
interest to move this forward:
* the original patch should be sufficient, but does not have tests:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575791.html
* the tests can be taken from this patch:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html
  Note that there is a duplicated "sw" in builtins-overlap-6.c, which
  should be an "sd"; a rough sketch of such a test follows below.

Thanks for the feedback!

On Tue, Jul 27, 2021 at 3:48 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:
>
> On Mon, 26 Jul 2021 03:05:21 PDT (-0700), Andrew Waterman wrote:
> > On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:
> >>
> >> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
> >> > Could you add a testcase? Otherwise LGTM.
> >> >
> >> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> >> > void foo(char *dst){
> >> >    __builtin_memset(dst, 0, 15);
> >> > }
> >>
> >> I'd like to see:
> >>
> >> * Test results.  This is only on for one target right now, so relying on
> >>   it to just work on others isn't a good idea.
> >> * Something to demonstrate this doesn't break -mstrict-align.
> >> * Some sort of performance analysis.  Most machines that support
> >>   unaligned access do so with some performance degradation,
> >
> > Also, some machines that gracefully support misaligned accesses under
> > most circumstances nevertheless experience a perf degradation when the
> > load depends on two stores that overlap partially but not fully.  This
> > transformation will obviously trigger such behavior from time to time.
>
> Ya, I thought I wrote a response to this but I guess it's just in a
> buffer somewhere.  The code sequences this is generating are really the
> worst case for unaligned stores: one of them is always guaranteed to be
> misaligned, and it partially overlaps with a store one cycle away.
>
> We're really only saving a handful of instructions at best here, so
> there's not much room for error when it comes to these sorts of things.
> Even if this difficult case is handled fast we're only talking about
> saving 2 cycles, which is pretty borderline as the single-issue in-order
> machines I've worked with that do support misaligned accesses tend to
> take at least a few cycles of performance hit on misaligned accesses.
> Even if misaligned accesses are single-cycle, some back-of-the-envelope
> calculations say that a pipeline flush when crossing a cache line would
> definitely push this negative, and one per page crossing would be in the
> margins (depending on how we assume the original accesses are aligned).
>
> This is way too subtle of a thing to analyze without at least some
> knowledge of how this pipeline works, whether that's from a benchmark or
> just a pipeline description.
>
> > Note, I'm not objecting to this patch; I'm only responding to Palmer's
> > comment.  It makes sense to enable this kind of optimization for
> > -mtune=native etc., just not for standard software distributions.
>
> IMO there are really two cases here: -mtune=c906 and -Os (-mtune=native
> is sort of a red herring, it's just syntax).  For -mtune=c906 I'm happy
> enabling this as long as it's actually correct and improves performance,
> but that'll need at least some hardware-oriented analysis -- whether
> it's a benchmark or just a confirmation that these sequences are
> actually expected to run fast.
>
> -Os is a different case, though.  Last time this came up we decided that
> -Os shouldn't purposefully misalign accesses, even when that saves code
> size, as that's too likely to hit pathological performance cases.  I
> don't know if the weights have changed here: the C906 is currently by
> far the cheapest chip (which likely means it's going to be the most
> popular), but it's unclear as to whether it's even RISC-V compliant and
> I don't know of many people who've actually gotten one.  IMO it's too
> early to flip -Os over to this behavior; we at least need to know that
> this chip is going to be sufficiently RISC-V-ey that standard software
> will even run on it, much less be popular.
>
> >
> >
> >>   it's unclear
> >>   that this conversion will actually manifest a performance improvement.
> >>   I don't have a C906 and don't know what workloads people care about
> >>   running on one, but I'm certainly not convinced this is a win --
> >>   what's listed here looks to be the best case, and that's only saving
> >>   two instructions to generate a pretty pathological sequence
> >>   (misaligned access that conflicts with a prior store).
>
> Ah, I guess there it was ;)
>
> >>
> >> Jojo: do you have any description of the C906 pipeline?  Specifically in
> >> this case it'd be good to know how it handles unaligned accesses.
> >>
> >> >
> >> > On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> >>
> >> >> This patch enables the overlap-by-pieces feature of the by-pieces
> >> >> infrastructure for inlining builtins in case the target has set
> >> >> riscv_slow_unaligned_access_p to false.
> >> >>
> >> >> To demonstrate the effect for targets with fast unaligned access,
> >> >> the following code sequences are generated for a 15-byte memset-zero.
> >> >>
> >> >> Without overlap_op_by_pieces we get:
> >> >>   8e:   00053023                sd      zero,0(a0)
> >> >>   92:   00052423                sw      zero,8(a0)
> >> >>   96:   00051623                sh      zero,12(a0)
> >> >>   9a:   00050723                sb      zero,14(a0)
> >> >>
> >> >> With overlap_op_by_pieces we get:
> >> >>   7e:   00053023                sd      zero,0(a0)
> >> >>   82:   000533a3                sd      zero,7(a0)
> >> >>
> >> >> gcc/ChangeLog:
> >> >>
> >> >>         * config/riscv/riscv.c (riscv_overlap_op_by_pieces): New function.
> >> >>         (TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
> >> >>         riscv_overlap_op_by_pieces.
> >> >>
> >> >> Signed-off-by: Christoph Muellner <cmuell...@gcc.gnu.org>
> >> >> ---
> >> >>  gcc/config/riscv/riscv.c | 11 +++++++++++
> >> >>  1 file changed, 11 insertions(+)
> >> >>
> >> >> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
> >> >> index 576960bb37c..98c76ba657a 100644
> >> >> --- a/gcc/config/riscv/riscv.c
> >> >> +++ b/gcc/config/riscv/riscv.c
> >> >> @@ -5201,6 +5201,14 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
> >> >>    return riscv_slow_unaligned_access_p;
> >> >>  }
> >> >>
> >> >> +/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */
> >> >> +
> >> >> +static bool
> >> >> +riscv_overlap_op_by_pieces (void)
> >> >> +{
> >> >> +  return !riscv_slow_unaligned_access_p;
> >> >> +}
> >> >> +
> >> >>  /* Implement TARGET_CAN_CHANGE_MODE_CLASS.  */
> >> >>
> >> >>  static bool
> >> >> @@ -5525,6 +5533,9 @@ riscv_asan_shadow_offset (void)
> >> >>  #undef TARGET_SLOW_UNALIGNED_ACCESS
> >> >>  #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
> >> >>
> >> >> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
> >> >> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
> >> >> +
> >> >>  #undef TARGET_SECONDARY_MEMORY_NEEDED
> >> >>  #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
> >> >>
> >> >> --
> >> >> 2.31.1
> >> >>
