On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:
>
> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
> > Could you add a testcase? Otherwise LGTM.
> >
> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> >
> > void foo(char *dst){
> >   __builtin_memset(dst, 0, 15);
> > }
>
> I'd like to see:
>
> * Test results.  This is only on for one target right now, so relying on
>   it to just work on others isn't a good idea.
> * Something to demonstrate this doesn't break -mstrict-align.
> * Some sort of performance analysis.  Most machines that support
>   unaligned access do so with some performance degradation,
Also, some machines that gracefully support misaligned accesses under
most circumstances nevertheless experience a performance degradation
when a load depends on two stores that overlap partially but not fully.
This transformation will obviously trigger such behavior from time to
time.

Note, I'm not objecting to this patch; I'm only responding to Palmer's
comment.  It makes sense to enable this kind of optimization for
-mtune=native etc., just not for standard software distributions.

> it's unclear
> that this conversion will actually manifest a performance improvement.
> I don't have a C906 and don't know what workloads people care about
> running on one, but I'm certainly not convinced this is a win --
> what's listed here looks to be the best case, and that's only saving
> two instructions to generate a pretty pathological sequence
> (misaligned access that conflicts with a prior store).
>
> Jojo: do you have any description of the C906 pipeline?  Specifically in
> this case it'd be good to know how it handles unaligned accesses.
>
> >
> > On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> This patch enables the overlap-by-pieces feature of the by-pieces
> >> infrastructure for inlining builtins in case the target has set
> >> riscv_slow_unaligned_access_p to false.
> >>
> >> To demonstrate the effect for targets with fast unaligned access,
> >> the following code sequences are generated for a 15-byte memset-zero.
> >>
> >> Without overlap_op_by_pieces we get:
> >>   8e:   00053023    sd      zero,0(a0)
> >>   92:   00052423    sw      zero,8(a0)
> >>   96:   00051623    sh      zero,12(a0)
> >>   9a:   00050723    sb      zero,14(a0)
> >>
> >> With overlap_op_by_pieces we get:
> >>   7e:   00053023    sd      zero,0(a0)
> >>   82:   000533a3    sd      zero,7(a0)
> >>
> >> gcc/ChangeLog:
> >>
> >>         * config/riscv/riscv.c (riscv_overlap_op_by_pieces): New function.
> >>         (TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
> >>         riscv_overlap_op_by_pieces.
> >>
> >> Signed-off-by: Christoph Muellner <cmuell...@gcc.gnu.org>
> >> ---
> >>  gcc/config/riscv/riscv.c | 11 +++++++++++
> >>  1 file changed, 11 insertions(+)
> >>
> >> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
> >> index 576960bb37c..98c76ba657a 100644
> >> --- a/gcc/config/riscv/riscv.c
> >> +++ b/gcc/config/riscv/riscv.c
> >> @@ -5201,6 +5201,14 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
> >>    return riscv_slow_unaligned_access_p;
> >>  }
> >>
> >> +/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */
> >> +
> >> +static bool
> >> +riscv_overlap_op_by_pieces (void)
> >> +{
> >> +  return !riscv_slow_unaligned_access_p;
> >> +}
> >> +
> >>  /* Implement TARGET_CAN_CHANGE_MODE_CLASS.  */
> >>
> >>  static bool
> >> @@ -5525,6 +5533,9 @@ riscv_asan_shadow_offset (void)
> >>  #undef TARGET_SLOW_UNALIGNED_ACCESS
> >>  #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
> >>
> >> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
> >> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
> >> +
> >>  #undef TARGET_SECONDARY_MEMORY_NEEDED
> >>  #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
> >>
> >> --
> >> 2.31.1
> >>
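
On the testcase request: a minimal DejaGnu sketch built from Kito's
snippet might look like the following. The file location and the
scan-assembler patterns are assumptions on my part and would need to be
checked against actual compiler output before committing.

```c
/* Hypothetical test, e.g. under gcc/testsuite/gcc.target/riscv/.  */
/* { dg-do compile } */
/* { dg-options "-O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64" } */

void foo(char *dst)
{
  __builtin_memset(dst, 0, 15);
}

/* Expect two (overlapping) doubleword stores and no byte-store tail.  */
/* { dg-final { scan-assembler-times "sd\tzero" 2 } } */
/* { dg-final { scan-assembler-not "sb\tzero" } } */
```

A companion test with -mstrict-align added to dg-options could verify
that the overlapped sequence is *not* emitted in that case.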