OK, so if I understand correctly, Palmer and Andrew prefer overlap_op_by_pieces to be controlled by its own field in the riscv_tune_param struct rather than by that struct's slow_unaligned_access field (i.e., slow_unaligned_access==false alone is not enough to imply overlap_op_by_pieces==true).
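To make that concrete, here is a rough sketch of the approach as I understand it (untested; the field name, its placement, and the -mstrict-align handling are my guesses, not code from any posted patch):

    /* Hypothetical opt-in knob in struct riscv_tune_param, next to
       slow_unaligned_access.  */
    struct riscv_tune_param
    {
      /* ... existing cost fields ... */
      bool slow_unaligned_access;
      bool overlap_op_by_pieces;
    };

    /* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */

    static bool
    riscv_overlap_op_by_pieces (void)
    {
      /* Respect -mstrict-align: never emit overlapping (and therefore
         misaligned) accesses in that mode; otherwise follow the flag
         that each tuning model (e.g. thead-c906) sets explicitly.  */
      return tune_param->overlap_op_by_pieces && !TARGET_STRICT_ALIGN;
    }

That keeps slow_unaligned_access and overlap_op_by_pieces independent, so a tuning model can enable fast unaligned accesses without also opting into overlapping by-pieces expansion.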
I don't have access to pipeline details that would prove there are cases where this patch causes a performance penalty. So I leave this here as a summary for someone who has enough information and interest to move this forward:

* the original patch should be sufficient, but does not have tests:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575791.html
* the tests can be taken from this patch:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html
  Note that there is a duplicated "sw" in builtins-overlap-6.c, which should be an "sd".

Thanks for the feedback!

On Tue, Jul 27, 2021 at 3:48 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:
>
> On Mon, 26 Jul 2021 03:05:21 PDT (-0700), Andrew Waterman wrote:
> > On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:
> >>
> >> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
> >> > Could you add a testcase? Otherwise LGTM.
> >> >
> >> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> >> > void foo(char *dst){
> >> >     __builtin_memset(dst, 0, 15);
> >> > }
> >>
> >> I'd like to see:
> >>
> >> * Test results. This is only on for one target right now, so relying on
> >>   it to just work on others isn't a good idea.
> >> * Something to demonstrate this doesn't break -mstrict-align.
> >> * Some sort of performance analysis. Most machines that support
> >>   unaligned access do so with some performance degradation,
> >
> > Also, some machines that gracefully support misaligned accesses under
> > most circumstances nevertheless experience a perf degradation when the
> > load depends on two stores that overlap partially but not fully. This
> > transformation will obviously trigger such behavior from time to time.
>
> Ya, I thought I wrote a response to this but I guess it's just in a
> buffer somewhere. The code sequences this is generating are really the
> worst case for unaligned stores: one of them is always guaranteed to be
> misaligned, and it partially overlaps with a store one cycle away.
>
> We're really only saving a handful of instructions at best here, so
> there's not much room for error when it comes to these sorts of things.
> Even if this difficult case is handled fast, we're only talking about
> saving 2 cycles, which is pretty borderline, as the single-issue in-order
> machines I've worked with that do support misaligned accesses tend to
> take at least a few cycles of performance hit on misaligned accesses.
> Even if misaligned accesses are single-cycle, some back-of-the-envelope
> calculations say that a pipeline flush when crossing a cache line would
> definitely push this negative, and one per page crossing would be in the
> margins (depending on how we assume the original accesses are aligned).
>
> This is way too subtle a thing to analyze without at least some
> knowledge of how this pipeline works, whether that's from a benchmark or
> just a pipeline description.
>
> > Note, I'm not objecting to this patch; I'm only responding to Palmer's
> > comment. It makes sense to enable this kind of optimization for
> > -mtune=native etc., just not for standard software distributions.
>
> IMO there are really two cases here: -mtune=c906 and -Os (-mtune=native
> is sort of a red herring, it's just syntax). For -mtune=c906 I'm happy
> enabling this as long as it's actually correct and improves performance,
> but that'll need at least some hardware-oriented analysis -- whether
> it's a benchmark or just a confirmation that these sequences are
> actually expected to run fast.
>
> -Os is a different case, though. Last time this came up we decided that
> -Os shouldn't purposefully misalign accesses, even when that saves code
> size, as that's too likely to hit pathological performance cases. I
> don't know if the weights have changed here: the C906 is currently by
> far the cheapest chip (which likely means it's going to be the most
> popular), but it's unclear whether it's even RISC-V compliant, and I
> don't know of many people who've actually gotten one. IMO it's too
> early to flip -Os over to this behavior; we at least need to know that
> this chip is going to be sufficiently RISC-V-ey that standard software
> will even run on it, much less be popular.
>
> >
> >> it's unclear
> >> that this conversion will actually manifest a performance improvement.
> >> I don't have a C906 and don't know what workloads people care about
> >> running on one, but I'm certainly not convinced this is a win --
> >> what's listed here looks to be the best case, and that's only saving
> >> two instructions to generate a pretty pathological sequence
> >> (a misaligned access that conflicts with a prior store).
>
> Ah, I guess there it was ;)
>
> >>
> >> Jojo: do you have any description of the C906 pipeline? Specifically in
> >> this case it'd be good to know how it handles unaligned accesses.
> >>
> >> >
> >> > On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> >>
> >> >> This patch enables the overlap-by-pieces feature of the by-pieces
> >> >> infrastructure for inlining builtins in case the target has set
> >> >> riscv_slow_unaligned_access_p to false.
> >> >>
> >> >> To demonstrate the effect for targets with fast unaligned access,
> >> >> the following code sequences are generated for a 15-byte memset-zero.
> >> >>
> >> >> Without overlap_op_by_pieces we get:
> >> >>   8e: 00053023  sd  zero,0(a0)
> >> >>   92: 00052423  sw  zero,8(a0)
> >> >>   96: 00051623  sh  zero,12(a0)
> >> >>   9a: 00050723  sb  zero,14(a0)
> >> >>
> >> >> With overlap_op_by_pieces we get:
> >> >>   7e: 00053023  sd  zero,0(a0)
> >> >>   82: 000533a3  sd  zero,7(a0)
> >> >>
> >> >> gcc/ChangeLog:
> >> >>
> >> >>         * config/riscv/riscv.c (riscv_overlap_op_by_pieces): New
> >> >>         function.
> >> >>         (TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
> >> >>         riscv_overlap_op_by_pieces.
> >> >>
> >> >> Signed-off-by: Christoph Muellner <cmuell...@gcc.gnu.org>
> >> >> ---
> >> >>  gcc/config/riscv/riscv.c | 11 +++++++++++
> >> >>  1 file changed, 11 insertions(+)
> >> >>
> >> >> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
> >> >> index 576960bb37c..98c76ba657a 100644
> >> >> --- a/gcc/config/riscv/riscv.c
> >> >> +++ b/gcc/config/riscv/riscv.c
> >> >> @@ -5201,6 +5201,14 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
> >> >>    return riscv_slow_unaligned_access_p;
> >> >>  }
> >> >>
> >> >> +/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */
> >> >> +
> >> >> +static bool
> >> >> +riscv_overlap_op_by_pieces (void)
> >> >> +{
> >> >> +  return !riscv_slow_unaligned_access_p;
> >> >> +}
> >> >> +
> >> >>  /* Implement TARGET_CAN_CHANGE_MODE_CLASS.  */
> >> >>
> >> >>  static bool
> >> >> @@ -5525,6 +5533,9 @@ riscv_asan_shadow_offset (void)
> >> >>  #undef TARGET_SLOW_UNALIGNED_ACCESS
> >> >>  #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
> >> >>
> >> >> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
> >> >> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
> >> >> +
> >> >>  #undef TARGET_SECONDARY_MEMORY_NEEDED
> >> >>  #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
> >> >>
> >> >> --
> >> >> 2.31.1
> >> >>
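PS: For whoever picks this up, the testcase Kito asked for could look roughly like the sketch below, based on his example quoted above (the dg-final scan pattern is my guess; ready-made tests are in the second link at the top):

    /* { dg-do compile } */
    /* { dg-options "-O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64" } */

    void foo (char *dst)
    {
      __builtin_memset (dst, 0, 15);
    }

    /* A 15-byte memset-zero should collapse into two overlapping sd
       instructions, as in the code sequence quoted above.  */
    /* { dg-final { scan-assembler-times "sd\tzero" 2 } } */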