On Mon, 26 Jul 2021 03:05:21 PDT (-0700), Andrew Waterman wrote:
On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:
On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
> Could you add a testcase? Otherwise LGTM.
>
> Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
> void foo(char *dst){
> __builtin_memset(dst, 0, 15);
> }
I'd like to see:
* Test results. This is only on for one target right now, so relying on
it to just work on others isn't a good idea.
* Something to demonstrate this doesn't break -mstrict-align.
* Some sort of performance analysis. Most machines that support
unaligned access do so with some performance degradation,
Also, some machines that gracefully support misaligned accesses under
most circumstances nevertheless experience a perf degradation when the
load depends on two stores that overlap partially but not fully. This
transformation will obviously trigger such behavior from time to time.
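(As a concrete -- and purely hypothetical -- caller that would hit the pattern
Andrew describes; this isn't from the patch or its tests, and -O2 may well fold
the load away, it's only meant to show the dynamic behavior:

    long consume (char *dst)
    {
      long v;
      /* With the patch this expands to sd zero,0(a0); sd zero,7(a0).  */
      __builtin_memset (dst, 0, 15);
      /* This 8-byte load overlaps both stores, but neither store alone
         covers it, so store-to-load forwarding may have to stall.  */
      __builtin_memcpy (&v, dst + 4, sizeof v);
      return v;
    }
)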
Ya, I thought I wrote a response to this but I guess it's just in a
buffer somewhere. The code sequences this is generating are really the
worst case for unaligned stores: one of them is always guaranteed to be
misaligned, and it partially overlaps with a store one cycle away.
We're really only saving a handful of instructions at best here, so
there's not much room for error when it comes to these sorts of things.
Even if this difficult case is handled quickly, we're only talking about
saving 2 cycles, which is pretty borderline: the single-issue in-order
machines I've worked with that do support misaligned accesses tend to
take at least a few cycles of penalty on each misaligned access.
Even if misaligned accesses are single-cycle, some back-of-the-envelope
calculations say that a pipeline flush when crossing a cache line would
definitely push this negative, and one flush per page crossing would be in
the margins (depending on how we assume the original accesses are aligned).
This is way too subtle of a thing to analyze without at least some
knowledge of how this pipeline works, whether that's from a benchmark or
just a pipeline description.
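To put rough numbers on that (the cycle costs below are assumptions for the
sake of the estimate, not measurements of any real pipeline):

    saved work       4 stores -> 2 stores, so ~2 cycles on a single-issue machine
    cache-line split a 15-byte region at a uniformly random offset crosses a
                     64-byte line with probability 14/64 (~22%); at ~5 cycles
                     per flush that's ~1.1 cycles of expected penalty
    page split       probability 14/4096 (~0.3%), so it only matters if the
                     per-crossing penalty is very large

Even a modest per-crossing cost eats most of the 2-cycle saving, which is why
I'd want numbers from the actual pipeline.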
Note, I'm not objecting to this patch; I'm only responding to Palmer's
comment. It makes sense to enable this kind of optimization for
-mtune=native etc., just not for standard software distributions.
IMO there are really two cases here: -mtune=c906 and -Os (-mtune=native
is sort of a red herring, it's just syntax). For -mtune=c906 I'm happy
enabling this as long as it's actually correct and improves performance,
but that'll need at least some hardware-oriented analysis -- whether
it's a benchmark or just a confirmation that these sequences are
actually expected to run fast.
-Os is a different case, though. Last time this came up we decided that
-Os shouldn't purposefully misalign accesses, even when that saves code
size, as that's too likely to hit pathological performance cases. I
don't know if the weights have changed here: the C906 is currently by
far the cheapest chip (which likely means it's going to be the most
popular), but it's unclear whether it's even RISC-V compliant, and I
don't know of many people who've actually gotten one.  IMO it's too
early to flip -Os over to this behavior; we at least need to know that
this chip is going to be sufficiently RISC-V-ey that standard software
will even run on it, much less be popular on it.
it's unclear
that this conversion will actually manifest a performance improvement.
I don't have a C906 and don't know what workloads people care about
running on one, but I'm certainly not convinced this is a win --
what's listed here looks to be the best case, and that's only saving
two instructions to generate a pretty pathological sequence
(misaligned access that conflicts with a prior store).
Ah, I guess there it was ;)
Jojo: do you have any description of the C906 pipeline? Specifically in
this case it'd be good to know how it handles unaligned accesses.
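For the testcase Kito asked for, something along these lines is what I have in
mind -- just a sketch, the scan patterns (and a matching -mstrict-align
variant) would still need to be checked against real output:

    /* { dg-do compile } */
    /* { dg-options "-O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64" } */

    void foo (char *dst)
    {
      __builtin_memset (dst, 0, 15);
    }

    /* With overlapping stores enabled we expect the two-sd sequence and no
       byte/halfword tail stores.  */
    /* { dg-final { scan-assembler-times "sd\t" 2 } } */
    /* { dg-final { scan-assembler-not "sb\t" } } */

A second copy of the test with -mstrict-align added should keep the sb/sh
tail stores, which would also cover the strict-alignment concern above.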
>
> On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> This patch enables the overlap-by-pieces feature of the by-pieces
>> infrastructure for inlining builtins in case the target has set
>> riscv_slow_unaligned_access_p to false.
>>
>> To demonstrate the effect for targets with fast unaligned access,
>> the following code sequences are generated for a 15-byte memset-zero.
>>
>> Without overlap_op_by_pieces we get:
>> 8e: 00053023 sd zero,0(a0)
>> 92: 00052423 sw zero,8(a0)
>> 96: 00051623 sh zero,12(a0)
>> 9a: 00050723 sb zero,14(a0)
>>
>> With overlap_op_by_pieces we get:
>> 7e: 00053023 sd zero,0(a0)
>> 82: 000533a3 sd zero,7(a0)
>>
>> gcc/ChangeLog:
>>
>> * config/riscv/riscv.c (riscv_overlap_op_by_pieces): New function.
>> (TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
>> riscv_overlap_op_by_pieces.
>>
>> Signed-off-by: Christoph Muellner <cmuell...@gcc.gnu.org>
>> ---
>> gcc/config/riscv/riscv.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
>> index 576960bb37c..98c76ba657a 100644
>> --- a/gcc/config/riscv/riscv.c
>> +++ b/gcc/config/riscv/riscv.c
>> @@ -5201,6 +5201,14 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
>> return riscv_slow_unaligned_access_p;
>> }
>>
>> +/* Implement TARGET_OVERLAP_OP_BY_PIECES_P. */
>> +
>> +static bool
>> +riscv_overlap_op_by_pieces (void)
>> +{
>> + return !riscv_slow_unaligned_access_p;
>> +}
>> +
>> /* Implement TARGET_CAN_CHANGE_MODE_CLASS. */
>>
>> static bool
>> @@ -5525,6 +5533,9 @@ riscv_asan_shadow_offset (void)
>> #undef TARGET_SLOW_UNALIGNED_ACCESS
>> #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
>>
>> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
>> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
>> +
>> #undef TARGET_SECONDARY_MEMORY_NEEDED
>> #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
>>
>> --
>> 2.31.1
>>