On Thu, Aug 7, 2014 at 4:20 PM, James Greenhalgh <james.greenha...@arm.com> wrote:
> On Tue, Aug 05, 2014 at 08:05:00AM +0100, Andrew Pinski wrote:
>> On Fri, Aug 1, 2014 at 2:21 AM, <pins...@gmail.com> wrote:
>> >> On Jun 6, 2014, at 1:50 AM, James Greenhalgh <james.greenha...@arm.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> The move_by_pieces infrastructure performs a copy by repeatedly trying
>> >> the largest safe copy it can make. So for a 15-byte copy we might see:
>> >>
>> >>   offset   amount   bytes copied
>> >>   0        8        0-7
>> >>   8        4        8-11
>> >>   12       2        12-13
>> >>   14       1        14
>> >>
>> >> However, we can implement a 15-byte copy as:
>> >>
>> >>   offset   amount   bytes copied
>> >>   0        8        0-7
>> >>   7        8        7-14
>> >>
>> >> which can prove more efficient for both space and speed.
>> >>
>> >> In this patch we set MOVE_RATIO low to avoid using move_by_pieces, and
>> >> implement the movmem pattern name to expand small block copy cases. Note,
>> >> this optimization does not apply for -mstrict-align targets, which must
>> >> continue copying byte-by-byte.
>> >
>> > Why not change move_by_pieces instead of having target-specific code?
>> > This seems like a better option. You can check the SLOW_UNALIGNED_ACCESS
>> > target macro to see if you want to do this optimization too. As I mentioned
>> > in the other email, make sure you check the volatileness of the from and to
>> > before doing this optimization.
>
> Hi Andrew,
>
> If we are converting these volatile copies to memcpy calls, then there is an
> additional bug there. There is nothing in the C standard which requires that
> memcpy/memmove read and write each byte only once, and it seems reasonable to
> assume that the movmem optab inherits this lack of restrictions. This gotcha
> either needs to be fixed, or at least documented for the movmem optab.
>
> If I'm going to fix this, I have to write it in the back-end. Failing
> the expand will cause a buggy call to memcpy. Having said that, I'm not
> sure I've seen a good definition of the semantics of a volatile struct
> copy. It feels to me that an element-by-element copy is more likely to
> match user expectations than a chunk-by-chunk copy. It is probably too
> late for me to get that by the time I am in movmem, so I'll have to push
> the fix earlier (probably somewhere generic?).
>
> Do you know of a good write-up/argument/discussion on volatile struct
> copy semantics? The testcase you provided is obviously broken; what is less
> obvious is what should happen for:
>
> struct __attribute__((packed)) t16 {
>   long long t8;
>   int t4;
>   short t2;
>   unsigned char t1;
>   unsigned char t1a;
> };
>
> volatile struct t16 t16;
>
> int f(struct t16 *a)
> {
>   t16 = *a;
> }
>
> We have at least two choices...
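
To make the two choices concrete, here is a rough standalone sketch (not from
the thread or the patch; the function names are purely illustrative) contrasting
an element-by-element copy of the t16 testcase above with a chunk-by-chunk copy:

  #include <string.h>

  struct __attribute__((packed)) t16 {
    long long t8;
    int t4;
    short t2;
    unsigned char t1;
    unsigned char t1a;
  };                              /* 16 bytes when packed */

  volatile struct t16 t16;

  /* Choice 1: element-by-element -- every field is accessed exactly once,
     with its declared width.  */
  void copy_by_element (struct t16 *a)
  {
    t16.t8  = a->t8;
    t16.t4  = a->t4;
    t16.t2  = a->t2;
    t16.t1  = a->t1;
    t16.t1a = a->t1a;
  }

  /* Choice 2: chunk-by-chunk -- two 8-byte accesses; the second covers
     t4, t2, t1 and t1a in a single access, so the fields are not accessed
     with their declared widths.  The casts drop the volatile qualifier and
     are only here to show the access pattern.  */
  void copy_by_chunk (struct t16 *a)
  {
    unsigned char *d = (unsigned char *) &t16;
    unsigned char *s = (unsigned char *) a;
    memcpy (d,     s,     8);     /* bytes 0-7  */
    memcpy (d + 8, s + 8, 8);     /* bytes 8-15 */
  }
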
It's the language frontend's job to present the middle-end with something
sensible, for example an element-wise copy. Also consider:

struct X { int i; volatile int j; int k[1024]; } x;

void f (struct X *a)
{
  x = *a;
}

Richard.

>> Attached is the patch which does what I mentioned. I also changed
>> store_by_pieces to implement a similar optimization there (for memset
>> and strcpy). Also, since I used SLOW_UNALIGNED_ACCESS, this is a
>> generic optimization.
>
> I'm not sure this helps your situation on AArch64. There are still AArch64
> implementations for which we will want to bypass move_by_pieces and provide
> a back-end implementation.
>
> We could more reasonably be controlling this with MOVE_BY_PIECES_P, but
> this is only a thin wrapper around MOVE_RATIO, so the result for you is
> much the same (pending a patch fixing SRA not to read MOVE_RATIO, it should
> make no difference whether we disable by MOVE_RATIO or MOVE_BY_PIECES_P).
>
> Have you done much micro/macro-benchmarking to show that this is indeed
> a sensible optimization for !SLOW_UNALIGNED_ACCESS targets? The documentation
> suggests that SLOW_UNALIGNED_ACCESS should be set if unaligned accesses
> are "many times" slower. This is a bit of a blunt hammer - there are likely
> targets which will suffer from this optimization, but which don't set
> SLOW_UNALIGNED_ACCESS. Maybe you need some finer-grained cost function?
>
> Thanks,
> James
>
>> I had tested an earlier version on x86_64-linux-gnu and I am in the
>> middle of bootstrap/testing on this one.
>>
>> Thanks,
>> Andrew Pinski
>>
>> 	* expr.c (move_by_pieces): Take the min of max_size and len to
>> 	speed things up and to take advantage of the mode in
>> 	move_by_pieces_1.
>> 	(move_by_pieces_1): Read/write the leftovers using overlapping
>> 	memory locations to reduce the number of reads/writes.
>> 	(store_by_pieces_1): Take the min of max_size and len to speed
>> 	things up and to take advantage of the mode in store_by_pieces_2.
>> 	(store_by_pieces_2): Write the leftovers using an overlapping
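
For reference, a minimal sketch of the overlapping-leftover idea described in
the original patch summary and in the ChangeLog above (plain C rather than the
actual expr.c code; the function name is illustrative, and it assumes unaligned
8-byte accesses are acceptable on the target):

  #include <stddef.h>
  #include <string.h>

  /* Copy LEN bytes, 8 <= LEN <= 16, using at most two 8-byte accesses:
     one for bytes 0-7 and, if anything is left over, one access that
     overlaps the already-copied bytes and covers the last 8 bytes,
     instead of separate 4-, 2- and 1-byte pieces.  */
  static void
  copy_small (unsigned char *dst, const unsigned char *src, size_t len)
  {
    memcpy (dst, src, 8);                        /* bytes 0 .. 7     */
    if (len > 8)
      memcpy (dst + len - 8, src + len - 8, 8);  /* overlapping tail */
  }

For len == 15 this performs exactly the 0-7 / 7-14 copy from the start of the
thread.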