On Thu, Aug 7, 2014 at 4:20 PM, James Greenhalgh <james.greenha...@arm.com> wrote:
> On Tue, Aug 05, 2014 at 08:05:00AM +0100, Andrew Pinski wrote:
>> On Fri, Aug 1, 2014 at 2:21 AM, <pins...@gmail.com> wrote:
>> >> On Jun 6, 2014, at 1:50 AM, James Greenhalgh <james.greenha...@arm.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> The move_by_pieces infrastructure performs a copy by repeatedly trying
>> >> the largest safe copy it can make. So for a 15-byte copy we might see:
>> >>
>> >>   offset   amount   bytes copied
>> >>   0        8        0-7
>> >>   8        4        8-11
>> >>   12       2        12-13
>> >>   14       1        14
>> >>
>> >> However, we can implement a 15-byte copy as:
>> >>
>> >>   offset   amount   bytes copied
>> >>   0        8        0-7
>> >>   7        8        7-14
>> >>
>> >> which can prove more efficient for both space and speed.
>> >>
>> >> In this patch we set MOVE_RATIO low to avoid using move_by_pieces, and
>> >> implement the movmem pattern name to expand small block copy cases. Note,
>> >> this optimization does not apply for -mstrict-align targets, which must
>> >> continue copying byte-by-byte.
>> >
>> > Why not change move_by_pieces instead of having target-specific code?
>> > This seems like a better option. You can check the SLOW_UNALIGNED_ACCESS
>> > target macro to see if you want to do this optimization too. As I mentioned
>> > in the other email, make sure you check the volatileness of the from and to
>> > before doing this optimization.
>
> Hi Andrew,
>
> If we are converting these volatile copies to memcpy calls, then there is an
> additional bug there. There is nothing in the C standard which requires that
> memcpy/memmove read and write each byte only once, and it seems reasonable to
> assume that the movmem optab inherits this lack of restrictions. This gotcha
> either needs to be fixed, or at least documented for the movmem optab.
>
> If I'm going to fix this, I have to write it in the back-end. Failing
> the expand will cause a buggy call to memcpy. Having said that, I'm not
> sure I've seen a good definition of the semantics of a volatile struct
> copy. It feels to me that an element-by-element copy is more likely to
> match user expectations than a chunk-by-chunk copy. It is probably too
> late for me to get that by the time I am in movmem, so I'll have to push
> the fix earlier (probably somewhere generic?).
>
> Do you know of a good write-up/argument/discussion on volatile struct
> copy semantics? The testcase you provided is obviously broken; what is less
> obvious is what should happen for:
>
> struct __attribute__((packed)) t16 {
>   long long t8;
>   int t4;
>   short t2;
>   unsigned char t1;
>   unsigned char t1a;
> };
>
> volatile struct t16 t16;
>
> int f(struct t16 *a)
> {
>   t16 = *a;
> }
>
> We have at least two choices...
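
To make the two choices concrete, here is a rough standalone sketch (not from
the thread or the patch; the function names are purely illustrative) contrasting
an element-by-element copy of the t16 testcase above with a chunk-by-chunk copy:

  #include <string.h>

  struct __attribute__((packed)) t16 {
    long long t8;
    int t4;
    short t2;
    unsigned char t1;
    unsigned char t1a;
  };                              /* 16 bytes when packed */

  volatile struct t16 t16;

  /* Choice 1: element-by-element -- every field is accessed exactly once,
     with its declared width.  */
  void copy_by_element (struct t16 *a)
  {
    t16.t8  = a->t8;
    t16.t4  = a->t4;
    t16.t2  = a->t2;
    t16.t1  = a->t1;
    t16.t1a = a->t1a;
  }

  /* Choice 2: chunk-by-chunk -- two 8-byte accesses; the second covers
     t4, t2, t1 and t1a in a single access, so the fields are not accessed
     with their declared widths.  The casts drop the volatile qualifier and
     are only here to show the access pattern.  */
  void copy_by_chunk (struct t16 *a)
  {
    unsigned char *d = (unsigned char *) &t16;
    unsigned char *s = (unsigned char *) a;
    memcpy (d,     s,     8);     /* bytes 0-7  */
    memcpy (d + 8, s + 8, 8);     /* bytes 8-15 */
  }
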
It's the language frontend's job to present the middle-end with something
sensible, for example an element-wise copy. Also consider:

struct X { int i; volatile int j; int k[1024]; } x;

void f (struct X *a)
{
  x = *a;
}

Richard.

>> Attached is the patch which does what I mentioned. I also changed
>> store_by_pieces to implement a similar optimization there (for memset
>> and strcpy). Also, since I used SLOW_UNALIGNED_ACCESS, this is a
>> generic optimization.
>
> I'm not sure this helps your situation on AArch64. There are still AArch64
> implementations for which we will want to bypass move_by_pieces and provide
> a back-end implementation.
>
> We could more reasonably be controlling this with MOVE_BY_PIECES_P, but
> this is only a thin wrapper around MOVE_RATIO, so the result for you is
> much the same (pending a patch fixing SRA not to read MOVE_RATIO, it should
> make no difference whether we disable by MOVE_RATIO or MOVE_BY_PIECES_P).
>
> Have you done much micro/macro-benchmarking to show that this is indeed
> a sensible optimization for !SLOW_UNALIGNED_ACCESS targets? The documentation
> suggests that SLOW_UNALIGNED_ACCESS should be set if unaligned accesses
> are "many times" slower. This is a bit of a blunt hammer - there are likely
> targets which will suffer from this optimization, but which don't set
> SLOW_UNALIGNED_ACCESS. Maybe you need some finer-grained cost function?
>
> Thanks,
> James
>
>> I had tested an earlier version on x86_64-linux-gnu and I am in the
>> middle of bootstrap/testing on this one.
>>
>> Thanks,
>> Andrew Pinski
>>
>> 	* expr.c (move_by_pieces): Take the min of max_size and len to
>> 	speed things up and to take advantage of the mode in
>> 	move_by_pieces_1.
>> 	(move_by_pieces_1): Read/write the leftovers using overlapping
>> 	memory locations to reduce the number of reads/writes.
>> 	(store_by_pieces_1): Take the min of max_size and len to speed
>> 	things up and to take advantage of the mode in store_by_pieces_2.
>> 	(store_by_pieces_2): Write the leftovers using an overlapping
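
For reference, a minimal sketch of the overlapping-leftover idea described in
the original patch summary and in the ChangeLog above (plain C rather than the
actual expr.c code; the function name is illustrative, and it assumes unaligned
8-byte accesses are acceptable on the target):

  #include <stddef.h>
  #include <string.h>

  /* Copy LEN bytes, 8 <= LEN <= 16, using at most two 8-byte accesses:
     one for bytes 0-7 and, if anything is left over, one access that
     overlaps the already-copied bytes and covers the last 8 bytes,
     instead of separate 4-, 2- and 1-byte pieces.  */
  static void
  copy_small (unsigned char *dst, const unsigned char *src, size_t len)
  {
    memcpy (dst, src, 8);                        /* bytes 0 .. 7     */
    if (len > 8)
      memcpy (dst + len - 8, src + len - 8, 8);  /* overlapping tail */
  }

For len == 15 this performs exactly the 0-7 / 7-14 copy from the start of the
thread.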