On Thu, 12 Jun 2014, Jeff Law wrote: > On 06/12/14 04:12, Richard Biener wrote: > > > > This implements the requested inlining of memmove for possibly > > overlapping arguments by doing first all loads and then all stores. > > The easiest place is to do this in memory op folding where we already > > perform inlining of some memcpy cases (but fail to do the equivalent > > memcpy optimization - though RTL expansion later does it). > > > > The following patch restricts us to max. word-mode size. Ideally > > we'd have a way to check for the number of real instructions needed > > to load an (aligned) value of size N. But maybe we don't care > > and are fine with doing multiple loads / stores? > > > > Anyway, the following is conservative (but maybe not enough). > > > > Bootstrap / regtest running on x86_64-unknown-linux-gnu. > > > > These transforms don't really belong to GENERIC folding (they > > also run at -O0 ...), similar to most builtin foldings. But this > > patch is not to change that. > > > > Any comments on the size/cost issue? > I recall seeing something in one of the BZ databases that asked for > double-word to be expanded inline. Presumably the reporter's code did > lots of double-word things of this nature. > > Obviously someone else might want quad-word and so-on. However, double > words seem like a very reasonable request.
Hmm, yeah. Meanwhile I found the MOVE_MAX target macro which gives me what I was looking for (max memory-reg move size with a single instruction). Eventually increasing the maximum number of loads/stores to two also allows handling of (some) non-power-of-two sizes and some more cases of unaligned accesses on SLOW_UNALIGNED_ACCESS targets. But then we're re-implementing move_by_pieces on GIMPLE ... ;) (maybe not totally unreasonable?). I'm going to go for a single load/store and MOVE_MAX for now - I have quite some fallout to deal with anyway (analyzed strlenopt-1.c FAIL only, the other strlenopt cases are probably similar) FAIL: gcc.dg/strlenopt-1.c scan-tree-dump-times strlen "strlen \\\\(" 2 FAIL: gcc.dg/strlenopt-1.c scan-tree-dump-times strlen "memcpy \\\\(" 4 we generate (note unfolded read from "/"!) _15 = MEM[(char * {ref-all})"/"]; MEM[(char * {ref-all})_14] = _15; where strlenopt doesn't catch (short *)_14 = (short)"/\0" (maybe too much asked, given endianess and so ...). FAIL: gcc.dg/strlenopt-10.c scan-tree-dump-times strlen "strlen \\\\(" 2 FAIL: gcc.dg/strlenopt-10.c scan-tree-dump-times strlen "memcpy \\\\(" 8 FAIL: gcc.dg/strlenopt-10.c scan-tree-dump-times strlen "strchr \\\\(" 0 FAIL: gcc.dg/strlenopt-10.c scan-tree-dump-times strlen "memcpy \\\\([^\\n\\r]*, 1\\\\)" 1 FAIL: gcc.dg/strlenopt-18g.c scan-tree-dump-times strlen "memcpy \\\\(" 4 FAIL: gcc.dg/strlenopt-19.c scan-tree-dump-times strlen "memcpy \\\\(" 6 FAIL: gcc.dg/strlenopt-1f.c scan-tree-dump-times strlen "strlen \\\\(" 2 FAIL: gcc.dg/strlenopt-1f.c scan-tree-dump-times strlen "memcpy \\\\(" 4 FAIL: gcc.dg/strlenopt-2.c scan-tree-dump-times strlen "strlen \\\\(" 2 FAIL: gcc.dg/strlenopt-2.c scan-tree-dump-times strlen "memcpy \\\\(" 5 FAIL: gcc.dg/strlenopt-20.c scan-tree-dump-times strlen "memcpy \\\\(" 4 FAIL: gcc.dg/strlenopt-21.c scan-tree-dump-times strlen "memcpy \\\\(" 3 FAIL: gcc.dg/strlenopt-6.c scan-tree-dump-times strlen "memcpy \\\\(" 7 FAIL: gcc.dg/strlenopt-6.c scan-tree-dump-times strlen "strchr \\\\(" 0 FAIL: gcc.dg/strlenopt-7.c scan-tree-dump-times strlen "strlen \\\\(" 0 FAIL: gcc.dg/strlenopt-7.c scan-tree-dump-times strlen "memcpy \\\\(" 2 FAIL: gcc.dg/strlenopt-7.c scan-tree-dump-times optimized "return 3;" 1 FAIL: gcc.dg/strlenopt-8.c scan-tree-dump-times strlen "strlen \\\\(" 0 FAIL: gcc.dg/strlenopt-8.c scan-tree-dump-times strlen "memcpy \\\\(" 4 FAIL: gcc.dg/strlenopt-9.c scan-tree-dump-times strlen "memcpy \\\\(" 6 FAIL: gfortran.dg/coarray_lib_realloc_1.f90 -O scan-tree-dump-times original "__builtin_memcpy" 2 Both memcpy calls are now plain assignments. The testcase needs changing to test what it really wanted to test ... FAIL: gfortran.dg/pr45636.f90 -O scan-tree-dump-times forwprop2 "memset" 0 This shows that funny DSE patterns in forwprop no longer work when the store is not a memcpy but a plain assignment. Of course DSE also doesn't remove dead memset()s. The testcase also shows that we could/should handle memset() in the same way if it only requires a single store from a constant. Thanks, Richard.