On Fri, Apr 12, 2013 at 01:08:15PM +0400, Michael Zolotukhin wrote:
> > I did some profiling of builtin implementation, download this
> > http://kam.mff.cuni.cz/~ondra/memcpy_profile_builtin.tar.bz2
> Nice data, thanks!
> Could you please describe what is memcpy_new_builtin here? Is it how
> GCC expanded memcpy with this patch?
> Is this a comparison between libcall, libcall with your version of
> glibc, and expanded memmov with implementation from this patch?
>
I try to keep the benchmarks self-contained. The benchmark now measures
the libcall, the libcall with my version, and the current builtin
expansion.
I updated my benchmark. One of the problems with measuring memcpy is
that most memory operations happen asynchronously, so this version
should capture that. (The padding should now be sufficient, but I have
not yet subtracted its cost from the measured time.)

memcpy_gcc_builtin now measures the builtin for the first 100 sizes and
then switches to my implementation. I added memcpy_new_builtin, which is
currently the same as memcpy_gcc_builtin.

To add your implementation, compile the variant/builtin.c file into
variant/builtin.s, then run ./benchmark.

Ondra

> Michael
>
> On 12 April 2013 12:54, Ondřej Bílka <nel...@seznam.cz> wrote:
> > On Thu, Apr 11, 2013 at 04:32:30PM +0400, Michael Zolotukhin wrote:
> >> > 128 is about upper bound you can expand with sse moves.
> >> > Tuning did not take into account code size and measured only when code
> >> > is in tight loop.
> >> > For GPR-moves limit is around 64.
> >> Thanks for the data - I've not performed measurements with this
> >> implementation yet, but we surely should adjust thresholds to avoid
> >> performance degradations on small sizes.
> >>
> >
> > I did some profiling of builtin implementation, download this
> > http://kam.mff.cuni.cz/~ondra/memcpy_profile_builtin.tar.bz2
> >
> > see files results_rand/result.html and results_rand_noicache/result.html
> >
> > A memcpy_new_builtin for sizes x0,x1...x5 calls builtin and new
> > otherwise.
> > I did same for memcpy_glibc to see variance.
> >
> > memcpy_new does not call builtin.
> >
> > To regenerate graphs on other arch run benchmarks script.
> > To use other builtin change in Makefile how to compile variant/builtin.c
> > file.
> >
> > A builtin are faster by inlined function call, I did not add that as I
> > do not know estimate of this cost.
> >
> >> Michael
> >>
> >> On 10 April 2013 22:53, Ondřej Bílka <nel...@seznam.cz> wrote:
> >> > On Wed, Apr 10, 2013 at 09:53:09PM +0400, Michael Zolotukhin wrote:
> >> >> > Hi, I am writing memcpy for libc.
> >> >> > It avoids computed jump and is
> >> >> > much faster on small strings (variant for sandy bridge attached).
> >> >>
> >> >> I'm not sure I get what you meant - could you please explain what is
> >> >> computed jumps?
> >> > computed goto. See Duff's device it works almost exactly same.
> >> >>
> >> >> > You must also check performance with cold instruction cache.
> >> >> > Now memcpy(x,y,128) takes 126 bytes which is too much.
> >> >>
> >> >> > Do not align for small sizes. Dependency caused by this erases any
> >> >> > gains that you might get. Keep in mind that in 55% of cases data
> >> >> > are already aligned.
> >> >>
> >> >> Other algorithms are still available and we can use them for small
> >> >> sizes. E.g. for sizes <128 we could emit loop with GPR-moves and don't
> >> >> use vector instructions in it.
> >> >
> >> > 128 is about upper bound you can expand with sse moves.
> >> > Tuning did not take into account code size and measured only when code
> >> > is in tight loop.
> >> > For GPR-moves limit is around 64.
> >> >
> >> > What matters which code has best performance/size ratio.
> >> >> But that's tuning and I haven't worked on it yet - I'm going to
> >> >> measure performance of all algorithms on all sizes and thus defines on
> >> >> which sizes which algorithm is preferable.
> >> >> What I did in this patch is introducing some infrastructure to allow
> >> >> emitting of vector moves in movmem expanding - tuning is certainly
> >> >> possible and needed, but that's out of the scope of the patch.
> >> >>
> >> >> On 10 April 2013 21:43, Ondřej Bílka <nel...@seznam.cz> wrote:
> >> >> > On Wed, Apr 10, 2013 at 08:14:30PM +0400, Michael Zolotukhin wrote:
> >> >> >> Hi,
> >> >> >> This patch adds a new algorithm of expanding movmem in x86 and a bit
> >> >> >> refactor existing implementation.
> >> >> >> This is a reincarnation of the patch
> >> >> >> that was sent wasn't checked couple of years ago - now I reworked it
> >> >> >> from scratch and divide into several more manageable parts.
> >> >> >>
> >> >> > Hi, I am writing memcpy for libc. It avoids computed jump and is
> >> >> > much faster on small strings (variant for sandy bridge attached).
> >> >> >
> >> >> >> For now this algorithm isn't used, because cost_models are tuned to
> >> >> >> use existing ones. I believe the new algorithm will give better
> >> >> >> performance, but I'll leave cost-models tuning for a separate patch.
> >> >> >>
> >> >> > You must also check performance with cold instruction cache.
> >> >> > Now memcpy(x,y,128) takes 126 bytes which is too much.
> >> >> >
> >> >> >> Also, I changed get_mem_align_offset to make it handle MEM_REFs as
> >> >> >> well. Probably, there is another way of getting info about alignment -
> >> >> >> if so, please let me know.
> >> >> >>
> >> >> > Do not align for small sizes. Dependency caused by this erases any
> >> >> > gains that you might get. Keep in mind that in 55% of cases data
> >> >> > are already aligned.
> >> >> >
> >> >> > Also in my tests best way to handle prologue is first copy last 16
> >> >> > bytes and then loop.
> >> >> >
> >> >> >> Similar improvements could be done in expanding of memset, but that's
> >> >> >> in progress now and I'm going to proceed with it if this patch is ok.
> >> >> >>
> >> >> >> Bootstrap/make check/Specs2k are passing on i686 and x86_64.
> >> >> >>
> >> >> >> Is it ok for trunk?
> >> >> >>
> >> >> >> Changelog entry:
> >> >> >>
> >> >> >> 2013-04-10  Michael Zolotukhin  <michael.v.zolotuk...@gmail.com>
> >> >> >>
> >> >> >>     * config/i386/i386-opts.h (enum stringop_alg): Add
> >> >> >>     vector_loop.
> >> >> >>     * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
> >> >> >>     adjust_address instead of change_address to keep info about
> >> >> >>     alignment.
> >> >> >>     (emit_strmov): Remove.
> >> >> >>     (emit_memmov): New function.
> >> >> >>     (expand_movmem_epilogue): Refactor to properly handle bigger
> >> >> >>     sizes.
> >> >> >>     (expand_movmem_epilogue): Likewise and return updated rtx for
> >> >> >>     destination.
> >> >> >>     (expand_constant_movmem_prologue): Likewise and return
> >> >> >>     updated rtx for destination and source.
> >> >> >>     (decide_alignment): Refactor, handle vector_loop.
> >> >> >>     (ix86_expand_movmem): Likewise.
> >> >> >>     (ix86_expand_setmem): Likewise.
> >> >> >>     * config/i386/i386.opt (Enum): Add vector_loop to option
> >> >> >>     stringop_alg.
> >> >> >>     * emit-rtl.c (get_mem_align_offset): Compute alignment for
> >> >> >>     MEM_REF.
> >>
> >> --
> >> ---
> >> Best regards,
> >> Michael V. Zolotukhin,
> >> Software Engineer
> >> Intel Corporation.
> >
> > --
> >
> > Spider infestation in warm case parts
> >
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.

--

doppler effect
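
The prologue advice quoted in the thread ("first copy last 16 bytes and
then loop") can be sketched in C roughly as follows. This is only an
illustration under assumptions: the name copy_tail_first is
hypothetical, the buffers must not overlap, and fixed-size 16-byte
memcpy calls stand in for the SSE moves a real expansion would emit. It
is not the code from the patch or from the attached Sandy Bridge
variant.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies n bytes from src to dst (non-overlapping). */
static void copy_tail_first(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    if (n < 16) {
        /* Small sizes: plain byte loop, no alignment work at all. */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
        return;
    }

    /* Copy the last 16 bytes first. This block may overlap the final
       loop block, which removes the need for a separate epilogue. */
    memcpy(dst + n - 16, src + n - 16, 16);

    /* Copy 16-byte blocks from the start. The loop stops once the
       remaining bytes are already covered by the tail copy above. */
    for (size_t i = 0; i + 16 < n; i += 16)
        memcpy(dst + i, src + i, 16);
}
```

Because the tail is written unconditionally, the loop needs no size
check beyond its exit condition, which is what keeps the expanded code
short for small n.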