Ping.

On 2 July 2013 18:37, Michael Zolotukhin
<michael.v.zolotuk...@gmail.com> wrote:
> Hi guys,
> Thanks for the review and for your responses - please find my answers
> below.
>
>> Yes, I looked at the patch in detail this week (I am currently
>> travelling with sporadic internet access, and my days before leaving
>> were extremely hectic). The patch is OK except for the expr.c bits,
>> which I cannot approve myself and which ought to go in separately
>> anyway.
> You probably meant the changes in emit-rtl.c. Originally they were made
> to overcome an issue with defining the alignment of a MEM_REF, but now
> they seem to be unneeded - everything works well even without them.
>
>> The reason I took so long to decide on the patch is that I did some
>> experiments with the SSE loop, and it is hard to find scenarios with
>> the current alignment/block-size code where it is a win over a
>> library call in the current implementation (or with Ondra's). So I
>> did not find a block-size/chip combination where the current inline
>> SSE loop would be the default codegen strategy. One of the cases
>> where the inline loop wins is when the extra register pressure
>> imposed by the call causes IRA to not do a very good job on the
>> surrounding code.
> As I mentioned earlier, this implementation doesn't pretend to be
> finished and completely tuned - it's just a base for future work in
> this direction. But nevertheless, I think that even now it could be
> used for cases where the alignment is statically known and the block
> size is big enough - I doubt that a libcall would beat inlining in
> that case, due to call overheads (but yes, currently that's just my
> assumption).
>
>> Michael, my apologies for taking so long to decide here.
> No problem, Jan, thanks for the review.
>
>> Do you think you can work on the memset and
>> move_by_pieces/store_by_pieces bits? I think the by-pieces code is a
>> lot more promising than the actual inline SSE loops.
> I think I'll find some time for implementing memset (in i386.c) - but
> I'm going to restrict the SSE loops to the cases where we are zeroing
> memory. That means we'll give up if we're filling memory with some
> other value. I suppose that zeroing is the most common use of memset,
> so it'll be enough. What do you think, does that sound reasonable?
>
> As for move_by_pieces and store_by_pieces - I thought about them. An
> implementation there would be much more complicated, and it'd be much
> harder to tune: the reason is obvious - we would need to make a common
> implementation that works for all architectures.
> Instead of that, I've been thinking of adding yet another algorithm
> for expanding memcpy/memset in i386.c - something like
> fully_unrolled_loop. It would perform the copying without any loop at
> all and presumably would be useful for copying small blocks. With this
> implemented, we could change MOVE_RATIO to always expand
> memmove/memset with some algorithm from i386.c and not use
> move_by_pieces/store_by_pieces at all. I think such an implementation
> could be made much better optimized than the generic one in
> move_by_pieces.
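[For illustration, the kind of zero-only SSE loop proposed above could
look roughly like this at the source level. This is a hypothetical
sketch, not code from the patch: the function name is invented, and it
assumes a 16-byte-aligned destination and a size that is a multiple of
16, whereas the real expander would also emit prologue/epilogue code
for a misaligned head and tail.]

  #include <stddef.h>
  #include <emmintrin.h>

  /* Zero N bytes at DST with 16-byte SSE stores: one pxor to
     materialize the zero value, then one movdqa store per iteration.  */
  static void
  zero_block_sse (char *dst, size_t n)
  {
    __m128i zero = _mm_setzero_si128 ();
    for (size_t i = 0; i < n; i += 16)
      _mm_store_si128 ((__m128i *) (dst + i), zero);
  }

[The attraction of the zero-only restriction is that the fill value
needs no broadcast: a single pxor produces it, while a general memset
would first have to splat the byte across the vector register.]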
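[And a sketch of what the proposed fully_unrolled_loop strategy might
produce for a small constant-size copy - again hypothetical, with the
function name and the 24-byte size invented for illustration. Each
fixed-size memcpy below is lowered by the compiler to a single
load/store pair, so the whole body becomes straight-line code with no
loop or branch.]

  #include <stdint.h>
  #include <string.h>

  /* Copy 24 bytes from SRC to DST as three 8-byte load/store pairs,
     i.e. what memcpy (dst, src, 24) could expand to without a loop.  */
  static void
  copy24_unrolled (char *dst, const char *src)
  {
    uint64_t t0, t1, t2;
    memcpy (&t0, src, 8);
    memcpy (&t1, src + 8, 8);
    memcpy (&t2, src + 16, 8);
    memcpy (dst, &t0, 8);
    memcpy (dst + 8, &t1, 8);
    memcpy (dst + 16, &t2, 8);
  }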
>> It does not measure real-world issues like icache pressure etc. I
>> would like more definitive data. One possibility is to compile gcc
>> with various options and repeatedly run it to compile a simple
>> project for a day or so.
> I agree, but we still need some tests to measure performance - whether
> those tests are Spec or something else.
>
> I attached the updated patch - it's the same as the previous one, but
> without the emit-rtl.c changes. Is it OK for trunk?
>
> The patch is bootstrapped and regtested on i386 and x86-64. Spec2000
> also passes without regressions (I checked only stability, not
> performance).
>
> Changelog:
> 2013-07-02  Michael Zolotukhin  <michael.v.zolotuk...@gmail.com>
>
>         * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
>         * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
>         adjust_address instead of change_address to keep info about
>         alignment.
>         (emit_strmov): Remove.
>         (emit_memmov): New function.
>         (expand_movmem_epilogue): Refactor to properly handle bigger
>         sizes.
>         (expand_movmem_prologue): Likewise and return updated rtx for
>         destination.
>         (expand_constant_movmem_prologue): Likewise and return updated
>         rtx for destination and source.
>         (decide_alignment): Refactor, handle vector_loop.
>         (ix86_expand_movmem): Likewise.
>         (ix86_expand_setmem): Likewise.
>         * config/i386/i386.opt (Enum): Add vector_loop to option
>         stringop_alg.
>
> Thanks,
> Michael
>
>> Honza
>>
>> Ondra
>
> On 30 June 2013 23:22, Ondřej Bílka <nel...@seznam.cz> wrote:
>> On Sun, Jun 30, 2013 at 11:06:47AM +0200, Jan Hubicka wrote:
>>> > On Tue, Jun 25, 2013 at 3:36 PM, Michael Zolotukhin
>>> > <michael.v.zolotuk...@gmail.com> wrote:
>>> > > Ping.
>>> > >
>>> > > On 20 June 2013 20:56, Michael Zolotukhin
>>> > > <michael.v.zolotuk...@gmail.com> wrote:
>>> > >> It seems that one of the tests needed a small fix.
>>> > >> Attached is a corrected version.
>>> >
>>> > Jan, do you plan to review this patch? It touches the area that
>>> > you worked on extensively some time ago, so your expert opinion
>>> > would be much appreciated here.
>>>
>>> Yes, I looked at the patch in detail this week (I am currently
>>> travelling with sporadic internet access, and my days before leaving
>>> were extremely hectic). The patch is OK except for the expr.c bits,
>>> which I cannot approve myself and which ought to go in separately
>>> anyway.
>>>
>>> The reason I took so long to decide on the patch is that I did some
>>> experiments with the SSE loop, and it is hard to find scenarios with
>>> the current alignment/block-size code where it is a win over a
>>> library call in the current implementation (or with Ondra's). So I
>>> did not find a block-size/chip combination where the current inline
>>> SSE loop would be the default codegen strategy. One of the cases
>>> where the inline loop wins is when the extra register pressure
>>> imposed by the call causes IRA to not do a very good job on the
>>> surrounding code.
>>>
>> It does not measure real-world issues like icache pressure etc. I
>> would like more definitive data. One possibility is to compile gcc
>> with various options and repeatedly run it to compile a simple
>> project for a day or so.
>>
>> I am interested in up to which size inlining code makes sense; my
>> guess is that beyond 64 bytes, and with no FP registers, inlining is
>> unproductive.
>>
>> I tried this to see how LD_PRELOADing memcpy affects performance, and
>> it is measurable after about a day; the performance impact is small -
>> in my case it was about 0.5% - so you really need that long to
>> converge.
>>
>> Ondra
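[For reference, the LD_PRELOAD experiment Ondra describes can be done
with an interposer along these lines - a hypothetical minimal version,
where the byte loop stands in for whatever candidate memcpy
implementation is being measured.]

  #include <stddef.h>

  /* Built as a shared object and preloaded, this definition shadows
     libc's memcpy for every call in the benchmarked program.  */
  void *
  memcpy (void *dst, const void *src, size_t n)
  {
    char *d = dst;
    const char *s = src;
    while (n--)
      *d++ = *s++;
    return dst;
  }

[Compiled with something like "gcc -shared -fPIC -o memcpy-test.so
memcpy-test.c" and run via "LD_PRELOAD=./memcpy-test.so <benchmark>",
this swaps implementations under an unmodified binary, which is what
makes the day-long convergence experiment practical.]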
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.

--
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.