Ping.

On 2 July 2013 18:37, Michael Zolotukhin
<michael.v.zolotuk...@gmail.com> wrote:
> Hi guys,
> Thanks for the review and for your responses - please find my answers
> below.
>
>> Yes, I looked at the patch in detail this week (I am currently
>> travelling with sporadic internet access, and my days before leaving
>> were extremely hectic). The patch is OK except for the expr.c bits,
>> which I cannot approve myself and which ought to go in separately
>> anyway.
> You probably meant the changes in emit-rtl.c. Originally they were made
> to overcome an issue with defining the alignment of a MEM_REF, but now
> they seem to be unneeded - everything works well even without them.
>
>> The reason I took so long to decide on the patch is that I did some
>> experiments with the SSE loop, and it is hard to find scenarios with
>> the current alignment/block-size code where it is a win over a
>> library call in the current implementation (or with Ondra's). So I
>> did not find a block-size/chip combination where the current inline
>> SSE loop would be the default codegen strategy. One of the cases
>> where the inline loop wins is when the extra register pressure
>> imposed by the call causes IRA to not do a very good job on the
>> surrounding code.
> As I mentioned earlier, this implementation doesn't pretend to be
> finished and completely tuned - it's just a base for future work in
> this direction. But nevertheless, I think that even now it could be
> used for cases where the alignment is statically known and the block
> size is big enough - I doubt that a libcall would beat inlining in
> that case, due to call overheads (but yes, currently that's just my
> assumption).
>
>> Michael, my apologies for taking so long to decide here.
> No problem, Jan, thanks for the review.
>
>> Do you think you can work on the memset and
>> move_by_pieces/store_by_pieces bits? I think the by-pieces code is a
>> lot more promising than the actual inline SSE loops.
> I think I'll find some time for implementing memset (in i386.c) - but
> I'm going to restrict the SSE loops to the cases where we are zeroing
> memory. That means we'll give up if we're filling memory with some
> other value. I suppose that zeroing is the most common use of memset,
> so it'll be enough. What do you think, does that sound reasonable?
>
> As for move_by_pieces and store_by_pieces - I thought about them. An
> implementation there would be much more complicated, and it'd be much
> harder to tune: the reason is obvious - we would need to make a common
> implementation that works for all architectures.
> Instead of that, I've been thinking of adding yet another algorithm
> for expanding memcpy/memset in i386.c - something like
> fully_unrolled_loop. It would perform the copying without any loop at
> all and presumably would be useful for copying small blocks. With this
> implemented, we could change MOVE_RATIO to always expand
> memmove/memset with some algorithm from i386.c and not use
> move_by_pieces/store_by_pieces at all. I think such an implementation
> could be made much better optimized than the generic one in
> move_by_pieces.
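[For illustration, the kind of zero-only SSE loop proposed above could
look roughly like this at the source level. This is a hypothetical
sketch, not code from the patch: the function name is invented, and it
assumes a 16-byte-aligned destination and a size that is a multiple of
16, whereas the real expander would also emit prologue/epilogue code
for a misaligned head and tail.]

  #include <stddef.h>
  #include <emmintrin.h>

  /* Zero N bytes at DST with 16-byte SSE stores: one pxor to
     materialize the zero value, then one movdqa store per iteration.  */
  static void
  zero_block_sse (char *dst, size_t n)
  {
    __m128i zero = _mm_setzero_si128 ();
    for (size_t i = 0; i < n; i += 16)
      _mm_store_si128 ((__m128i *) (dst + i), zero);
  }

[The attraction of the zero-only restriction is that the fill value
needs no broadcast: a single pxor produces it, while a general memset
would first have to splat the byte across the vector register.]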
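[And a sketch of what the proposed fully_unrolled_loop strategy might
produce for a small constant-size copy - again hypothetical, with the
function name and the 24-byte size invented for illustration. Each
fixed-size memcpy below is lowered by the compiler to a single
load/store pair, so the whole body becomes straight-line code with no
loop or branch.]

  #include <stdint.h>
  #include <string.h>

  /* Copy 24 bytes from SRC to DST as three 8-byte load/store pairs,
     i.e. what memcpy (dst, src, 24) could expand to without a loop.  */
  static void
  copy24_unrolled (char *dst, const char *src)
  {
    uint64_t t0, t1, t2;
    memcpy (&t0, src, 8);
    memcpy (&t1, src + 8, 8);
    memcpy (&t2, src + 16, 8);
    memcpy (dst, &t0, 8);
    memcpy (dst + 8, &t1, 8);
    memcpy (dst + 16, &t2, 8);
  }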
>> It does not measure real-world issues like icache pressure etc. I
>> would like more definitive data. One possibility is to compile gcc
>> with various options and repeatedly run it to compile a simple
>> project for a day or so.
> I agree, but we still need some tests to measure performance - whether
> those tests are Spec or something else.
>
> I attached the updated patch - it's the same as the previous one, but
> without the emit-rtl.c changes. Is it OK for trunk?
>
> The patch is bootstrapped and regtested on i386 and x86-64. Spec2000
> also passes without regressions (I checked only stability, not
> performance).
>
> Changelog:
> 2013-07-02  Michael Zolotukhin  <michael.v.zolotuk...@gmail.com>
>
>         * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
>         * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
>         adjust_address instead of change_address to keep info about
>         alignment.
>         (emit_strmov): Remove.
>         (emit_memmov): New function.
>         (expand_movmem_epilogue): Refactor to properly handle bigger
>         sizes.
>         (expand_movmem_prologue): Likewise and return updated rtx for
>         destination.
>         (expand_constant_movmem_prologue): Likewise and return updated
>         rtx for destination and source.
>         (decide_alignment): Refactor, handle vector_loop.
>         (ix86_expand_movmem): Likewise.
>         (ix86_expand_setmem): Likewise.
>         * config/i386/i386.opt (Enum): Add vector_loop to option
>         stringop_alg.
>
> Thanks,
> Michael
>
>> Honza
>>
>> Ondra
>
> On 30 June 2013 23:22, Ondřej Bílka <nel...@seznam.cz> wrote:
>> On Sun, Jun 30, 2013 at 11:06:47AM +0200, Jan Hubicka wrote:
>>> > On Tue, Jun 25, 2013 at 3:36 PM, Michael Zolotukhin
>>> > <michael.v.zolotuk...@gmail.com> wrote:
>>> > > Ping.
>>> > >
>>> > > On 20 June 2013 20:56, Michael Zolotukhin
>>> > > <michael.v.zolotuk...@gmail.com> wrote:
>>> > >> It seems that one of the tests needed a small fix.
>>> > >> Attached is a corrected version.
>>> >
>>> > Jan, do you plan to review this patch? It touches the area that
>>> > you worked on extensively some time ago, so your expert opinion
>>> > would be much appreciated here.
>>>
>>> Yes, I looked at the patch in detail this week (I am currently
>>> travelling with sporadic internet access, and my days before leaving
>>> were extremely hectic). The patch is OK except for the expr.c bits,
>>> which I cannot approve myself and which ought to go in separately
>>> anyway.
>>>
>>> The reason I took so long to decide on the patch is that I did some
>>> experiments with the SSE loop, and it is hard to find scenarios with
>>> the current alignment/block-size code where it is a win over a
>>> library call in the current implementation (or with Ondra's). So I
>>> did not find a block-size/chip combination where the current inline
>>> SSE loop would be the default codegen strategy. One of the cases
>>> where the inline loop wins is when the extra register pressure
>>> imposed by the call causes IRA to not do a very good job on the
>>> surrounding code.
>>>
>> It does not measure real-world issues like icache pressure etc. I
>> would like more definitive data. One possibility is to compile gcc
>> with various options and repeatedly run it to compile a simple
>> project for a day or so.
>>
>> I am interested in up to which size inlining code makes sense; my
>> guess is that beyond 64 bytes, and with no FP registers, inlining is
>> unproductive.
>>
>> I tried this to see how LD_PRELOADing memcpy affects performance, and
>> it is measurable after about a day; the performance impact is small -
>> in my case it was about 0.5% - so you really need that long to
>> converge.
>>
>> Ondra
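[For reference, the LD_PRELOAD experiment Ondra describes can be done
with an interposer along these lines - a hypothetical minimal version,
where the byte loop stands in for whatever candidate memcpy
implementation is being measured.]

  #include <stddef.h>

  /* Built as a shared object and preloaded, this definition shadows
     libc's memcpy for every call in the benchmarked program.  */
  void *
  memcpy (void *dst, const void *src, size_t n)
  {
    char *d = dst;
    const char *s = src;
    while (n--)
      *d++ = *s++;
    return dst;
  }

[Compiled with something like "gcc -shared -fPIC -o memcpy-test.so
memcpy-test.c" and run via "LD_PRELOAD=./memcpy-test.so <benchmark>",
this swaps implementations under an unmodified binary, which is what
makes the day-long convergence experiment practical.]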
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.

--
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.