On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> >> > libcall is not faster up to 8KB to rep sequence that is better for 
> >> > regalloc/code
> >> > cache than fully blowin function call.
> >>
> >> Be careful with this. My recollection is that REP sequence is good for
> >> any size -- for smaller size, the REP initial set up cost is too high
> >> (10s of cycles), while for large size copy, it is less efficient
> >> compared with library version.
> >
> > Well this is based on the data from the memtest script.
> > Core has good REP implementation - it is a win from rather small blocks (16
> > bytes if I recall) and it does not need alignment.
> > Library version starts to be interesting with caching hints, but I think 
> > till 80KB
> > it is still not a win for my setup (glibc-2.15)
> 
> A simple test shows that -mstringop-strategy=libcall always beats
> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
> Can you share your memtest ?

I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
libcall.  The PLT call overhead is simply too high.

        Jakub

Reply via email to