On Wed, Jul 30, 2008 at 5:14 PM, Agner Fog <[EMAIL PROTECTED]> wrote:
> Denys Vlasenko wrote:
>>>
>>> 3164 line source file which implements memcpy().
>>> You've got to be kidding.
>>> How much of L1 icache it blows away in the process?
>>> I bet it performs wonderfully on microbenchmarks though.
>>>
>
> I agree that the OpenSolaris memcpy is bigger than necessary. However, it is
> necessary to have 16 branches for covering all possible alignments modulo
> 16. This is because, unfortunately, there is no XMM shift instruction with a
> variable count, only with a constant count, so we need one branch for each
> value of the shift count. Since only one of the branches is used, it doesn't
> take much space in the code cache. The speed is improved by a factor of 4-5 by
> this 16-branch algorithm, so it is certainly worth the extra complexity.
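
For anyone following along, here is a minimal sketch of the 16-branch idea described above, written with SSE2 intrinsics rather than assembly. It is not the OpenSolaris code. The names copy_from_misaligned, COMBINE and SHIFT_LOOP are made up for illustration, and the sketch assumes an x86 target, a 16-byte-aligned destination, and a source base pointer that has already been rounded down to a 16-byte boundary with nblocks + 1 blocks readable. The byte-shift intrinsics _mm_srli_si128 and _mm_slli_si128 only accept a compile-time constant count, which is why each shift value needs its own loop.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>

/* Build the 16 bytes that start N bytes into `lo`, taking the remainder
   from the next aligned block `hi`. N must be a compile-time constant
   because the byte shifts only take an immediate count. */
#define COMBINE(lo, hi, N) \
    _mm_or_si128(_mm_srli_si128((lo), (N)), _mm_slli_si128((hi), 16 - (N)))

/* One branch: a complete copy loop specialised for shift count N. */
#define SHIFT_LOOP(N)                                       \
    case N:                                                 \
        for (i = 0; i < nblocks; i++) {                     \
            __m128i hi = _mm_load_si128(s + i + 1);         \
            _mm_store_si128(d + i, COMBINE(lo, hi, N));     \
            lo = hi;                                        \
        }                                                   \
        break;

/* Hypothetical helper: copy nblocks * 16 bytes from src_base + shift to dst.
   Assumptions: dst and src_base are 16-byte aligned, shift < 16, and
   (nblocks + 1) * 16 bytes are readable starting at src_base. */
void copy_from_misaligned(uint8_t *dst, const uint8_t *src_base,
                          unsigned shift, size_t nblocks)
{
    const __m128i *s = (const __m128i *)src_base;
    __m128i *d = (__m128i *)dst;
    __m128i lo;
    size_t i;

    assert(shift < 16);
    assert(((uintptr_t)src_base & 15) == 0);
    assert(((uintptr_t)dst & 15) == 0);

    lo = _mm_load_si128(s);  /* first aligned source block */

    /* The dispatch runs once per call; only the selected loop executes. */
    switch (shift) {
    case 0:  /* source already aligned: plain aligned copy */
        for (i = 0; i < nblocks; i++)
            _mm_store_si128(d + i, _mm_load_si128(s + i));
        break;
    SHIFT_LOOP(1)  SHIFT_LOOP(2)  SHIFT_LOOP(3)  SHIFT_LOOP(4)
    SHIFT_LOOP(5)  SHIFT_LOOP(6)  SHIFT_LOOP(7)  SHIFT_LOOP(8)
    SHIFT_LOOP(9)  SHIFT_LOOP(10) SHIFT_LOOP(11) SHIFT_LOOP(12)
    SHIFT_LOOP(13) SHIFT_LOOP(14) SHIFT_LOOP(15)
    }
}
```

All loads and stores in the loops are aligned, and the switch is taken once per call, so only one of the 16 loops is touched for a given copy. A real memcpy would also have to handle the head and tail bytes and an unaligned destination, which this sketch leaves out.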

You forgot to look at PowerPC:

http://cvs.opensolaris.org/source/xref/ppc-dev/ppc-dev/usr/src/lib/libc/ppc/gen/memcpy.s

Is that nice and small?


Dennis Clarke
