On Wed, Jul 30, 2008 at 5:14 PM, Agner Fog <[EMAIL PROTECTED]> wrote:
> Denys Vlasenko wrote:
>>>
>>> 3164 line source file which implements memcpy().
>>> You got to be kidding.
>>> How much of L1 icache it blows away in the process?
>>> I bet it performs wonderfully on microbenchmarks though.
>>>
>
> I agree that the OpenSolaris memcpy is bigger than necessary. However, it is
> necessary to have 16 branches for covering all possible alignments modulo
> 16. This is because, unfortunately, there is no XMM shift instruction with a
> variable count, only with a constant count, so we need one branch for each
> value of the shift count. Since only one of the branches is used, it doesn't
> take much space in the code cache. The speed is improved by a factor 4-5 by
> this 16-branch algorithm, so it is certainly worth the extra complexity.
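
For anyone following along, the constant-count restriction quoted above can be sketched roughly as follows. This is only an illustration, assuming SSE2 intrinsics from <emmintrin.h> on x86; the names load_shifted, combine and SHIFT_CASE are hypothetical and are not taken from the OpenSolaris source. The byte-shift intrinsics only accept compile-time constant counts, so the switch needs one case per alignment modulo 16. On targets without SSE2 the sketch falls back to a plain memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* One case per misalignment: the XMM byte-shift counts must be
   compile-time constants, so each case hard-codes its own pair. */
#define SHIFT_CASE(N)                                      \
    case N:                                                \
        return _mm_or_si128(_mm_srli_si128(lo, N),         \
                            _mm_slli_si128(hi, 16 - (N)));

/* Merge two adjacent aligned 16-byte blocks into the 16 bytes that
   start `shift` bytes into the first block. */
static __m128i combine(__m128i lo, __m128i hi, unsigned shift)
{
    switch (shift & 15) {
    SHIFT_CASE(0)  SHIFT_CASE(1)  SHIFT_CASE(2)  SHIFT_CASE(3)
    SHIFT_CASE(4)  SHIFT_CASE(5)  SHIFT_CASE(6)  SHIFT_CASE(7)
    SHIFT_CASE(8)  SHIFT_CASE(9)  SHIFT_CASE(10) SHIFT_CASE(11)
    SHIFT_CASE(12) SHIFT_CASE(13) SHIFT_CASE(14) SHIFT_CASE(15)
    }
    return lo; /* not reached: every value of shift & 15 is handled */
}
#endif

/* Hypothetical helper: copy the 16 bytes starting `shift` bytes into a
   32-byte window whose base address is 16-byte aligned. */
static void load_shifted(uint8_t out[16], const uint8_t *aligned32,
                         unsigned shift)
{
    assert(((uintptr_t)aligned32 & 15) == 0);
#if defined(__SSE2__)
    /* Only aligned loads are issued; the shifts do the realignment. */
    __m128i lo = _mm_load_si128((const __m128i *)aligned32);
    __m128i hi = _mm_load_si128((const __m128i *)(aligned32 + 16));
    _mm_storeu_si128((__m128i *)out, combine(lo, hi, shift));
#else
    /* Fallback for targets without SSE2 (assumption: plain byte copy). */
    memcpy(out, aligned32 + (shift & 15), 16);
#endif
}
```

A real memcpy() would pick the case once, outside the copy loop, and then run a loop specialised for that shift count. That matches the point above that only one of the 16 branches is actually executed for a given call.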

You forgot to look at PowerPC:

http://cvs.opensolaris.org/source/xref/ppc-dev/ppc-dev/usr/src/lib/libc/ppc/gen/memcpy.s

Is that nice and small?

Dennis Clarke