On Fri, Jul 25, 2008 at 9:08 AM, Agner Fog <[EMAIL PROTECTED]> wrote:
> Raksit Ashok wrote:
>> There is a more optimized version for 64-bit:
>> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
>> I think this looks similar to your implementation, Agner.
>
> Yes it is similar to my code.

A 3164-line source file which implements memcpy(). You have got to be
kidding. How much of the L1 icache does it blow away in the process?
I bet it performs wonderfully on microbenchmarks, though.

2991         .balign 16      # sadistic alignment strikes again
2992 L(bkPxQx): .int L(bkP0Q0)-L(bkPxQx)    # why use two bytes when we can use four?

Seriously. What possible reason can there be to align a randomly
accessed data table to 16 bytes? 4 bytes I understand, but 16?
--
vda