Gunnar von Boehn writes:

> The "regular" code was much slower for the normal case and has a special
> version for the 4K optimized case.

That's a slightly inaccurate view...

The reason for having the two cases is that when I profiled the
distribution of sizes and alignments of memory copies in the kernel,
the result was that almost all copies (something like 99%, IIRC) were
either 128 bytes or less, or else a whole page at a page-aligned
address.

Thus we get the best performance by having a simple copy routine with
minimal setup overhead for the small copy case, plus an aggressively
optimized page copy routine. Spending time setting up for a
multi-cacheline copy that's not a whole page is just going to hurt the
small copy case without providing any real benefit.

Transferring data over loopback is possibly an exception to that.
However, it's very rare to transfer large amounts of data over
loopback, unless you're running a benchmark like iperf or netperf. :-/

Paul.
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
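
For illustration only, here is a rough sketch of how copies could be sorted into the buckets described above (128 bytes or less, a whole page-aligned page, everything else). It is not the profiling code that was actually used. The names classify_copy, record_copy and struct copy_stats are hypothetical, the 4096-byte page size is an assumption, and requiring both source and destination to be page-aligned is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */
#define SKETCH_SMALL_MAX 128u   /* "128 bytes or less" */

/* Buckets matching the cases discussed in the text. */
enum copy_class { COPY_SMALL, COPY_PAGE, COPY_OTHER };

/* Hypothetical per-bucket counters for a profiling run. */
struct copy_stats {
    unsigned long count[3];
};

/* Classify one copy by length and alignment; the buffers are not touched. */
static enum copy_class classify_copy(const void *dst, const void *src,
                                     size_t len)
{
    /* Non-zero if either address is not page-aligned (assumption: both count). */
    uintptr_t misalign = ((uintptr_t)dst | (uintptr_t)src) &
                         (uintptr_t)(SKETCH_PAGE_SIZE - 1);

    if (len <= SKETCH_SMALL_MAX)
        return COPY_SMALL;
    if (len == SKETCH_PAGE_SIZE && misalign == 0)
        return COPY_PAGE;
    return COPY_OTHER;
}

/* Add one observed copy to the histogram. */
static void record_copy(struct copy_stats *st, const void *dst,
                        const void *src, size_t len)
{
    assert(st != NULL);
    st->count[classify_copy(dst, src, len)]++;
}
```

If COPY_OTHER turns out to be around 1% of the total, as in the profile described above, extra setup for that case mostly costs the small copies time.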