Arnd Bergmann writes:

> On Friday 20 June 2008, Paul Mackerras wrote:
>
> > Transferring data over loopback is possibly an exception to that.
> > However, it's very rare to transfer large amounts of data over
> > loopback, unless you're running a benchmark like iperf or netperf. :-/
>
> Well, it is the exact case that came up in a real world scenario
> for cell: On a network intensive application where the SPUs are
> supposed to do all the work, we ended up not getting enough
> data in and out through gbit ethernet because the PPU spent
                          ^^^^^^^^^^^^^

Which isn't loopback... :)
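[As an aside, and only as a sketch -- the buffer size and MSG_WAITALL
are my assumptions, not details of Arnd's application -- receiving
into a page-aligned, page-multiple buffer at least starts the kernel's
copies on a page boundary:

#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Receive into a page-aligned, page-multiple buffer so that the
 * destination of the kernel's copy_to_user starts page-aligned. */
static ssize_t recv_bulk(int sock, size_t pages)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf;
	ssize_t n;

	if (posix_memalign(&buf, page, pages * page))
		return -1;

	n = recv(sock, buf, pages * page, MSG_WAITALL);

	/* ... hand the buffer off to the SPUs here ... */

	free(buf);
	return n;
}

Whether the individual copies then stay large and aligned depends on
how the data arrives, which is what the questions further down are
about.]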
I have no objection to improving copy_tofrom_user, memcpy and
copy_page.  I just want to make sure that we don't make things worse
on some platform.

In fact, Mark and I dug up some experiments I had done 5 or 6 years
ago and just ran through all the copy loops I tried back then, on
QS22, POWER6, POWER5+, POWER5, POWER4, 970, and POWER3, and compared
them to the current kernel routines and the proposed new Cell
routines.  So far we have just looked at the copy_page case (i.e. 4kB
on a 4kB alignment) for cache-cold and cache-hot cases.

Interestingly, some of the routines I discarded back then turn out to
do really well on most of the modern platforms, and quite a lot
better on Cell than Gunnar's code does (~10GB/s vs. ~5.5GB/s in the
hot-cache case, IIRC).  Mark is going to summarise the results and
also measure the speed for smaller copies and misaligned copies.

As for the distribution of sizes, I think it would be worthwhile to
run a fresh set of tests.  As I said, my previous results showed most
copies to be either small (<= 128B) or a multiple of 4k, and I think
that was true for copy_tofrom_user as well as memcpy, but that was a
while ago.

> much of its time in copy_to_user.
>
> Going to 10gbit will make the problem even more apparent.

Is this application really transferring bulk data and using buffers
that aren't a multiple of the page size?  Do you know whether the
copies ended up being misaligned?

Of course, if we really want the fastest copy possible, the thing to
do is to use VMX loads and stores on 970, POWER6 and Cell.  The
overhead of setting up to use VMX in the kernel would probably kill
any advantage, though -- at least, that's what I found when I tried
using VMX for copy_page in the kernel on 970 a few years ago.

> Doing some static compile-time analysis, I found that most
> of the call sites (which are not necessarily most of
> the run time calls) pass either a small constant size of
> less than a few cache lines, or have a variable size but are
> not at all performance critical.
>
> Since the prefetching and cache line size awareness was
> most of the improvement for cell (AFAIU), maybe we can
> annotate the few interesting cases, say by introducing a
> new copy_from_user_large() function that can be easily
> optimized for large transfers on a given CPU, while
> the remaining code keeps optimizing for small transfers
> and may even get rid of the full page copy optimization
> in order to save a branch.

Let's see what Mark comes up with.  We may be able to find a way to
do it that works well across all current CPUs and also is OK for
small copies.  If not we might need to do what you suggest.

Regards,
Paul.
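P.S. For concreteness, the kind of annotation Arnd is suggesting might
look roughly like the sketch below.  The name copy_from_user_large()
comes from his mail; the config symbol and arch hook are purely
illustrative, not an agreed or existing interface.

#include <linux/uaccess.h>

/*
 * Illustrative sketch only: call sites known to move bulk data opt in
 * explicitly, so the generic copy_from_user() can stay tuned for small
 * copies.  The arch hook and config symbol below are hypothetical.
 */
#ifdef CONFIG_ARCH_HAS_COPY_FROM_USER_LARGE
extern unsigned long __arch_copy_from_user_large(void *to,
		const void __user *from, unsigned long n);
#endif

static inline unsigned long
copy_from_user_large(void *to, const void __user *from, unsigned long n)
{
#ifdef CONFIG_ARCH_HAS_COPY_FROM_USER_LARGE
	/* prefetching, cacheline-aware loop tuned for large transfers */
	return __arch_copy_from_user_large(to, from, n);
#else
	/* everything else keeps the existing small-copy-friendly path */
	return copy_from_user(to, from, n);
#endif
}

The few hot call sites in the network paths would then switch to it
explicitly, and the generic routine could drop the full-page special
case if that saves a branch, as Arnd suggests.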