On Fri, Jul 15, 2005 at 03:48:44PM -0700, Max Krasnyansky ([EMAIL PROTECTED]) wrote:
> Jamal,
>
> Thanx for forwarding. Somehow I didn't see this one. I guess it's time for me
> to check my linux-kernel and netdev subscriptions :).
>
> >>Basic idea behind zero-copy is remapping of the
> >>physical pages where skb->data lives to the
> >>userspace process.
> >>
> >>According to my tests, which can be found commented
> >>in the code (packet_mmap()),
> >>remapping of one page gets from 5 up to 20
> >>times faster than copying the same amount of data
> >>(i.e. PAGE_SIZE).
>
> Interesting. I had the exact same idea for the TUN/TAP driver.
> However, when I looked at how much stuff needs to be done to remap those
> pages, my first thought was that there is no way in hell it can be much,
> much faster than copying.
> But you are correct that the user-kernel copy is slow (relatively speaking),
> so what I ended up doing was allocating a buffer in kernel space, mmap()ing
> the whole thing once into userspace and letting the userspace app manage
> it via descriptors (a la an Ethernet device ring buffer). A regular memcpy()
> is used for the skb->data-to-ring copy. I did some measurements of the
> per-packet copy overhead of the current TUN/TAP implementation, which uses
> copy_to_user(), vs. the kernel buffer approach. The new code saves about
> 1000 cycles (P4-M 1.5GHz) on average. I did not measure it against remapping.
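If I understand your descriptor scheme correctly, the shared area looks
roughly like this - just how I picture it, the names and sizes here are mine
and not from your patch:

#include <stdint.h>
#include <string.h>

#define RING_SLOTS	256
#define SLOT_SIZE	2048	/* room for a standard 1500-byte MTU frame */

enum { SLOT_FREE = 0, SLOT_READY = 1 };	/* who owns the slot */

struct ring_slot {
	volatile uint32_t status;	/* SLOT_FREE or SLOT_READY */
	uint32_t len;			/* valid bytes in data[] */
	uint8_t data[SLOT_SIZE];
};

struct ring {
	uint32_t tail;			/* next slot userspace will read */
	struct ring_slot slot[RING_SLOTS];
};

/* Userspace consumer: the whole ring was mmap()ed once, so reading a
 * packet is one memcpy() (or parsing in place) plus a status flip. */
static int ring_read(struct ring *r, uint8_t *buf, uint32_t *len)
{
	struct ring_slot *s = &r->slot[r->tail];

	if (s->status != SLOT_READY)
		return 0;			/* nothing queued */
	*len = s->len;
	memcpy(buf, s->data, s->len);
	s->status = SLOT_FREE;			/* hand the slot back */
	r->tail = (r->tail + 1) % RING_SLOTS;
	return 1;
}

The kernel side fills slot->data with a plain memcpy() and flips the status
to SLOT_READY, so copy_to_user() disappears from the fast path entirely -
which is where your ~1000 saved cycles come from, I assume.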
Actually the remapping can be made much faster - it is only needed to replace
one TLB entry with another. In the current VM code this can be done by
remapping the appropriate PTE, but that code is not exported to modules, so I
used what was exported, which are quite heavy functions. And of course the
flushing - it can slow things down significantly.

> >>Since current VM code requires the PTE to be unmapped
> >>when remapping, but only exports unmap_mapping_range()
> >>and __flush_tlb(), I used them, although they are quite
> >>heavy monsters.
> >>It also required mm->mmap_sem to be held,
> >>so I placed the main remapping code into a workqueue.
>
> Yeah. I cannot imagine how this can be more efficient, especially on
> short (100-500 byte) packets. I guess on large packets you can break even
> or even be faster.

Sure. It should only be used for packets of at least the standard 1500-byte
MTU size; I can run some tests tomorrow with a 1500-byte buffer size. For
smaller packets it is definitely not suitable - there we should have some
kind of copying, like in the mmap()ed socket, the PF_RING implementation from
www.ntop.org, or your new TUN/TAP mechanism.

> >>skbs are queued in prot_hook.func() and then the workqueue
> >>is scheduled, where the skb is unlinked and remapped.
> >>It is not freed there, as it should be, since userspace
> >>would never find the real data then; instead
> >>some smart algorithm should be investigated to defer skb freeing,
> >>or simple deferring using a timer and a redefined skb destructor.
>
> A timer? What do you set it to?
> You just need a descriptor for each packet with status bits (used, unused,
> etc).

The current scheme is the following: the mmap size gives me a budget of
packets, one PAGE_SIZE per packet, so if 5 pages were requested to be mapped,
the budget is 4 - one page is reserved for the control block. New skbs are
linked into a per-socket queue, where they are remapped into the provided
pointers; after remapping, the skb is queued into a list of skbs to be freed.
When the remapping code is called the next time (i.e. when a new skb is
received), it checks whether some timeout has expired since the last freeing;
if so, it frees all skbs from the free list except the last 'budget' number
of skbs. With a high budget, userspace will be able to read the same skb
several times before the budget is exhausted and the skbs are freed.
Duplicate reading can be eliminated by checking the control block for the
same skb cookie, or even just the offset of skb->data in the page - in my
tests it is very unlikely that 'budget' skbs in a row will have the same
offset of skb->data in the page.

> >>It also should remap several skbs at once, so rescheduling
> >>would not occur very frequently.
> >>The first mapped page is an information page, where the offset in the
> >>page of skb->data is placed, so userspace can detect
> >>where the actual data lives on the next page.
> >>
> >>Such a schema is very suitable for applications that
> >>do not require the whole data flow, but only select some data
> >>from the flow, based on packet content.
> >>I'm quite sure it will be slower than copying for small packets,
>
> I would say not just slower, much slower :).

I'm not a VM hacker, but it looks like the TLB flushing is the only really
expensive operation here; many other things could be simplified instead of
using unmap_mapping_range(). But I agree that for small packets copying is
much faster and should be used instead.

> Max

--
	Evgeniy Polyakov
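P.S. The deferred freeing I describe above boils down to something like this -
a rough sketch of the idea, not the actual patch code, and the timeout value
is made up:

#include <linux/skbuff.h>
#include <linux/jiffies.h>

/* All of this runs from the single remapping workqueue, so the
 * queue's own lock is the only serialization needed; free_list
 * must be set up with skb_queue_head_init() at socket creation. */
static struct sk_buff_head free_list;
static unsigned long last_free;		/* jiffies at last cleanup */
static unsigned int budget;		/* mapped pages minus the control page */
static unsigned long free_timeout = HZ / 10;	/* illustrative value */

static void defer_free(struct sk_buff *skb)
{
	skb_queue_tail(&free_list, skb);

	if (!time_after(jiffies, last_free + free_timeout))
		return;

	/* The last 'budget' skbs stay alive: their pages are still
	 * mapped into the process and may be read again. */
	while (skb_queue_len(&free_list) > budget)
		kfree_skb(skb_dequeue(&free_list));

	last_free = jiffies;
}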
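And the userspace view of the information page can be as simple as the
following - again only a sketch of the interface; the cookie field is the
duplicate-read check I mention above, the exact layout is not what the patch
currently exports:

#include <stdint.h>

struct ctl_entry {
	uint32_t cookie;	/* per-skb cookie, lets the reader catch a re-read */
	uint32_t offset;	/* offset of skb->data inside the data page */
};

struct ctl_block {
	uint32_t nr_pages;		/* number of data pages (the budget) */
	struct ctl_entry entry[];	/* one entry per mapped data page */
};

/* Page 0 of the mapping is the control block; one packet per data page. */
static inline uint8_t *packet_data(void *base, long page_size, unsigned int i)
{
	struct ctl_block *cb = base;

	return (uint8_t *)base + (i + 1) * page_size + cb->entry[i].offset;
}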