Jamal,

Thanks for forwarding. Somehow I didn't see this one. I guess it's time for me
to check my linux-kernel and netdev subscriptions :).

> The basic idea behind zero-copy is remapping the physical pages where
> skb->data lives into the userspace process.

> According to my tests, which can be found commented in the code
> (packet_mmap()), remapping one page is 5 to 20 times faster than
> copying the same amount of data (i.e. PAGE_SIZE).

Interesting. I had the exact same idea for the TUN/TAP driver.
However, when I looked at how much work needs to be done to remap those pages,
my first thought was that there is no way in hell it can be much faster than
copying.
But you are correct that the user-kernel copy is slow (relatively speaking), so
what I ended up doing was allocating a buffer in kernel space, mmapping the
whole thing once into userspace, and letting the userspace app manage it via
descriptors (a la an Ethernet device ring buffer). A regular memcpy() is used
for the skb->data to ring copy. I measured the per-packet copy overhead of the
current TUN/TAP implementation, which uses copy_to_user(), against the kernel
buffer approach. The new code saves about 1000 cycles (P4 M, 1.5 GHz) on
average. I did not measure it against remapping.

> Since the current VM code requires the PTE to be unmapped when remapping,
> but only exports unmap_mapping_range() and __flush_tlb(), I used them,
> although they are quite heavy monsters.
> It also requires mm->mmap_sem to be held, so I placed the main remapping
> code into a workqueue.
Yeah. I cannot imagine how this can be more efficient, especially on
short (100-500 byte) packets. I guess on large packets you can break even
or even come out ahead.

> skbs are queued in prot_hook.func() and then the workqueue is scheduled;
> there the skb is unlinked and remapped.
> It is not freed there, as it normally would be, since userspace would then
> never find the real data; instead, some smarter algorithm for deferring the
> skb freeing should be investigated, or simple deferral using a timer and a
> redefined skb destructor.
A timer? What do you set it to?
You just need a descriptor for each packet with status bits (used, unused, etc).

> It should also remap several skbs at once, so that rescheduling does not
> happen very frequently.
> The first mapped page is an information page, where the in-page offset of
> skb->data is stored, so userspace can detect where the actual data lives
> on the next page.

> Such a scheme is very suitable for applications that do not need the whole
> data flow, but only select some data from the flow based on packet content.
I'm quite sure it will be slower than copying for small packets; I would say
not just slower but much slower :).

Max





