On Mon, May 27, 2013 at 12:01:07PM -0500, Anthony Liguori wrote:
> Paolo Bonzini <pbonz...@redhat.com> writes:
>
> > On 27/05/2013 18:18, Anthony Liguori wrote:
> >> Paolo Bonzini <pbonz...@redhat.com> writes:
> >>
> >>> On 27/05/2013 11:34, Stefan Hajnoczi wrote:
> >>>> On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote:
> >>>>> Stefan put us onto the highly promising track of vhost/virtio. We have
> >>>>> implemented this between Snabb Switch and the Linux kernel, but not
> >>>>> directly between Snabb Switch and QEMU guests. The "roadblock" we have
> >>>>> hit is embarrassingly basic: QEMU is using user-to-kernel system calls
> >>>>> to set up vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I
> >>>>> haven't found a good way to map these towards Snabb Switch instead of
> >>>>> the kernel.
> >>>>
> >>>> vhost_net is about connecting a virtio-net speaking process to a
> >>>> tun-like device. The problem you are trying to solve is connecting a
> >>>> virtio-net speaking process to Snabb Switch.
> >>>>
> >>>> Either you need to replace vhost or you need a tun-like device
> >>>> interface.
> >>>>
> >>>> How does your switch talk to hardware?
> >>>
> >>> And also, is your switch monolithic or does it consist of different
> >>> processes?
> >>>
> >>> If you already have processes talking to each other, the first thing
> >>> that came to my mind was a new network backend, similar to net/vde.c but
> >>> more featureful (so that you support the virtio headers for offloading,
> >>> for example). Then you would use "-netdev snabb,id=net0 -device
> >>> e1000,netdev=net0".
> >>
> >> It would be very interesting to combine this with vmsplice/splice.
> >
> > Was zero-copy vmsplice/splice actually ever implemented? I thought it
> > was reverted.
>
> Not sure what context you're talking about re: zero copy... a pipe can
> store references to pages instead of having a buffer that stores data.
> That certainly is there today--otherwise the interface is pointless.
>
> When splicing from pipe to pipe, you can move those references without
> copying the data.
>
> When vmsplicing from a userspace region to a pipe, the kernel just
> stores references to the pages. vmsplicing from a pipe to userspace
> OTOH will copy the data. This is fixable at least when dealing with
> GIFT'd pages. For guest-to-guest traffic, you wouldn't be gifting the
> pages I don't think.
>
> For implementing guest-to-guest traffic, the source QEMU can vmsplice
> the packet to a pipe that is shared with the vswitch. The vswitch can
> tee(2) the first N bytes to a second pipe such that it can read the
> info needed for routing decisions.
>
> Once the decision is made, if it's a local guest, it can splice() the
> packet to the appropriate destination QEMU process or another vswitch
> daemon (no data copy here).
>
> Finally, the destination QEMU process can vmsplice() from the pipe which
> will copy the data (this is the only copy).
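
For reference, the pipe flow quoted above can be sketched roughly as
follows. This is only an illustration under several assumptions: all
three roles (source QEMU, vswitch, destination QEMU) are collapsed into
one process, forward_packet() and HDR_LEN are made-up names, the packet
fits in a single pipe, and a plain read() stands in for the final copy
on the destination side. It is not how QEMU or any vswitch does it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define HDR_LEN 14 /* assumption: bytes the switch inspects (Ethernet header) */

static void close_pair(int p[2])
{
    if (p[0] >= 0) close(p[0]);
    if (p[1] >= 0) close(p[1]);
}

/*
 * Hypothetical helper: push one packet through source -> switch ->
 * destination pipes. Copies the first HDR_LEN bytes into hdr and the
 * whole packet into out. Returns the bytes delivered, or -1 on error.
 * The caller must leave pkt untouched until the call returns, because
 * the pipe only holds references to its pages.
 */
static ssize_t forward_packet(void *pkt, size_t len, void *hdr, void *out)
{
    int src[2] = { -1, -1 }, peek[2] = { -1, -1 }, dst[2] = { -1, -1 };
    struct iovec iov = { .iov_base = pkt, .iov_len = len };
    ssize_t n = -1;

    assert(len >= HDR_LEN);
    if (pipe(src) < 0 || pipe(peek) < 0 || pipe(dst) < 0)
        goto done;

    /* Source: the pipe stores references to the packet's pages. */
    if (vmsplice(src[1], &iov, 1, 0) != (ssize_t)len)
        goto done;

    /* Switch: duplicate the header bytes without consuming the packet. */
    if (tee(src[0], peek[1], HDR_LEN, 0) != HDR_LEN)
        goto done;
    if (read(peek[0], hdr, HDR_LEN) != HDR_LEN)
        goto done;

    /* Switch: move the packet to the chosen destination pipe. */
    if (splice(src[0], NULL, dst[1], NULL, len, SPLICE_F_MOVE) != (ssize_t)len)
        goto done;

    /* Destination: pulling the data out of the pipe is the one copy. */
    n = read(dst[0], out, len);
done:
    close_pair(src);
    close_pair(peek);
    close_pair(dst);
    return n;
}
```

In a real setup the three pipes would be shared between separate
processes, and the routing decision would sit between the tee() and the
splice().
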

AFAIK splice is mostly useless for networking, as there's no way to get
notified when a packet has been sent.

> If vswitch needs to route externally, then it would need to splice() to
> a macvtap.
>
> macvtap should be able to send the packet without copying the data. Not
> sure that this last work will work as expected but if it doesn't, that's
> a bug that can/should be fixed.
>
> The kernel cannot do better than the above modulo any overhead from
> userspace context switching[*].

Also modulo scheduler latency: the kernel processes packets in interrupt
context. There's a reason that e.g. OVS runs its datapath in the kernel.

> Guest-to-guest requires a copy.
>
> Normally macvtap is undesirable because it's tightly connected to a
> network adapter but that is a desirable trait in this case.
>
> N.B., I'm not advocating making all switching decisions in
> userspace. Just pointing out how it can be done efficiently.
>
> [*] in theory the kernel could do zero copy receive but I'm not sure
> it's feasible in practice.
>
> Regards,
>
> Anthony Liguori
>
> >
> > Paolo
> >
> >>> It would be slower than vhost-net, for example no zero-copy
> >>> transmission.
> >>
> >> With splice, I think you could at least get single copy guest-to-guest
> >> networking which is about as good as can be done.
> >>
> >> Regards,
> >>
> >> Anthony Liguori
> >>
> >>>> 3. Use the kernel as a middle-man. Create a double-ended "veth"
> >>>> interface and have Snabb Switch and QEMU each open a PF_PACKET
> >>>> socket and accelerate it with VHOST_NET.
> >>>
> >>> As Michael mentioned, this could be macvtap on the interface that you
> >>> have already created in the switch and passed to vhost-net. Then you do
> >>> not have to do anything in QEMU.
> >>>
> >>> Paolo
> >>>
> >>>> If you are using the Linux network stack then it might be better to
> >>>> integrate with vhost maybe as a tun-like device driver.
> >>>>
> >>>> Stefan