On Mon, Jan 25, 2010 at 03:00:16PM -0600, Anthony Liguori wrote: > On 01/25/2010 02:27 PM, Michael S. Tsirkin wrote: >> On Mon, Jan 25, 2010 at 01:58:08PM -0600, Anthony Liguori wrote: >> >>> On 01/11/2010 11:23 AM, Michael S. Tsirkin wrote: >>> >>>> start/stop backend on driver start/stop >>>> >>>> Signed-off-by: Michael S. Tsirkin<m...@redhat.com> >>>> --- >>>> hw/virtio-net.c | 40 ++++++++++++++++++++++++++++++++++++++++ >>>> 1 files changed, 40 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/hw/virtio-net.c b/hw/virtio-net.c >>>> index c2a389f..99169e1 100644 >>>> --- a/hw/virtio-net.c >>>> +++ b/hw/virtio-net.c >>>> @@ -17,6 +17,7 @@ >>>> #include "net/tap.h" >>>> #include "qemu-timer.h" >>>> #include "virtio-net.h" >>>> +#include "vhost_net.h" >>>> >>>> #define VIRTIO_NET_VM_VERSION 11 >>>> >>>> @@ -47,6 +48,7 @@ typedef struct VirtIONet >>>> uint8_t nomulti; >>>> uint8_t nouni; >>>> uint8_t nobcast; >>>> + uint8_t vhost_started; >>>> struct { >>>> int in_use; >>>> int first_multi; >>>> @@ -114,6 +116,10 @@ static void virtio_net_reset(VirtIODevice *vdev) >>>> n->nomulti = 0; >>>> n->nouni = 0; >>>> n->nobcast = 0; >>>> + if (n->vhost_started) { >>>> + vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), vdev); >>>> + n->vhost_started = 0; >>>> + } >>>> >>>> /* Flush any MAC and VLAN filter table state */ >>>> n->mac_table.in_use = 0; >>>> @@ -820,6 +826,36 @@ static NetClientInfo net_virtio_info = { >>>> .link_status_changed = virtio_net_set_link_status, >>>> }; >>>> >>>> +static void virtio_net_set_status(struct VirtIODevice *vdev) >>>> +{ >>>> + VirtIONet *n = to_virtio_net(vdev); >>>> + if (!n->nic->nc.peer) { >>>> + return; >>>> + } >>>> + if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) { >>>> + return; >>>> + } >>>> + >>>> + if (!tap_get_vhost_net(n->nic->nc.peer)) { >>>> + return; >>>> + } >>>> + if (!!n->vhost_started == !!(vdev->status& >>>> VIRTIO_CONFIG_S_DRIVER_OK)) { >>>> + return; >>>> + } >>>> + if (vdev->status& VIRTIO_CONFIG_S_DRIVER_OK) { >>>> + int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), vdev); >>>> + if (r< 0) { >>>> + fprintf(stderr, "unable to start vhost net: " >>>> + "falling back on userspace virtio\n"); >>>> + } else { >>>> + n->vhost_started = 1; >>>> + } >>>> + } else { >>>> + vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), vdev); >>>> + n->vhost_started = 0; >>>> + } >>>> +} >>>> + >>>> >>>> >>> This function does a number of bad things. It makes virtio-net have >>> specific knowledge of backends (like tap) and then has virtio-net pass >>> device specific state (vdev) to a network backend. >>> >>> Ultimately, the following things need to happen: >>> >>> 1) when a driver is ready to begin operating: >>> a) virtio-net needs to tell vhost the location of the ring in physical >>> memory >>> b) virtio-net needs to tell vhost about any notification it receives >>> (allowing kvm to shortcut userspace) >>> c) virtio-net needs to tell vhost about which irq is being used >>> (allowing kvm to shortcut userspace) >>> >>> What this suggests is that we need an API for the network backends to do >>> the following: >>> >>> 1) probe whether ring passthrough is supported >>> 2) set the *virtual* address of the ring elements >>> 3) hand the backend some sort of notification object for sending and >>> receiving notifications >>> 4) stop ring passthrough >>> >> Look at how vnet_hdr is setup: frontend probes backend, and has 'if >> (backend has vnet header) blabla' vhost is really very similar, so I >> would like to do it in the same way. >> > > vnet_hdr is IMHO a really bad example to copy from. > > vnet_hdr leaks into the migration state via n->has_vnet_hdr. What this > means is that if you want to migrate from -net tap -net nic,model=virtio > to -net user -net nic,model=virtio, it will fail. > > This is a hard problem to solve in qemu though because it would require > that we implement software GSO which so far, no one has stepped up to do. > > We don't have to repeat this with vhost-net though because unlike > vnet_hdr, we don't have to expose anything to the guest. > >> Generally I do not believe in abstractions that only have one >> implementation behind it: you only know how to abstract interface after you >> have more than one implementation. So whoever writes another frontend >> that can use vhost will be in a better position to add infrastructure to >> abstract both that new thing and virtio. >> > > I agree with you, but at the same time, I also believe that layering > violations should be avoided. I'm not suggesting that you need to make > anything other than the vhost-net + virtio-net case work. Just make the > interfaces abstract enough that you don't need to hand a vdev to > vhost-net and such that you don't have to pass kvm specific data > structure (ioeventfd) in what are supposed to be generic interfaces. > >>> vhost should not need any direct knowledge of the device. This >>> interface should be enough to communicating the required data. I think >>> the only bit that is a little fuzzy and perhaps non-obvious for the >>> current patches is the notification object. I would expect it look >>> something like: >>> >>> typedef struct IOEvent { >>> int type; >>> void (*notify)(IOEvent *); >>> void (*on_notify)(IOEvent *, void (*cb)(IOEvent *, void *)); >>> } IOEvent; >>> And then we would have >>> >>> typedef struct KVMIOEvent { >>> IOEvent event = {.type = KVM_IO_EVENT}; >>> int fd; >>> } KVMIOEvent; >>> >>> if (kvm_enabled()) in virtio-net, we would allocate a KVMIOEvent for the >>> PIO notification and hand that to vhost. vhost would check event.type >>> and if it's KVM_IOEVENT, downcast and use that to get at the file >>> descriptor. >>> >> Since we don't have any other IOEvents, I just put the fd >> in the generic structure directly. If other types surface >> we'll see how to generalize it. >> > > I'd feel a whole lot better if we didn't pass the fd around and instead > passed around a structure. With just a tiny bit of layering, we can > even avoid making that structure KVM specific :-) > >>> The KVMIOEvent should be created, not in the set status callback, but in >>> the PCI map callback. And in the PCI map callback, cpu_single_env >>> should be passed to a kvm specific function to create this KVMIOEvent >>> object that contains the created eventfd() that's handed to kvm via >>> ioctl. >>> >> So this specific thing does not work very well because with irqchip, we >> want to bypass notification and send irq directly from kernel. >> So I created a structure but it does not have callbacks, >> only the fd. >> > > Your structure is an int, right?
it *has* an int: struct EventNotifier { int fd; }; > I understand the limits due to lack of > in-kernel irqchip but I still think passing around an fd is a mistake. So API's will all use EventNotifier *. We'll be able to add downcasting etc if/when we need it. >>> There >>> should also be strong separation between the vhost-net code and the >>> virtio-net device. >>> >>> Regards, >>> >>> Anthony Liguori >>> >>> >> I don't see the point of this last idea. vhost is virtio accelerator, >> not a generic network backend. Whoever wants to use vhost for >> non-virtio frontends will have to add infrastructure for this >> separation, but I do not believe it's practical or desirable. >> > > vhost is a userspace interface to inject packets into a network device > just like a raw socket or a tun/tap device is. It happens to have some > interesting features like it supports remapping of physical addresses > and it implements interfaces for notifications that can conveniently be > used by KVM to bypass userspace in the fast paths. > > We should think of it this way for the same reason that vhost-net > doesn't live in kvm.ko. If it's easy to isolate something and make it > more generic, it's almost always the right thing to do. In this case, > isolating vhost-net from virtio-net in qemu just involves passing some > information in a function call verses passing a non-public data > structure that is then accessed directly. Regardless of whether you > agree with my view of vhost-net, the argument alone to avoid making a > non-public structure public should be enough of an argument. > > Regards, > > Anthony Liguori I'll add accessors, it's not a big deal.