* Michael S. Tsirkin (m...@redhat.com) wrote: > On Thu, May 30, 2019 at 08:08:23PM +0100, Dr. David Alan Gilbert wrote: > > * Michael S. Tsirkin (m...@redhat.com) wrote: > > > On Thu, May 30, 2019 at 07:00:23PM +0100, Dr. David Alan Gilbert wrote: > > > > * Michael S. Tsirkin (m...@redhat.com) wrote: > > > > > On Thu, May 30, 2019 at 04:56:45PM +0200, Jens Freimann wrote: > > > > > > Hi David, > > > > > > > > > > > > sorry for the delayed reply. > > > > > > > > > > > > On Tue, May 28, 2019 at 11:04:15AM -0400, Michael S. Tsirkin wrote: > > > > > > > On Tue, May 21, 2019 at 10:45:05AM +0100, Dr. David Alan Gilbert > > > > > > > wrote: > > > > > > > > * Jens Freimann (jfreim...@redhat.com) wrote: > > > > > > > > > +static void virtio_net_primary_plug_timer(void *opaque); > > > > > > > > > + > > > > > > > > > static void virtio_net_set_link_status(NetClientState *nc) > > > > > > > > > { > > > > > > > > > VirtIONet *n = qemu_get_nic_opaque(nc); > > > > > > > > > @@ -786,6 +796,14 @@ static void > > > > > > > > > virtio_net_set_features(VirtIODevice *vdev, uint64_t features) > > > > > > > > > } else { > > > > > > > > > memset(n->vlans, 0xff, MAX_VLAN >> 3); > > > > > > > > > } > > > > > > > > > + > > > > > > > > > + if (virtio_has_feature(features, VIRTIO_NET_F_STANDBY)) { > > > > > > > > > + atomic_set(&n->primary_should_be_hidden, false); > > > > > > > > > + if (n->primary_device_timer) > > > > > > > > > + timer_mod(n->primary_device_timer, > > > > > > > > > + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + > > > > > > > > > + 4000); > > > > > > > > > + } > > > > > > > > > > > > > > > > What's this magic timer constant and why? > > > > > > > > > > > > To be honest it's a leftover from previous versions (before I took > > > > > > over) of the patches and I'm not sure why the timer is there. > > > > > > I removed it and so far see no reason to keep it. > > > > > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t > > > > > > > > > cmd, > > > > > > > > > @@ -2626,6 +2644,87 @@ void > > > > > > > > > virtio_net_set_netclient_name(VirtIONet *n, const char *name, > > > > > > > > > n->netclient_type = g_strdup(type); > > > > > > > > > } > > > > > > > > > > > > > > > > > > +static void virtio_net_primary_plug_timer(void *opaque) > > > > > > > > > +{ > > > > > > > > > + VirtIONet *n = opaque; > > > > > > > > > + Error *err = NULL; > > > > > > > > > + > > > > > > > > > + if (n->primary_device_dict) > > > > > > > > > + n->primary_device_opts = > > > > > > > > > qemu_opts_from_qdict(qemu_find_opts("device"), > > > > > > > > > + n->primary_device_dict, &err); > > > > > > > > > + if (n->primary_device_opts) { > > > > > > > > > + n->primary_dev = > > > > > > > > > qdev_device_add(n->primary_device_opts, &err); > > > > > > > > > + error_setg(&err, "virtio_net: couldn't plug in > > > > > > > > > primary device"); > > > > > > > > > + return; > > > > > > > > > + } > > > > > > > > > + if (!n->primary_device_dict && err) { > > > > > > > > > + if (n->primary_device_timer) { > > > > > > > > > + timer_mod(n->primary_device_timer, > > > > > > > > > + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + > > > > > > > > > + 100); > > > > > > > > > > > > > > > > same here. > > > > > > > > > > > > see above > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + } > > > > > > > > > + } > > > > > > > > > +} > > > > > > > > > + > > > > > > > > > +static void virtio_net_handle_migration_primary(VirtIONet *n, > > > > > > > > > + > > > > > > > > > MigrationState *s) > > > > > > > > > +{ > > > > > > > > > + Error *err = NULL; > > > > > > > > > + bool should_be_hidden = > > > > > > > > > atomic_read(&n->primary_should_be_hidden); > > > > > > > > > + > > > > > > > > > + n->primary_dev = > > > > > > > > > qdev_find_recursive(sysbus_get_default(), > > > > > > > > > + n->primary_device_id); > > > > > > > > > + if (!n->primary_dev) { > > > > > > > > > + error_setg(&err, "virtio_net: couldn't find primary > > > > > > > > > device"); > > > > > > > > > > > > > > > > There's something broken with the error handling in this > > > > > > > > function - the > > > > > > > > 'err' never goes anywhere - I don't think it ever gets printed > > > > > > > > or > > > > > > > > reported or stops the migration. > > > > > > > > > > > > yes, I'll fix it. > > > > > > > > > > > > > > > + } > > > > > > > > > + if (migration_in_setup(s) && !should_be_hidden && > > > > > > > > > n->primary_dev) { > > > > > > > > > + qdev_unplug(n->primary_dev, &err); > > > > > > > > > > > > > > > > Not knowing unplug well; can you just explain - is that device > > > > > > > > hard > > > > > > > > unplugged and it's gone by the time this function returns or is > > > > > > > > it still > > > > > > > > hanging around for some indeterminate time? > > > > > > > > > > > > Qemu will trigger an unplug request via pcie attention button in > > > > > > which case > > > > > > there could be a delay by the guest operating system. We could give > > > > > > it some > > > > > > amount of time and if nothing happens try surpise removal or handle > > > > > > the > > > > > > error otherwise. > > > > > > > > > > > > > > > > > > regards, > > > > > > Jens > > > > > > > > > > That's a subject for another day. Let's get the basic thing > > > > > working. > > > > > > > > Well no, we need to know this thing isn't going to hang in the migration > > > > setup phase, or if it does how we recover. > > > > > > > > > This thing is *supposed* to be stuck in migration startup phase > > > if guest is malicious. > > > > > > If migration does not progress management needs > > > a way to detect this and cancel. > > > > > > Some more documentation about how this is supposed to happen > > > would be helpful. > > > > I want to see that first; because I want to convinced it's just a > > documentation problem and that we actually really have a method of > > recovering. > > > > > > This patch series is very > > > > odd precisely because it's trying to do the unplug itself in the > > > > migration phase rather than let the management layer do it - so unless > > > > it's nailed down how to make sure that's really really bullet proof > > > > then we've got to go back and ask the question about whether we should > > > > really fix it so it can be done by the management layer. > > > > > > > > Dave > > > > > > management already said they can't because files get closed and > > > resources freed on unplug and so they might not be able to re-add device > > > on migration failure. We do it in migration because that is > > > where failures can happen and we can recover. > > > > I find this explanation confusing - I can kind of see where it's coming > > from, but we've got a pretty clear separation between a NIC and the > > netdev that backs it; those files and resources should be associated > > with the netdev and not the NIC. So does hot-removing the NIC really > > clean up the netdev? (I guess maybe this is a different in vfio > > which is the problem) > > > > Dave > > what we are removing is the VFIO device. > Nothing to do with nic or netdev.
OK, but at the same time why can't we hold open the VFIOs devices resources in a comparable way - i.e. don't really let qemu let go of it even when the guest has unplugged it? Dave > > > > > -- > > > > > MST > > > > -- > > > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK > > -- > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK