On Thu, May 30, 2019 at 07:00:23PM +0100, Dr. David Alan Gilbert wrote: > * Michael S. Tsirkin (m...@redhat.com) wrote: > > On Thu, May 30, 2019 at 04:56:45PM +0200, Jens Freimann wrote: > > > Hi David, > > > > > > sorry for the delayed reply. > > > > > > On Tue, May 28, 2019 at 11:04:15AM -0400, Michael S. Tsirkin wrote: > > > > On Tue, May 21, 2019 at 10:45:05AM +0100, Dr. David Alan Gilbert wrote: > > > > > * Jens Freimann (jfreim...@redhat.com) wrote: > > > > > > +static void virtio_net_primary_plug_timer(void *opaque); > > > > > > + > > > > > > static void virtio_net_set_link_status(NetClientState *nc) > > > > > > { > > > > > > VirtIONet *n = qemu_get_nic_opaque(nc); > > > > > > @@ -786,6 +796,14 @@ static void > > > > > > virtio_net_set_features(VirtIODevice *vdev, uint64_t features) > > > > > > } else { > > > > > > memset(n->vlans, 0xff, MAX_VLAN >> 3); > > > > > > } > > > > > > + > > > > > > + if (virtio_has_feature(features, VIRTIO_NET_F_STANDBY)) { > > > > > > + atomic_set(&n->primary_should_be_hidden, false); > > > > > > + if (n->primary_device_timer) > > > > > > + timer_mod(n->primary_device_timer, > > > > > > + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + > > > > > > + 4000); > > > > > > + } > > > > > > > > > > What's this magic timer constant and why? > > > > > > To be honest it's a leftover from previous versions (before I took > > > over) of the patches and I'm not sure why the timer is there. > > > I removed it and so far see no reason to keep it. > > > > > > > > > > > > > > } > > > > > > > > > > > > static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t cmd, > > > > > > @@ -2626,6 +2644,87 @@ void virtio_net_set_netclient_name(VirtIONet > > > > > > *n, const char *name, > > > > > > n->netclient_type = g_strdup(type); > > > > > > } > > > > > > > > > > > > +static void virtio_net_primary_plug_timer(void *opaque) > > > > > > +{ > > > > > > + VirtIONet *n = opaque; > > > > > > + Error *err = NULL; > > > > > > + > > > > > > + if (n->primary_device_dict) > > > > > > + n->primary_device_opts = > > > > > > qemu_opts_from_qdict(qemu_find_opts("device"), > > > > > > + n->primary_device_dict, &err); > > > > > > + if (n->primary_device_opts) { > > > > > > + n->primary_dev = qdev_device_add(n->primary_device_opts, > > > > > > &err); > > > > > > + error_setg(&err, "virtio_net: couldn't plug in primary > > > > > > device"); > > > > > > + return; > > > > > > + } > > > > > > + if (!n->primary_device_dict && err) { > > > > > > + if (n->primary_device_timer) { > > > > > > + timer_mod(n->primary_device_timer, > > > > > > + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + > > > > > > + 100); > > > > > > > > > > same here. > > > > > > see above > > > > > > > > > > > > > > > > > > > + } > > > > > > + } > > > > > > +} > > > > > > + > > > > > > +static void virtio_net_handle_migration_primary(VirtIONet *n, > > > > > > + MigrationState *s) > > > > > > +{ > > > > > > + Error *err = NULL; > > > > > > + bool should_be_hidden = > > > > > > atomic_read(&n->primary_should_be_hidden); > > > > > > + > > > > > > + n->primary_dev = qdev_find_recursive(sysbus_get_default(), > > > > > > + n->primary_device_id); > > > > > > + if (!n->primary_dev) { > > > > > > + error_setg(&err, "virtio_net: couldn't find primary > > > > > > device"); > > > > > > > > > > There's something broken with the error handling in this function - > > > > > the > > > > > 'err' never goes anywhere - I don't think it ever gets printed or > > > > > reported or stops the migration. > > > > > > yes, I'll fix it. > > > > > > > > > + } > > > > > > + if (migration_in_setup(s) && !should_be_hidden && > > > > > > n->primary_dev) { > > > > > > + qdev_unplug(n->primary_dev, &err); > > > > > > > > > > Not knowing unplug well; can you just explain - is that device hard > > > > > unplugged and it's gone by the time this function returns or is it > > > > > still > > > > > hanging around for some indeterminate time? > > > > > > Qemu will trigger an unplug request via pcie attention button in which > > > case > > > there could be a delay by the guest operating system. We could give it > > > some > > > amount of time and if nothing happens try surpise removal or handle the > > > error otherwise. > > > > > > > > > regards, > > > Jens > > > > That's a subject for another day. Let's get the basic thing > > working. > > Well no, we need to know this thing isn't going to hang in the migration > setup phase, or if it does how we recover.
This thing is *supposed* to be stuck in migration startup phase if guest is malicious. If migration does not progress management needs a way to detect this and cancel. Some more documentation about how this is supposed to happen would be helpful. > This patch series is very > odd precisely because it's trying to do the unplug itself in the > migration phase rather than let the management layer do it - so unless > it's nailed down how to make sure that's really really bullet proof > then we've got to go back and ask the question about whether we should > really fix it so it can be done by the management layer. > > Dave management already said they can't because files get closed and resources freed on unplug and so they might not be able to re-add device on migration failure. We do it in migration because that is where failures can happen and we can recover. > > -- > > MST > -- > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK