As you correctly point out, this code needs to be looked at more carefully so that if the device does disconnect in the background we can handle the migration path gracefully. In particular, we need to decide whether a migration should be allowed to continue if a device disconnects durning the migration stage.
mst, any thoughts? Have you looked at the suggestion I gave Li Feng to move vhost_dev_cleanup() into the connection path in vhost-user-blk? I’m not sure if he’s actively working on it, but I would prefer if we can find a way to keep some state around between reconnects so we aren’t constantly checking dev->started. A device can be stopped for reasons other than backend disconnect so I’d rather not reuse this field to check for backend disconnect failures. On Thu, Apr 30, 2020 at 9:57 AM Dima Stepanov <dimas...@yandex-team.ru> wrote: > > If vhost-user daemon is used as a backend for the vhost device, then we > should consider a possibility of disconnect at any moment. If such > disconnect happened in the vhost_migration_log() routine the vhost > device structure will be clean up. > At the start of the vhost_migration_log() function there is a check: > if (!dev->started) { > dev->log_enabled = enable; > return 0; > } > To be consistent with this check add the same check after calling the > vhost_dev_set_log() routine. This in general help not to break a Could you point to the specific asserts which are being triggered? > migration due the assert() message. But it looks like that this code > should be revised to handle these errors more carefully. > > In case of vhost-user device backend the fail paths should consider the > state of the device. In this case we should skip some function calls > during rollback on the error paths, so not to get the NULL dereference > errors. > > Signed-off-by: Dima Stepanov <dimas...@yandex-team.ru> > --- > hw/virtio/vhost.c | 39 +++++++++++++++++++++++++++++++++++---- > 1 file changed, 35 insertions(+), 4 deletions(-) > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 3ee50c4..d5ab96d 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -787,6 +787,17 @@ static int vhost_dev_set_features(struct vhost_dev *dev, > static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log) > { > int r, i, idx; A couple points here (1) This will fail the live migration if the device is disconnected. That my be the right thing to do, but if there are cases where migrations can proceed with a disconnected device, this may not be desirable. (2) This looks racy. As far as I can tell vhost_dev_set_log() is only called by vhost_migration_log(), and as you say one of the first things vhost_migration_log does is return if dev->started is not set. What’s to stop a disconnect from clearing the vdev right after this check, just before vhost_dev_set_features() is called? As stated above, I would prefer it if we could add some state which would persist between reconnects which could then be checked in the vhost-user code before interacting with the backend. I understand this will be a much more involved change and will require a lot of thought. Also, regarding (1) above, if the original check in vhost_migration_log() returns success if the device is not started why return an error here? I imagine this could lead to some inconsistent behavior if the device disconnects before the first check verses before the second. > + > + if (!dev->started) { > + /* > + * If vhost-user daemon is used as a backend for the > + * device and the connection is broken, then the vhost_dev > + * structure will be reset all its values to 0. > + * Add additional check for the device state. > + */ > + return -1; > + } > + > r = vhost_dev_set_features(dev, enable_log); > if (r < 0) { > goto err_features; > @@ -801,12 +812,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, > bool enable_log) > }