On 11.04.23 17:05, Hanna Czenczek wrote:

[...]

Hanna Czenczek (4):
   vhost: Re-enable vrings after setting features
   vhost-user: Interface for migration state transfer
   vhost: Add high-level state save/load functions
   vhost-user-fs: Implement internal migration

I’m trying to write v2. My intention was to keep the code conceptually largely the same, but to add to the documentation change some thoughts and notes on how this interface is to be used in the future, e.g. when vDPA “extensions” come over to vhost-user.  My plan was to use that documentation as the basis for further discussion.

But now I’m struggling to even write that documentation because it’s not clear to me what exactly the result of the discussion was, so I need to stop even before that.

So as far as I understand, we need/want SUSPEND/RESUME for two reasons:
1. As a signal to the back-end when virt queues are no longer to be processed, so that it is clear that it will not do that when asked for migration state.
2. Stateful devices that support SET_STATUS receive a status of 0 when the VM is stopped, which supposedly resets the internal state. While suspended, device state is frozen, so as far as I understand, SUSPEND before SET_STATUS would have the status change be deferred until RESUME.

I don’t want to get hung up on 2 because it doesn’t really seem important to this series, but: Why does a status of 0 reset the internal state?  [Note: Setting the status to 0 is all virtio_reset() seems to do.]  The vhost-user specification only points to the virtio specification, which doesn’t say anything to that effect. Instead, an explicit device reset is mentioned, which would be VHOST_USER_RESET_DEVICE, i.e. something completely different. Because RESET_DEVICE directly contradicts SUSPEND’s description, I would like to think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.

Is it that a status 0 won’t explicitly reset the internal state, but because it does mean that the driver is unbound, the state should implicitly be reset?

Anyway.  1 seems to be the relevant point for migration.  As far as I understand, currently, a vhost-user back-end has no way of knowing when to stop processing virt queues.  Basically, rings can only transition from stopped to started, but not vice versa.  The vhost-user specification has this bit: “Once the source has finished migration, rings will be stopped by the source. No further update must be done before rings are restarted.”  It just doesn’t say how the front-end lets the back-end know that the rings are (to be) stopped.  So this seems like a pre-existing problem for stateless migration.  Unless this is communicated precisely by setting the device status to 0?

Naturally, what I want to know most of all is whether you believe I can get away without SUSPEND/RESUME for now.  Honestly, it seems to me that I can’t, at least not without turning two blind eyes: without SUSPEND, we can’t ensure that virtiofsd isn’t still processing pending virt queue requests when the state transfer begins, even when the guest CPUs are already stopped.  Of course, virtiofsd could stop queue processing right there and then, but…  That feels like a hack that in the grand scheme of things just isn’t necessary when we could “just” introduce SUSPEND/RESUME into vhost-user for exactly this.
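
To make the ordering concern concrete, here is a minimal Python sketch of a toy back-end. It is only a model under assumptions: the class, its method names, and the choice to complete in-flight requests on suspend are all hypothetical and are not taken from the vhost-user specification, QEMU, or virtiofsd. It just shows why the state request needs to come after an explicit “stop processing” signal.

```python
from enum import Enum, auto


class RingState(Enum):
    STOPPED = auto()
    STARTED = auto()
    SUSPENDED = auto()


class ToyBackEnd:
    """Hypothetical model of a back-end; not virtiofsd's actual logic."""

    def __init__(self):
        self.state = RingState.STOPPED
        self.pending = []    # requests taken from virt queues, not yet completed
        self.completed = []  # requests that have been fully processed

    def start(self):
        # Today's protocol only offers this direction: stopped -> started.
        self.state = RingState.STARTED

    def submit(self, request):
        # Only a started back-end picks up new requests.
        if self.state is not RingState.STARTED:
            raise RuntimeError("virt queues are not being processed")
        self.pending.append(request)

    def suspend(self):
        # Assumed SUSPEND semantics: finish in-flight work, then freeze.
        self.completed.extend(self.pending)
        self.pending.clear()
        self.state = RingState.SUSPENDED

    def resume(self):
        # Assumed RESUME semantics: only valid on a suspended back-end.
        if self.state is not RingState.SUSPENDED:
            raise RuntimeError("RESUME without a preceding SUSPEND")
        self.state = RingState.STARTED

    def get_migration_state(self):
        # Refuse to serialize while requests may still be in flight.
        if self.state is RingState.STARTED:
            raise RuntimeError("state requested while still processing queues")
        return {"completed": list(self.completed)}
```

In this model, asking for the state of a started back-end fails, while calling suspend() first yields a state that already includes every request that was in flight. Without such a signal, the back-end would have to guess when to stop, which is the hack described above.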

Beyond the SUSPEND/RESUME question, I understand everything can stay as-is for now, as the design doesn’t seem to conflict too badly with possible future extensions for other migration phases or more finely grained migration phase control between front-end and back-end.

Did I at least roughly get the gist?

Hanna

