Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface

Tian, Kevin Sun, 25 Nov 2018 23:14:46 -0800

> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> Sent: Friday, November 23, 2018 4:02 AM
> 
[...]
> >
> > I looked at the explanations in this patch, but still didn't get the
> > intention, e.g.:
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> > + *   Transition VFIO device in migration setup state. This is used to prepare
> > + *   VFIO device for migration while application or VM and vCPUs are still in
> > + *   running state.
> >
> > what preparation is actually required? any example?
> 
> Each vendor driver can have different requirements as to how to prepare
> for migration. For example, this phase can be used to allocate a buffer
> that can be mapped to the MIGRATION region's data part, and to allocate a
> staging buffer. The driver might need to spawn a thread that starts
> collecting data that needs to be sent during the pre-copy phase.
> 
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> > + *   When VFIO user space application or VM is active and vCPUs are running,
> > + *   transition VFIO device in pre-copy state.
> >
> > why does the device driver need to know this stage? in precopy phase, the VM
> > is still running. Just dirty page tracking is in progress. the dirty bitmap
> > could be retrieved through its own action interface.
> >
> 
> Not all mdev devices are alike. Pre-copy phase is not just about dirty
> page tracking. Devices which have memory on the device could transfer
> data from that memory during pre-copy phase. For example, an NVIDIA GPU has
> its own FB, so it needs to start sending FB data during pre-copy phase and
> then, during stop and copy phase, send the FB data which was marked dirty
> after it was copied in pre-copy phase. That helps to reduce total down
> time.


yes, that makes sense; otherwise copying the whole big FB at stop time is
time consuming. Curious, does Qemu already support pre-copy of device state
today, or is this series the 1st example to do that?

> 
> > you have code to demonstrate how those states are transitioned in Qemu,
> > but you didn't show evidence why those states are necessary on the device
> > side, which leads to the puzzle whether the definition is overkill and
> > limiting.
> >
> 
> I'm trying to keep these interfaces generic for VFIO and mdev devices.
> It's difficult to define what a vendor driver should do for each state;
> each vendor driver has its own requirements. Vendor drivers should
> decide whether to take any action on state transition or not.
> 
> > the flow in my mind is like below:
> >
> > 1. an interface to turn on/off dirty page tracking on VFIO device:
> >     * vendor driver can do whatever required to enable device specific
> > dirty page tracking mechanism here
> >     * device state is not changed here. still in running state
> >
> > 2. an interface to get dirty page bitmap
> >
> 
> I don't think there should be on/off interface for dirty page tracking.
> If there is a write access on dirty_pfns.start_addr and dirty_pfns.total
> and device_state >= VFIO_DEVICE_STATE_MIGRATION_SETUP &&
> device_state <= VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY then dirty page
> tracking has started, so return dirty page bitmap in data part of
> migration region.

dirty page tracking might be useful for other purposes, e.g. if people want
to just draw memory access pattern of a given VM. binding dirty tracking
to migration flow is limiting...

> 
> 
> > 3. an interface to start/stop device activity
> >     * the effect of stop is to stop and drain in-flight device activities
> > and make device state ready for dump-out. vendor driver can do specific
> > preparation here
> 
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY is to stop the device, but as I
> mentioned above some vendor driver might have to do preparation before
> pre-copy phase starts.
> 
> >     * the effect of start is to check validity of device state and then
> > resume device activities. again, vendor driver can do specific
> > cleanup/preparation here
> >
> 
> That is VFIO_DEVICE_STATE_MIGRATION_RESUME.
> 
> I defined the VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED and
> VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED states to clean up
> everything that was allocated, mmapped, or started (threads) during the
> setup phase. This cleanup can be moved to the transition to the _RUNNING
> state. So if everyone agrees, these states can be removed.
> 
> 
> > 4. an interface to save/restore device state
> >     * should happen when device is stopped
> >     * of course there is still an open question of how to check state
> > compatibility, as Alex pointed out earlier
> >
> 
> I hope above explains why other states are required.
> 

yes, above makes the whole picture much clearer. Thanks a lot!

Accordingly I'm thinking about whether below state definition could be
more general and extensible:

_STATE_NONE, indicates initial state
_STATE_RUNNING, indicates normal state
_STATE_STOPPED, indicates that device activities are fully stopped
_STATE_IN_TRACKING, indicates that device state can be r/w by user space.
this state can be ORed to RUNNING or STOPPED.
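
To make the ORed encoding concrete, here is a minimal sketch of how such
states could be represented as bit flags. The DEV_STATE_* names, the bit
values and the helper are purely illustrative assumptions, not part of the
posted patch or of any existing VFIO header:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding; names and values are illustrative only. */
#define DEV_STATE_NONE        0u
#define DEV_STATE_RUNNING     (1u << 0)
#define DEV_STATE_STOPPED     (1u << 1)
#define DEV_STATE_IN_TRACKING (1u << 2)  /* may be ORed to RUNNING or STOPPED */

/*
 * A state word is well formed if its base part is exactly one of
 * NONE/RUNNING/STOPPED, and IN_TRACKING is only combined with
 * RUNNING or STOPPED.
 */
static bool dev_state_is_valid(uint32_t s)
{
    uint32_t base = s & ~DEV_STATE_IN_TRACKING;

    if (s & DEV_STATE_IN_TRACKING)
        return base == DEV_STATE_RUNNING || base == DEV_STATE_STOPPED;

    return base == DEV_STATE_NONE ||
           base == DEV_STATE_RUNNING ||
           base == DEV_STATE_STOPPED;
}
```

With this encoding (RUNNING | STOPPED) and a bare IN_TRACKING are rejected,
while (RUNNING | IN_TRACKING) and (STOPPED | IN_TRACKING) are accepted.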

live migration could be implemented in below flow:

(at src side)
1. RUNNING -> (RUNNING | IN_TRACKING)
        * this switch does vendor specific preparation to make device
state accessible to user space (as covered by MIGRATION_SETUP)
        * vendor driver may let iterative read get incremental changes
since last read (as covered by MIGRATION_PRECOPY)
        * open: do we need an explicit flag to indicate such capability?
        * dirty page bitmap is also made available upon this change

2. (RUNNING | IN_TRACKING) -> (STOPPED | IN_TRACKING)
        * device is stopped thus device state is finalized
        * user space can read full device state, as defined for
MIGRATION_STOPNCOPY

3. (STOPPED | IN_TRACKING) -> (STOPPED)
        * device state tracking and dirty page tracking are cancelled.
cleanup is done for resources set up in step 1. similar to
MIGRATION_SAVE_COMPLETED

4. STOPPED -> NONE, when device is reset later

(at dest side)

1. NONE -> (STOPPED | IN_TRACKING)
        * prepare device state region so user space can write
        * map to MIGRATION_RESUME
        * open: do we need both NONE and STOPPED, or just STOPPED?
2. (STOPPED | IN_TRACKING) -> STOPPED
        * clean up resources allocated in step 1
        * map to MIGRATION_RESUME_COMPLETED
3. STOPPED -> RUNNING
        * resume the device activities
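
The two flows above can be sketched as a small transition table. This is
only an illustration under assumed names (ST_*, transition_allowed); it
lists just the steps spelled out for the src and dest sides and is not a
proposal for the actual kernel code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define ST_NONE        0u
#define ST_RUNNING     (1u << 0)
#define ST_STOPPED     (1u << 1)
#define ST_IN_TRACKING (1u << 2)

struct transition {
    uint32_t from;
    uint32_t to;
};

static const struct transition allowed[] = {
    /* src side */
    { ST_RUNNING,                  ST_RUNNING | ST_IN_TRACKING }, /* step 1 */
    { ST_RUNNING | ST_IN_TRACKING, ST_STOPPED | ST_IN_TRACKING }, /* step 2 */
    { ST_STOPPED | ST_IN_TRACKING, ST_STOPPED },                  /* step 3 */
    { ST_STOPPED,                  ST_NONE },                     /* step 4 */
    /* dest side (its step 2 is the same edge as src step 3) */
    { ST_NONE,                     ST_STOPPED | ST_IN_TRACKING }, /* step 1 */
    { ST_STOPPED,                  ST_RUNNING },                  /* step 3 */
};

/* Returns true if from -> to is one of the steps listed above. */
static bool transition_allowed(uint32_t from, uint32_t to)
{
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (allowed[i].from == from && allowed[i].to == to)
            return true;
    }
    return false;
}
```

A real implementation would likely accept more edges than these (for
example dropping IN_TRACKING while still running); the table only mirrors
the numbered steps.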

compared to the original definition, I think all important steps are covered:
+enum {
+    VFIO_DEVICE_STATE_NONE,
+    VFIO_DEVICE_STATE_RUNNING,
+    VFIO_DEVICE_STATE_MIGRATION_SETUP,
+    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
+    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
+    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_FAILED,
+    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
+};

FAILED is not a device state. It should be indicated in return value of set
state action.
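
As a rough sketch of what "failure as a return value" could look like: the
set-state helper below returns 0 or a negative errno and leaves the recorded
state untouched on error, so no FAILED state is needed. struct demo_device,
its apply hook and demo_set_state are hypothetical names made up for this
example:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device wrapper, for illustration only. */
struct demo_device {
    uint32_t state;
    /* vendor hook: returns 0 on success or a negative errno */
    int (*apply)(struct demo_device *dev, uint32_t new_state);
};

/* Example hook that always fails, standing in for a vendor driver error. */
static int demo_apply_fail(struct demo_device *dev, uint32_t new_state)
{
    (void)dev;
    (void)new_state;
    return -EIO;
}

/*
 * Set a new state. On failure the error is returned to the caller and
 * dev->state keeps its previous value.
 */
static int demo_set_state(struct demo_device *dev, uint32_t new_state)
{
    int ret = dev->apply ? dev->apply(dev, new_state) : 0;

    if (ret < 0)
        return ret;

    dev->state = new_state;
    return 0;
}
```

The caller (e.g. Qemu) sees the error from the set-state call itself and
can decide how to recover, while the device stays in its last good state.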

CANCELLED can be achieved any time by clearing IN_TRACKING state.

with this new definition, above states can be also selectively used for
other purposes, e.g.:
 
1. user space can do RUNNING->STOPPED->RUNNING for any control reason,
w/o touching device state at all.

2. if someone wants to draw memory access pattern of a VM, it could
be done by RUNNING->(RUNNING | IN_TRACKING)->RUNNING, by reading
dirty bitmap when IN_TRACKING is active. Device state is ready but not 
accessed here, hope it is not a big burden.

Thoughts?

Thanks
Kevin
