> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> Sent: Friday, November 23, 2018 4:02 AM
>
> [...]
> >
> > I looked at the explanations in this patch, but still didn't get the
> > intention, e.g.:
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> > + *   Transition VFIO device in migration setup state. This is used to prepare
> > + *   VFIO device for migration while application or VM and vCPUs are still in
> > + *   running state.
> >
> > what preparation is actually required? any example?
>
> Each vendor driver can have different requirements as to how to prepare
> for migration. For example, this phase can be used to allocate buffer
> which can be mapped to MIGRATION region's data part, and allocating
> staging buffer. Driver might need to spawn thread which would start
> collecting data that need to be send during pre-copy phase.
>
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> > + *   When VFIO user space application or VM is active and vCPUs are running,
> > + *   transition VFIO device in pre-copy state.
> >
> > why does device driver need know this stage? in precopy phase, the VM
> > is still running. Just dirty page tracking is in progress. the dirty bitmap
> > could be retrieved through its own action interface.
> >
>
> All mdev devices are not similar. Pre-copy phase is not just about dirty
> page tracking. For devices which have memory on device could transfer
> data from that memory during pre-copy phase. For example, NVIDIA GPU has
> its own FB, so need to start sending FB data during pre-copy phase and
> then during stop and copy phase send data from FB which is marked dirty
> after that was copied in pre-copy phase. That helps to reduce total down
> time.

Yes, that makes sense; otherwise copying the whole big FB at stop time is
time consuming. Curious, does QEMU already support pre-copy of device state
today, or is this series the first example to do that?

> > you have code to demonstrate how those states are transitioned in Qemu,
> > but you didn't show evidence why those states are necessary in device side,
> > which leads to the puzzle whether the definition is over-killed and
> > limiting.
> >
>
> I'm trying to keep these interfaces generic for VFIO and mdev devices.
> Its difficult to define what vendor driver should do for each state,
> each vendor driver have their own requirements. Vendor drivers should
> decide whether to take any action on state transition or not.
>
> > the flow in my mind is like below:
> >
> > 1. an interface to turn on/off dirty page tracking on VFIO device:
> >   * vendor driver can do whatever required to enable device specific
> >     dirty page tracking mechanism here
> >   * device state is not changed here. still in running state
> >
> > 2. an interface to get dirty page bitmap
> >
>
> I don't think there should be on/off interface for dirty page tracking.
> If there is a write access on dirty_pfns.start_addr and dirty_pfns.total
> and device_state >= VFIO_DEVICE_STATE_MIGRATION_SETUP &&
> device_state <= VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY then dirty page
> tracking has started, so return dirty page bitmap in data part of
> migration region.

Dirty page tracking might be useful for other purposes, e.g. if people
just want to draw the memory access pattern of a given VM. Binding dirty
tracking to the migration flow is limiting...

> >
> > 3. an interface to start/stop device activity
> >   * the effect of stop is to stop and drain in-the-fly device activities and
> >     make device state ready for dump-out. vendor driver can do specific
> >     preparation here
>
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY is to stop the device, but as I
> mentioned above some vendor driver might have to do preparation before
> pre-copy phase starts.
>
> >   * the effect of start is to check validity of device state and then resume
> >     device activities. again, vendor driver can do specific
> >     cleanup/preparation here
> >
>
> That is VFIO_DEVICE_STATE_MIGRATION_RESUME.
>
> Defined VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED and
> VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED states to cleanup all that
> which was allocated/mmapped/started thread during setup phase. This can
> be moved to transition to _RUNNING state. So if all agrees these states
> can be removed.
>
> >
> > 4. an interface to save/restore device state
> >   * should happen when device is stopped
> >   * of course there is still an open how to check state compatibility as
> >     Alex pointed earlier
> >
>
> I hope above explains why other states are required.
>

Yes, the above makes the whole picture much clearer. Thanks a lot!

Accordingly, I'm wondering whether the state definition below could be
more general and extensible:

_STATE_NONE, indicates initial state
_STATE_RUNNING, indicates normal state
_STATE_STOPPED, indicates that device activities are fully stopped
_STATE_IN_TRACKING, indicates that device state can be r/w by user space.
	This state can be ORed with RUNNING or STOPPED.

Live migration could be implemented with the flow below:

(at src side)
1. RUNNING -> (RUNNING | IN_TRACKING)
	* this switch does vendor specific preparation to make device
	  state accessible to user space (as covered by MIGRATION_SETUP)
	* vendor driver may let iterative reads get incremental changes
	  since the last read (as covered by MIGRATION_PRECOPY). *open*: do
	  we need an explicit flag to indicate such a capability?
	* dirty page bitmap is also made available upon this change
2. (RUNNING | IN_TRACKING) -> (STOPPED | IN_TRACKING)
	* device is stopped, thus device state is finalized
	* user space can read full device state, as defined for
	  MIGRATION_STOPNCOPY
3. (STOPPED | IN_TRACKING) -> STOPPED
	* device state tracking and dirty page tracking are cancelled.
	  Cleanup is done for resources set up in step 1. Similar to
	  MIGRATION_SAVE_COMPLETED
4. STOPPED -> NONE, when device is reset later

(at dest side)
1. NONE -> (STOPPED | IN_TRACKING)
	* prepare device state region so user space can write
	* maps to MIGRATION_RESUME
	* *open*: do we need both NONE and STOPPED, or just STOPPED?
2. (STOPPED | IN_TRACKING) -> STOPPED
	* clean up resources allocated in step 1
	* maps to MIGRATION_RESUME_COMPLETED
3. STOPPED -> RUNNING
	* resume the device activities

Compared to the original definition, I think all important steps are
covered:

+enum {
+    VFIO_DEVICE_STATE_NONE,
+    VFIO_DEVICE_STATE_RUNNING,
+    VFIO_DEVICE_STATE_MIGRATION_SETUP,
+    VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
+    VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
+    VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME,
+    VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
+    VFIO_DEVICE_STATE_MIGRATION_FAILED,
+    VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
+};

FAILED is not a device state. It should be indicated in the return value
of the set-state action. CANCELLED can be achieved at any time by clearing
the IN_TRACKING state.

With this new definition, the above states can also be used selectively
for other purposes, e.g.:

1. user space can do RUNNING -> STOPPED -> RUNNING for any control reason,
   w/o touching device state at all.

2. if someone wants to draw the memory access pattern of a VM, it could be
   done by RUNNING -> (RUNNING | IN_TRACKING) -> RUNNING, reading the dirty
   bitmap while IN_TRACKING is active. Device state is ready but not
   accessed here; hope that is not a big burden.

Thoughts?

Thanks
Kevin
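
To make the proposed flow concrete, here is a rough sketch in C of the
transitions listed above, with failure reported through the return value
rather than a FAILED state. The DEV_STATE_* names, the bit values, the
dev_set_state() helper and the NONE -> RUNNING start-up edge are all
assumptions made for illustration; none of them come from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical encoding, for illustration only. */
enum {
    DEV_STATE_NONE        = 0,      /* initial state */
    DEV_STATE_RUNNING     = 1 << 0, /* normal operation */
    DEV_STATE_STOPPED     = 1 << 1, /* device activity fully stopped */
    DEV_STATE_IN_TRACKING = 1 << 2, /* device state r/w by user space */

    /* IN_TRACKING ORed with a base state */
    DEV_RUN_TRACK  = DEV_STATE_RUNNING | DEV_STATE_IN_TRACKING,
    DEV_STOP_TRACK = DEV_STATE_STOPPED | DEV_STATE_IN_TRACKING,
};

struct dev_transition {
    unsigned int from;
    unsigned int to;
};

/* Transitions named in the mail, plus one assumed start-up edge. */
static const struct dev_transition dev_allowed[] = {
    { DEV_STATE_NONE,    DEV_STATE_RUNNING }, /* assumption: normal start-up */
    { DEV_STATE_RUNNING, DEV_RUN_TRACK },     /* src 1: setup / pre-copy */
    { DEV_RUN_TRACK,     DEV_STOP_TRACK },    /* src 2: stop and copy */
    { DEV_STOP_TRACK,    DEV_STATE_STOPPED }, /* src 3 and dest 2: cleanup */
    { DEV_STATE_STOPPED, DEV_STATE_NONE },    /* src 4: device reset */
    { DEV_STATE_NONE,    DEV_STOP_TRACK },    /* dest 1: prepare for resume */
    { DEV_STATE_STOPPED, DEV_STATE_RUNNING }, /* dest 3: resume activity */
    { DEV_STATE_RUNNING, DEV_STATE_STOPPED }, /* plain stop, no tracking */
    { DEV_RUN_TRACK,     DEV_STATE_RUNNING }, /* cancel, or end of tracing */
};

/*
 * Move *cur to next if the pair is in the table above.
 * Returns 0 on success, -EINVAL otherwise; *cur is left unchanged on error.
 */
static int dev_set_state(unsigned int *cur, unsigned int next)
{
    size_t i;

    for (i = 0; i < sizeof(dev_allowed) / sizeof(dev_allowed[0]); i++) {
        if (dev_allowed[i].from == *cur && dev_allowed[i].to == next) {
            *cur = next;
            return 0;
        }
    }
    return -EINVAL;
}
```

A table keeps the set of legal edges easy to audit against the list above;
a real driver would hang vendor specific setup and cleanup off each edge.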