> -----Original Message-----
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Sent: 28 January 2022 08:29
> To: Jag Raman <jag.ra...@oracle.com>
> Cc: John Levon <john.le...@nutanix.com>; Thanos Makatos
> <thanos.maka...@nutanix.com>; qemu-devel <qemu-devel@nongnu.org>;
> Marc-André Lureau <marcandre.lur...@gmail.com>; Philippe Mathieu-Daudé
> <f4...@amsat.org>; Paolo Bonzini <pbonz...@redhat.com>; Beraldo Leal
> <bl...@redhat.com>; Daniel P. Berrangé <berra...@redhat.com>;
> edua...@habkost.net; Michael S. Tsirkin <m...@redhat.com>; Marcel
> Apfelbaum <marcel.apfelb...@gmail.com>; Eric Blake <ebl...@redhat.com>;
> Markus Armbruster <arm...@redhat.com>; Juan Quintela
> <quint...@redhat.com>; Dr. David Alan Gilbert <dgilb...@redhat.com>; Elena
> Ufimtseva <elena.ufimts...@oracle.com>; John Johnson
> <john.g.john...@oracle.com>
> Subject: Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate
> migration
>
> On Thu, Jan 27, 2022 at 05:04:26PM +0000, Jag Raman wrote:
> >
> > > On Jan 25, 2022, at 10:48 AM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > >
> > > On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
> > >> + * The client subsequetly asks the remote server for any data that
> > >
> > > subsequently
> > >
> > >> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> > >> +{
> > >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> > >> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> > >> +    static int migrated_devs;
> > >> +    Error *local_err = NULL;
> > >> +    int ret;
> > >> +
> > >> +    /**
> > >> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
> > >> +     * VMSD data from source is not available at RESUME state.
> > >> +     * Working on a fix for this.
> > >> +     */
> > >> +    if (!o->vfu_mig_file) {
> > >> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
> > >> +    }
> > >> +
> > >> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> > >> +    if (ret) {
> > >> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    qemu_file_shutdown(o->vfu_mig_file);
> > >> +    o->vfu_mig_file = NULL;
> > >> +
> > >> +    /* VFU_MIGR_STATE_RUNNING begins here */
> > >> +    if (++migrated_devs == k->nr_devs) {
> > >
> > > When is this counter reset so migration can be tried again if it
> > > fails/cancels?
> >
> > Detecting cancellation is a pending item. We will address it in the
> > next rev. Will check with you if we get stuck during the process
> > of implementing it.
> >
> > >> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> > >> +                                 uint64_t size, uint64_t offset)
> > >> +{
> > >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> > >> +
> > >> +    if (offset > o->vfu_mig_buf_size) {
> > >> +        return -1;
> > >> +    }
> > >> +
> > >> +    if ((offset + size) > o->vfu_mig_buf_size) {
> > >> +        warn_report("vfu: buffer overflow - check pending_bytes");
> > >> +        size = o->vfu_mig_buf_size - offset;
> > >> +    }
> > >> +
> > >> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> > >> +
> > >> +    o->vfu_mig_buf_pending -= size;
> > >
> > > This assumes that the caller increments offset by size each time. If
> > > that assumption is okay, then we can just trust offset and don't need to
> > > do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
> > > then the code needs to be extended to safely update vfu_mig_buf_pending
> > > when offset jumps around arbitrarily between calls.
> >
> > Going by the definition of vfu_migration_callbacks_t in the library, I
> > assumed that read_data advances the offset by size bytes.
> >
> > Will add a comment to explain that.
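[Editor's note: a minimal sketch, in plain C, of a read_data handler that tolerates reads at any offset, assuming the client may read any subrange of the prepared buffer and may read it more than once. All names here (MigState, mig_read_data, mig_pending_bytes) are invented for illustration and are not the series' or libvfio-user's actual API.]

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the server's prepared migration buffer. */
typedef struct {
    uint8_t buf[256];     /* data prepared for the current iteration */
    uint64_t buf_size;    /* size of the prepared data */
    uint64_t pending;     /* total bytes still to be migrated */
} MigState;

/*
 * read_data may be called with any subrange of the prepared buffer, and
 * the same range may be read more than once, so it must not update the
 * pending counter -- it only copies data out.
 */
int64_t mig_read_data(MigState *s, void *out, uint64_t size, uint64_t offset)
{
    if (offset > s->buf_size) {
        return -1;                      /* read beyond prepared data */
    }
    if (offset + size > s->buf_size) {
        size = s->buf_size - offset;    /* clamp a partial tail read */
    }
    memcpy(out, s->buf + offset, size);
    return (int64_t)size;
}

/*
 * Reading pending_bytes is what ends the iteration: only here does the
 * prepared chunk count as consumed.
 */
uint64_t mig_pending_bytes(MigState *s)
{
    s->pending -= (s->pending < s->buf_size) ? s->pending : s->buf_size;
    return s->pending;
}
```

With this split, repeated or out-of-order reads are harmless, and the pending arithmetic happens exactly once per iteration.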
libvfio-user does not automatically increment offset by size each time,
since the vfio-user client can re-read the migration data multiple times.
In the libvfio-user API we state:

    Function that is called to read migration data. offset and size can
    be any subrange on the offset and size previously returned by
    prepare_data.

Reading the pending_bytes register is what marks the end of the
iteration, and this is where you need to decrement vfu_mig_buf_pending.
I'll add more unit tests to libvfio-user to validate this behavior.

> > >> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
> > >> +{
> > >> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
> > >> +    const VMStateField *field = NULL;
> > >> +    uint64_t size = 0;
> > >> +
> > >> +    if (!dc->vmsd) {
> > >> +        return 0;
> > >> +    }
> > >> +
> > >> +    field = dc->vmsd->fields;
> > >> +    while (field && field->name) {
> > >> +        size += vmstate_size(pci_dev, field);
> > >> +        field++;
> > >> +    }
> > >> +
> > >> +    return size;
> > >> +}
> > >
> > > This function looks incorrect because it ignores subsections as well as
> > > runtime behavior during save(). Although VMStateDescription is partially
> > > declarative, there is still a bunch of imperative code that can write to
> > > the QEMUFile at save() time, so there's no way of knowing the size ahead
> > > of time.
> >
> > I see your point, it would be a problem for any field which has the
> > (VMS_BUFFER | VMS_ALLOC) flags set.
> >
> > > I asked this in a previous revision of this series but I'm not sure if
> > > it was answered: is it really necessary to know the size of the vmstate?
> > > I thought the VFIO migration interface is designed to support
> > > streaming reads/writes. We could choose a fixed size like 64KB and
> > > stream the vmstate in 64KB chunks.
> >
> > The library exposes the migration data to the client as a device BAR
> > whose size is fixed at boot time, even when using
> > vfu_migration_callbacks_t callbacks.
> > I don't believe the library supports streaming vmstate/migration-data -
> > see the following comment in migration_region_access() defined in the
> > library:
> >
> >     * Does this mean that partial reads are not allowed?
> >
> > Thanos or John,
> >
> > Could you please clarify this?

libvfio-user does support streaming of migration data. That comment is
based on the VFIO documentation:

    d. Read data_size bytes of data from (region + data_offset) from the
       migration region.

It's not clear to me whether streaming should be allowed; I'd be
surprised if it weren't.

> > Stefan,
> > We attempted to answer the migration cancellation and vmstate size
> > questions previously also, in the following email:
> >
> > https://lore.kernel.org/all/F48606B1-15A4-4DD2-9D71-2fcafc0e6...@oracle.com/
> >
> > > libvfio-user has the vfu_migration_callbacks_t interface that allows
> > > the device to save/load more data regardless of the size of the
> > > migration region. I don't see the issue here since the region doesn't
> > > need to be sized to fit the savevm data?
>
> The answer didn't make sense to me:
>
> "In both scenarios at the server end - whether using the migration BAR or
> using callbacks, the migration data is transported to the other end using
> the BAR. As such we need to specify the BAR's size during initialization.
>
> In the case of the callbacks, the library translates the BAR access to
> callbacks."
>
> The BAR and the migration region within it need a size, but my
> understanding is that VFIO migration is designed to stream the device
> state, allowing it to be broken up into multiple reads/writes without
> knowing the device state's size upfront. Here is the description from
> <linux/vfio.h>:
>
>  * The sequence to be followed while in pre-copy state and stop-and-copy
>  * state is as follows:
>  * a. Read pending_bytes, indicating the start of a new iteration to get
>  *    device data. Repeated read on pending_bytes at this stage should
>  *    have no side effects.
>  *    If pending_bytes == 0, the user application should not iterate to
>  *    get data for that device.
>  *    If pending_bytes > 0, perform the following steps.
>  * b. Read data_offset, indicating that the vendor driver should make
>  *    data available through the data section. The vendor driver should
>  *    return this read operation only after data is available from
>  *    (region + data_offset) to (region + data_offset + data_size).
>  * c. Read data_size, which is the amount of data in bytes available
>  *    through the migration region.
>  *    Read on data_offset and data_size should return the offset and
>  *    size of the current buffer if the user application reads
>  *    data_offset and data_size more than once here.
>  * d. Read data_size bytes of data from (region + data_offset) from the
>  *    migration region.
>  * e. Process the data.
>  * f. Read pending_bytes, which indicates that the data from the previous
>  *    iteration has been read. If pending_bytes > 0, go to step b.
>  *
>  * The user application can transition from the _SAVING|_RUNNING
>  * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of
>  * the number of pending bytes. The user application should iterate in
>  * _SAVING (stop-and-copy) until pending_bytes is 0.
>
> This means you can report pending_bytes > 0 until the entire vmstate has
> been read and can pick a fixed chunk size like 64KB for the migration
> region. There's no need to size the migration region to fit the entire
> vmstate.
>
> Stefan
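[Editor's note: the chunked save loop that the <linux/vfio.h> excerpt describes can be sketched as below, against a mocked device. MockDev and the helper names are invented for illustration; a real client would perform reads on the migration region instead. The window is 8 bytes only to keep the example small -- the point of the thread is that any fixed size, such as 64KB, works, because the loop repeats until pending_bytes reaches 0.]

```c
#include <stdint.h>
#include <string.h>

enum { MIG_WINDOW = 8 };    /* stand-in for a fixed-size data section */

/* Mocked device whose full state is streamed out in windows. */
typedef struct {
    const uint8_t *state;   /* complete device state being migrated */
    uint64_t state_size;
    uint64_t consumed;      /* bytes the client has already read */
} MockDev;

/* Steps a and f: reading pending_bytes starts/ends an iteration. */
uint64_t read_pending_bytes(const MockDev *d)
{
    return d->state_size - d->consumed;
}

/* Steps b-e: read up to one window of data and process (copy) it. */
uint64_t read_data_section(MockDev *d, uint8_t *out)
{
    uint64_t n = read_pending_bytes(d);
    if (n > MIG_WINDOW) {
        n = MIG_WINDOW;     /* data_size for this iteration */
    }
    memcpy(out, d->state + d->consumed, n);
    d->consumed += n;       /* mock: this chunk is now transferred */
    return n;
}

/* Iterate until pending_bytes reaches 0; returns total bytes saved. */
uint64_t save_device_state(MockDev *d, uint8_t *dst)
{
    uint64_t total = 0;
    while (read_pending_bytes(d) > 0) {
        total += read_data_section(d, dst + total);
    }
    return total;
}
```

The migration region never needs to hold the whole vmstate: a 20-byte state streams through an 8-byte window in three iterations, just as a multi-megabyte vmstate would stream through a 64KB region.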