On 5/30/2024 2:39 PM, Peter Xu wrote:
On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
On 5/29/2024 3:25 PM, Peter Xu wrote:
On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
On 5/28/2024 5:44 PM, Peter Xu wrote:
On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
Preserve fields of RAMBlocks that allocate their host memory during CPR so
the RAM allocation can be recovered.
This sentence itself did not explain much, IMHO. QEMU can already share all
kinds of memory via fd-based backends; as long as the memory backend is
path-based, it can be shared by passing the same paths to dst.
This reads as very confusing as a generic concept. I mean, QEMU migration
relies on so many things to work right. We mostly ask users to "use
exactly the same cmdline for src/dst QEMU unless you know what you're
doing", otherwise many things can break. That should also include ramblocks
being matched between src/dst, due to the same cmdlines being provided on
both sides. It's confusing to mention this here, when we assumed the
ramblocks also rely on that fact.
So IIUC this sentence should be dropped in the real patch, and I'll try to
guess the real reason below..
The properties of the implicitly created ramblocks must be preserved.
The defaults can and do change between qemu releases, even when the command-line
parameters do not change for the explicit objects that cause these implicit
ramblocks to be created.
AFAIU, QEMU relies on ramblocks being the same before this series. Do you
have an example? Would that already cause issues when migrating?
Alignment has changed, and used_length vs max_length changed when
resizeable ramblocks were introduced. I have dealt with these issues
while supporting cpr for our internal use, and the lesson learned is to
explicitly communicate the creation-time parameters to new qemu.
Why used_length can change? I'm looking at ram_mig_ram_block_resized():
    if (!migration_is_idle()) {
        /*
         * Precopy code on the source cannot deal with the size of RAM blocks
         * changing at random points in time - especially after sending the
         * RAM block sizes in the migration stream, they must no longer change.
         * Abort and indicate a proper reason.
         */
        error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr);
        migration_cancel(err);
        error_free(err);
    }
We send used_length up front during the migration SETUP phase. It looks
like what you're describing is something different, though?
I was imprecise. used_length did not change; it was introduced as being
different from max_length when resizeable ramblocks were introduced.
max_length is not sent. It is an implicit property of the implementation,
and can change. It is the size of the memfd mapping, so we need to know it
and preserve it.
used_length is indeed sent during SETUP. We could also send max_length
at that time, store both in the ramblock struct, and *maybe* that would
be safe, but that is more fragile and less future-proof than setting both
properties to the correct value when the ramblock struct is created.
And BTW, the ramblock properties are sent using ad-hoc code in setup;
I send them using nice clean vmstate.
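For illustration, a minimal sketch of how such creation-time properties
could travel in a VMStateDescription; the struct and field names below are
hypothetical, not taken from the actual series:

    #include <stdint.h>
    #include "migration/vmstate.h"

    /* Hypothetical container for the creation-time properties discussed
     * above; fields mirror RAMBlock but this is illustrative only. */
    typedef struct RAMBlockAttrs {
        uint64_t used_length;   /* in-use size, sent during SETUP today */
        uint64_t max_length;    /* full size of the memfd mapping */
        uint64_t align;         /* mr->align at creation time */
        int32_t  fd;            /* memfd preserved across cpr-exec */
    } RAMBlockAttrs;

    static const VMStateDescription vmstate_ramblock_attrs = {
        .name = "ramblock-attrs",
        .version_id = 1,
        .minimum_version_id = 1,
        .fields = (VMStateField[]) {
            VMSTATE_UINT64(used_length, RAMBlockAttrs),
            VMSTATE_UINT64(max_length, RAMBlockAttrs),
            VMSTATE_UINT64(align, RAMBlockAttrs),
            VMSTATE_INT32(fd, RAMBlockAttrs),
            VMSTATE_END_OF_LIST()
        }
    };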
Regarding rb->align: isn't that mostly a constant, reflecting the MR's
alignment? It's set when the ramblock is created IIUC:

    rb->align = mr->align;

When will the alignment change?
The alignment specified by the mr to allocate a new block is an implicit
property of the implementation, and has changed before, from one qemu
release to another. Not often, but it did, and it could again in the future.
Communicating the alignment from old qemu to new qemu is future-proof.
These are not an issue for migration because the ramblock is re-created
and the data copied into the new memory.
Mirror the mr->align field in the RAMBlock to simplify the vmstate.
Preserve the old host address, even though it is immediately discarded,
as it will be needed in the future for CPR with iommufd. Preserve
guest_memfd, even though CPR does not yet support it, to maintain vmstate
compatibility when it becomes supported.
.. It could be about the vfio vaddr update feature that you mentioned, and
only for iommufd (IIUC vfio still relies on iova ranges, so it won't
help here)?
If so, IMHO we should have this patch (or some variant of it) be there
for your upcoming vfio support. Keeping it around like this will make
the series harder to review. Or is it needed even before VFIO?
This patch is needed independently of vfio or iommufd.
guest_memfd is independent of vfio or iommufd. It is a recent addition
which I have not tried to support, but I added this placeholder field
so it can be supported in the future without adding a new field later
and having to maintain backwards compatibility.
Is guest_memfd the only user so far, then? If so, would it be possible to
split it off as a separate effort on top of the base cpr-exec support?
I don't understand the question. I am indeed deferring support for guest_memfd
to a future time. For now, I am adding a blocker, and reserving a field for
it in the preserved ramblock attributes, to avoid adding a subsection later.
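As a rough illustration of the blocker approach (migrate_add_blocker is
the existing QEMU API, but the names and placement here are hypothetical,
not the actual patch):

    #include "qapi/error.h"
    #include "migration/blocker.h"

    static Error *cpr_guest_memfd_blocker;

    /* Register a blocker so migration fails cleanly while guest_memfd
     * is unsupported; it would be removed once support lands. */
    static int cpr_block_guest_memfd(Error **errp)
    {
        error_setg(&cpr_guest_memfd_blocker,
                   "CPR does not yet support guest_memfd");
        return migrate_add_blocker(&cpr_guest_memfd_blocker, errp);
    }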
I meant that I'm wondering whether the new ramblock vmsd might not be
required for the initial implementation.
E.g., IIUC vaddr is required by iommufd, and so far that's not part of the
initial support.
Then I think the major thing is the fds that will need to be managed and
shared. If we put guest_memfd aside, it is really mostly about VFIO fds.
The block->fd must be preserved. That is the fd returned by the
memfd_create call that cpr uses.
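For context, a minimal sketch of the mechanism at play, assuming a plain
memfd and FD_CLOEXEC handling (illustrative only, not the actual cpr code;
error handling omitted for brevity):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = memfd_create("ram-block", 0);  /* anonymous fd-backed RAM */
        ftruncate(fd, 4096);                    /* max_length of the block */

        /* Clear FD_CLOEXEC so the fd stays open across exec(). */
        int flags = fcntl(fd, F_GETFD);
        fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);

        printf("fd %d remains valid in the exec'd image\n", fd);
        /* exec of new qemu would happen here; the fd number travels in
         * the preserved state so the new process can mmap() the same
         * memory. */
        return 0;
    }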
For that, I'm wondering whether you looked into something like
this:
commit da3e04b26fd8d15b344944504d5ffa9c5f20b54b
Author: Zhenzhong Duan <zhenzhong.d...@intel.com>
Date:   Tue Nov 21 16:44:10 2023 +0800

    vfio/pci: Make vfio cdev pre-openable by passing a file handle
I just noticed this when I was thinking of a way where it might be possible
to avoid QEMU vfio-pci opening the device at all, then I found we already
have something like that..
Then if the mgmt wants, IIUC that fd can be passed down from Libvirt
cleanly to dest qemu in a no-exec context. Would this work too, and be
cleaner / reuse existing infrastructure?
That capability as currently defined would not work for cpr. The fd is
pre-created, but qemu still calls the kernel to configure it. cpr skips
all kernel configuration calls.
I think it's nice to always have libvirt manage most, or possibly all, fds
that qemu uses; then we don't even need scm_rights. But I didn't look
deeper into this, just a thought.
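For reference, scm_rights refers to SCM_RIGHTS fd passing over a Unix
domain socket; a minimal illustrative sketch of the sending side (not
from any patch):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send an open fd over a Unix domain socket; the receiver gets its
     * own copy of the fd, e.g. new qemu receiving a memfd or VFIO fd. */
    static int send_fd(int sock, int fd)
    {
        struct msghdr msg = {0};
        char iobuf[1] = {0};
        struct iovec iov = { .iov_base = iobuf, .iov_len = 1 };
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;

        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof(u.buf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }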
One could imagine a solution where the manager extracts internal properties
of vfio, ramblock, etc. and passes them as creation-time parameters on the
new qemu command line. And, the manager pre-creates all fd's so they
can be passed to old and new qemu. Lots of code would be required in qemu
and in the manager, and all implicitly created objects would need to be
made explicit. Yuck. The precreate vmstate approach is much simpler for all.
When thinking about this, I also wonder how cpr-exec handles limited
environments like cgroups and especially seccomp. I'm not sure what the
status of that is in most cloud environments, but I think exec() / fork()
is definitely not always on the seccomp whitelist, and I think that's also
another reason to think about avoiding them.
Exec must be allowed to use cpr-exec mode. Fork can remain blocked.
Currently the qemu sandbox option can block 'spawn', which blocks both exec
and fork. I have a patch in my next series that makes this more
fine-grained, so one or the other can be blocked. Those unwilling to allow
exec can wait for cpr-scm mode :)
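For illustration, a rough libseccomp sketch of such a finer-grained policy,
denying process creation while leaving execve allowed (hypothetical, not
the actual patch; clone() flags are assumed in arg0 as on x86_64):

    #include <errno.h>
    #include <sched.h>
    #include <seccomp.h>

    static int deny_fork_allow_exec(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW); /* default: allow */
        if (!ctx) {
            return -1;
        }
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(fork), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(vfork), 0);
        /* Deny only process-creating clone(); thread creation
         * (CLONE_THREAD set) must still pass. */
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(clone), 1,
                         SCMP_A0(SCMP_CMP_MASKED_EQ, CLONE_THREAD, 0));
        /* execve is intentionally left allowed, as cpr-exec needs it. */
        int rc = seccomp_load(ctx);
        seccomp_release(ctx);
        return rc;
    }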
- Steve