On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote:
> On 03.10.24 18:14, Peter Xu wrote:
> > On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote:
> > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > on the value of the anon-alloc machine property.  This option applies to
> > > memory allocated as a side effect of creating various devices. It does
> > > not apply to memory-backend-objects, whether explicitly specified on
> > > the command line, or implicitly created by the -m command line option.
> > > 
> > > The memfd option is intended to support new migration modes, in which the
> > > memory region can be transferred in place to a new QEMU process, by 
> > > sending
> > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > and if the mode also transfers device descriptors, then pages that are
> > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > 
> > > Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
> > 
> > [Igor seems missing in the loop; added]
> > 
> > > ---
> > >   hw/core/machine.c   | 19 +++++++++++++++++++
> > >   include/hw/boards.h |  1 +
> > >   qapi/machine.json   | 14 ++++++++++++++
> > >   qemu-options.hx     | 11 +++++++++++
> > >   system/physmem.c    | 35 +++++++++++++++++++++++++++++++++++
> > >   system/trace-events |  3 +++
> > >   6 files changed, 83 insertions(+)
> > > 
> > > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > > index adaba17..a89a32b 100644
> > > --- a/hw/core/machine.c
> > > +++ b/hw/core/machine.c
> > > @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool 
> > > value, Error **errp)
> > >       ms->mem_merge = value;
> > >   }
> > > +static int machine_get_anon_alloc(Object *obj, Error **errp)
> > > +{
> > > +    MachineState *ms = MACHINE(obj);
> > > +
> > > +    return ms->anon_alloc;
> > > +}
> > > +
> > > +static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
> > > +{
> > > +    MachineState *ms = MACHINE(obj);
> > > +
> > > +    ms->anon_alloc = value;
> > > +}
> > > +
> > >   static bool machine_get_usb(Object *obj, Error **errp)
> > >   {
> > >       MachineState *ms = MACHINE(obj);
> > > @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, 
> > > void *data)
> > >       object_class_property_set_description(oc, "mem-merge",
> > >           "Enable/disable memory merge support");
> > > +    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
> > > +                                   &AnonAllocOption_lookup,
> > > +                                   machine_get_anon_alloc,
> > > +                                   machine_set_anon_alloc);
> > > +
> > >       object_class_property_add_bool(oc, "usb",
> > >           machine_get_usb, machine_set_usb);
> > >       object_class_property_set_description(oc, "usb",
> > > diff --git a/include/hw/boards.h b/include/hw/boards.h
> > > index 5966069..5a87647 100644
> > > --- a/include/hw/boards.h
> > > +++ b/include/hw/boards.h
> > > @@ -393,6 +393,7 @@ struct MachineState {
> > >       bool enable_graphics;
> > >       ConfidentialGuestSupport *cgs;
> > >       HostMemoryBackend *memdev;
> > > +    AnonAllocOption anon_alloc;
> > >       /*
> > >        * convenience alias to ram_memdev_id backend memory region
> > >        * or to numa container memory region
> > > diff --git a/qapi/machine.json b/qapi/machine.json
> > > index a6b8795..d4a63f5 100644
> > > --- a/qapi/machine.json
> > > +++ b/qapi/machine.json
> > > @@ -1898,3 +1898,17 @@
> > >   { 'command': 'x-query-interrupt-controllers',
> > >     'returns': 'HumanReadableText',
> > >     'features': [ 'unstable' ]}
> > > +
> > > +##
> > > +# @AnonAllocOption:
> > > +#
> > > +# An enumeration of the options for allocating anonymous guest memory.
> > > +#
> > > +# @mmap: allocate using mmap MAP_ANON
> > > +#
> > > +# @memfd: allocate using memfd_create
> > > +#
> > > +# Since: 9.2
> > > +##
> > > +{ 'enum': 'AnonAllocOption',
> > > +  'data': [ 'mmap', 'memfd' ] }
> > > diff --git a/qemu-options.hx b/qemu-options.hx
> > > index d94e2cb..90ab943 100644
> > > --- a/qemu-options.hx
> > > +++ b/qemu-options.hx
> > > @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> > >       "                nvdimm=on|off controls NVDIMM support 
> > > (default=off)\n"
> > >       "                memory-encryption=@var{} memory encryption object 
> > > to use (default=none)\n"
> > >       "                hmat=on|off controls ACPI HMAT support 
> > > (default=off)\n"
> > > +    "                anon-alloc=mmap|memfd allocate anonymous guest RAM 
> > > using mmap MAP_ANON or memfd_create (default: mmap)\n"
> > >       "                memory-backend='backend-id' specifies explicitly 
> > > provided backend for main RAM (default=none)\n"
> > >       "                
> > > cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
> > >       QEMU_ARCH_ALL)
> > > @@ -101,6 +102,16 @@ SRST
> > >           Enables or disables ACPI Heterogeneous Memory Attribute Table
> > >           (HMAT) support. The default is off.
> > > +    ``anon-alloc=mmap|memfd``
> > > +        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
> > > +        or memfd_create.  This option applies to memory allocated as a
> > > +        side effect of creating various devices. It does not apply to
> > > +        memory-backend-objects, whether explicitly specified on the
> > > +        command line, or implicitly created by the -m command line
> > > +        option.
> > > +
> > > +        Some migration modes require anon-alloc=memfd.
> > > +
> > >       ``memory-backend='id'``
> > >           An alternative to legacy ``-mem-path`` and ``mem-prealloc`` 
> > > options.
> > >           Allows to use a memory backend as main RAM.
> > > diff --git a/system/physmem.c b/system/physmem.c
> > > index dc1db3a..174f7e0 100644
> > > --- a/system/physmem.c
> > > +++ b/system/physmem.c
> > > @@ -47,6 +47,7 @@
> > >   #include "qemu/qemu-print.h"
> > >   #include "qemu/log.h"
> > >   #include "qemu/memalign.h"
> > > +#include "qemu/memfd.h"
> > >   #include "exec/memory.h"
> > >   #include "exec/ioport.h"
> > >   #include "sysemu/dma.h"
> > > @@ -69,6 +70,8 @@
> > >   #include "qemu/pmem.h"
> > > +#include "qapi/qapi-types-migration.h"
> > > +#include "migration/options.h"
> > >   #include "migration/vmstate.h"
> > >   #include "qemu/range.h"
> > > @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, 
> > > Error **errp)
> > >                   qemu_mutex_unlock_ramlist();
> > >                   return;
> > >               }
> > > +
> > > +        } else if (current_machine->anon_alloc == 
> > > ANON_ALLOC_OPTION_MEMFD &&
> > > +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
> > > +                                        TYPE_MEMORY_BACKEND)) {
> > 
> > This is pretty fragile.. if someone adds yet another layer on top of memory
> > backend objects, the ownership links can change and this might silently run
> > into something else even without any warning..
> > 
> > I wished we dig into what is missing, but maybe that's too trivial.  If
> > not, we still need to make this as solid.  Perhaps that can be a ram flag
> > and let relevant callers pass in that flag explicitly.
> 
> How would they decide whether or not we want to set the flag in the current
> configuration?

It was in the previous email where it got cut..  I listed four paths that
may need change.

> 
> > 
> > I think RAM_SHARED can actually be that flag already - I mean, in all paths
> > that we may create anon mem (but not memory-backend-* objects), is it
> > always safe we always switch to RAM_SHARED from anon?
> 
> Do you mean only setting the flag (-> anonymous shmem) or switching also to
> memfd, which is a bigger change?

Switching to memfd.  I thought anon shmem (mmap(MAP_SHARED)) is mostly the
same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no?

In this case we need the fds so we need to do the latter to cache the fds.

Thanks,

-- 
Peter Xu


Reply via email to