Re: [PATCH V3 01/16] machine: anon-alloc option

Peter Xu Fri, 08 Nov 2024 05:44:55 -0800

On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote:
> On 07.11.24 17:40, Steven Sistare wrote:
> > On 11/7/2024 11:26 AM, David Hildenbrand wrote:
> > > On 07.11.24 17:02, Steven Sistare wrote:
> > > > On 11/7/2024 8:23 AM, David Hildenbrand wrote:
> > > > > On 06.11.24 21:12, Steven Sistare wrote:
> > > > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > > > > > > On 04.11.24 21:56, Steven Sistare wrote:
> > > > > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > > > > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > > > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or 
> > > > > > > > > > > > > memfd_create depending
> > > > > > > > > > > > > on the value of the anon-alloc machine property.  
> > > > > > > > > > > > > This option applies to
> > > > > > > > > > > > > memory allocated as a side effect of creating various 
> > > > > > > > > > > > > devices. It does
> > > > > > > > > > > > > not apply to memory-backend-objects, whether 
> > > > > > > > > > > > > explicitly specified on
> > > > > > > > > > > > > the command line, or implicitly created by the -m 
> > > > > > > > > > > > > command line option.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The memfd option is intended to support new migration 
> > > > > > > > > > > > > modes, in which the
> > > > > > > > > > > > > memory region can be transferred in place to a new 
> > > > > > > > > > > > > QEMU process, by sending
> > > > > > > > > > > > > the memfd file descriptor to the process.  Memory 
> > > > > > > > > > > > > contents are preserved,
> > > > > > > > > > > > > and if the mode also transfers device descriptors, 
> > > > > > > > > > > > > then pages that are
> > > > > > > > > > > > > locked in memory for DMA remain locked.  This 
> > > > > > > > > > > > > behavior is a pre-requisite
> > > > > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with 
> > > > > > > > > > > > > the new modes.
> > > > > > > > > > > > 
> > > > > > > > > > > > A more portable, non-Linux specific variant of this 
> > > > > > > > > > > > will be using shm,
> > > > > > > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > > > > > > 
> > > > > > > > > > > > Likely we should be using that instead of memfd, or try 
> > > > > > > > > > > > hiding the
> > > > > > > > > > > > details. See below.
> > > > > > > > > > > 
> > > > > > > > > > > For this series I would prefer to use memfd and hide the 
> > > > > > > > > > > details.  It's a
> > > > > > > > > > > concise (and well tested) solution albeit linux only.  
> > > > > > > > > > > The code you supply
> > > > > > > > > > > for posix shm would be a good follow on patch to support 
> > > > > > > > > > > other unices.
> > > > > > > > > > 
> > > > > > > > > > Unless there is reason to use memfd we should start with 
> > > > > > > > > > the more
> > > > > > > > > > generic POSIX variant that is available even on systems 
> > > > > > > > > > without memfd.
> > > > > > > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > > > > > > 
> > > > > > > > > > I can help with the rework, and send it out separately, so 
> > > > > > > > > > you can focus
> > > > > > > > > > on the "machine toggle" as part of this series.
> > > > > > > > > > 
> > > > > > > > > > Of course, if we find out we need the memfd internally 
> > > > > > > > > > instead under
> > > > > > > > > > Linux for whatever reason later, we can use that instead.
> > > > > > > > > > 
> > > > > > > > > > But IIUC, the main selling point for memfd are additional 
> > > > > > > > > > features
> > > > > > > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > > > > > > 
> > > > > > > > > FWIW, I'm looking into some details, and one difference is 
> > > > > > > > > that shmem_open() under Linux (glibc) seems to go to 
> > > > > > > > > /dev/shmem and memfd/SYSV go to the internal tmpfs mount. 
> > > > > > > > > There is not a big difference, but there can be some 
> > > > > > > > > difference (e.g., sizing of the /dev/shm mount).
> > > > > > > > 
> > > > > > > > Sizing is a non-trivial difference.  One can by default 
> > > > > > > > allocate all memory using memfd_create.
> > > > > > > > To do so using shm_open requires configuration on the mount.  
> > > > > > > > One step harder to use.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > 
> > > > > > > > This is a real issue for memory-backend-ram, and becomes an 
> > > > > > > > issue for the internal RAM
> > > > > > > > if memory-backend-ram has hogged all the memory.
> > > > > > > > 
> > > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use 
> > > > > > > > > memfd if available, but then fallback to shm_open().
> > > > > > > > 
> > > > > > > > Yes, and if that is a good idea, then the same should be done 
> > > > > > > > for internal RAM
> > > > > > > > -- memfd if available and fallback to shm_open.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > I'm hoping we can find a way where it just all is rather 
> > > > > > > > > intuitive, like
> > > > > > > > > 
> > > > > > > > > "default-ram-share=on": behave for internal RAM just like 
> > > > > > > > > "memory-backend-ram,share=on"
> > > > > > > > > 
> > > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have 
> > > > > > > > > to give us "anonymous" memory that can be shared using an fd 
> > > > > > > > > with another process.
> > > > > > > > > 
> > > > > > > > > Thoughts?
> > > > > > > > 
> > > > > > > > Agreed, though I thought I had already landed at the intuitive 
> > > > > > > > specification in my patch.
> > > > > > > > The user must explicitly configure memory-backend-* to be 
> > > > > > > > usable with CPR, and anon-alloc
> > > > > > > > controls everything else.  Now we're just riffing on the 
> > > > > > > > details: memfd vs shm_open, spelling
> > > > > > > > of options and words to describe them.
> > > > > > > 
> > > > > > > Well, yes, and making it all a bit more consistent and the 
> > > > > > > "machine option" behave just like "memory-backend-ram,share=on".
> > > > > > 
> > > > > > Hi David and Peter,
> > > > > > 
> > > > > > I have implemented and tested the following, for both 
> > > > > > qemu_memfd_create
> > > > > > and qemu_shm_alloc.  This is pseudo-code, with error conditions 
> > > > > > omitted
> > > > > > for simplicity.
> > > > > > 
> > > > > > Any comments before I submit a complete patch?
> > > > > > 
> > > > > > ----
> > > > > > qemu-options.hx:
> > > > > >         ``aux-ram-share=on|off``
> > > > > >             Allocate auxiliary guest RAM as an anonymous file that 
> > > > > > is
> > > > > >             shareable with an external process.  This option 
> > > > > > applies to
> > > > > >             memory allocated as a side effect of creating various 
> > > > > > devices.
> > > > > >             It does not apply to memory-backend-objects, whether 
> > > > > > explicitly
> > > > > >             specified on the command line, or implicitly created by 
> > > > > > the -m
> > > > > >             command line option.
> > > > > > 
> > > > > >             Some migration modes require aux-ram-share=on.
> > > > > > 
> > > > > > qapi/migration.json:
> > > > > >         @cpr-transfer:
> > > > > >              ...
> > > > > >              Memory-backend objects must have the share=on 
> > > > > > attribute, but
> > > > > >              memory-backend-epc is not supported.  The VM must be 
> > > > > > started
> > > > > >              with the '-machine aux-ram-share=on' option.
> > > > > > 
> > > > > > Define RAM_PRIVATE
> > > > > > 
> > > > > > Define qemu_shm_alloc(), from David's tmp patch
> > > > > > 
> > > > > > ram_backend_memory_alloc()
> > > > > >         ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
> > > > > >         memory_region_init_ram_flags_nomigrate(ram_flags)
> > > > > > 
> > > > > > qemu_ram_alloc_internal()
> > > > > >         ...
> > > > > >         if (!host && !(ram_flags & RAM_PRIVATE) && 
> > > > > > current_machine->aux_ram_share)
> > > > > >             new_block->flags |= RAM_SHARED;
> > > > > > 
> > > > > >         if (!host && (new_block->flags & RAM_SHARED)) {
> > > > > >             qemu_ram_alloc_shared(new_block);
> > > > > >         } else
> > > > > >             new_block->fd = -1;
> > > > > >             new_block->host = host;
> > > > > >         }
> > > > > >         ram_block_add(new_block);
> > > > > > 
> > > > > > qemu_ram_alloc_shared()
> > > > > >         if qemu_memfd_check()
> > > > > >             new_block->fd = qemu_memfd_create()
> > > > > >         else
> > > > > >             new_block->fd = qemu_shm_alloc()
> > > > > 
> > > > > Yes, that way "memory-backend-ram,share=on" will just mean "give me 
> > > > > the best shared memory for RAM to be shared with other processes, I 
> > > > > don't care about the details", and it will work on Linux kernels even 
> > > > > before we had memfds.
> > > > > 
> > > > > memory-backend-ram should be available on all architectures, and 
> > > > > under Windows. qemu_anon_ram_alloc() under Linux just does nothing 
> > > > > special, not even bail out.
> > > > > 
> > > > > MAP_SHARED|MAP_ANON was always weird, because it meant "give me 
> > > > > memory I can share only with subprocesses", but then, *there are not 
> > > > > subprocesses for QEMU*. I recall there was a trick to obtain the fd 
> > > > > under Linux for these regions using /proc/self/fd/, but it's very 
> > > > > Linux specific ...
> > > > > 
> > > > > So nobody would *actually* use that shared memory and it was only a 
> > > > > hack for RDMA. Now we can do better.
> > > > > 
> > > > > 
> > > > > We'll have to decide if we simply fallback to qemu_anon_ram_alloc() 
> > > > > if no shared memory can be created (unavailable), like we do on 
> > > > > Windows.
> > > > > 
> > > > > So maybe something like
> > > > > 
> > > > > qemu_ram_alloc_shared()
> > > > >        fd = -1;
> > > > > 
> > > > >        if (qemu_memfd_avilable()) {
> > > > >            fd = qemu_memfd_create();
> > > > >            if (fd < 0)
> > > > >                ... error
> > > > >        } else if (qemu_shm_available())
> > > > >            fd = qemu_shm_alloc();
> > > > >            if (fd < 0)
> > > > >                ... error
> > > > >        } else {
> > > > >            /*
> > > > >             * Old behavior: try fd-less shared memory. We might
> > > > >             * just end up with non-shared memory on Windows, but
> > > > >             * nobody can make sure of this shared memory either way
> > > > >             * ... should we just use non-shared memory? Or should
> > > > >             * we simply bail out? But then, if there is no shared
> > > > >             * memory nobody could possible use it.
> > > > >             */
> > > > >            qemu_anon_ram_alloc(share=true)
> > > > >        }
> > > > 
> > > > Good catch.  We need that fallback for backwards compatibility.  Even 
> > > > with
> > > > no use case for memory-backend-ram,share=on since the demise of rdma, 
> > > > users
> > > > may specify it on windows, for no particular reason, but it works, and 
> > > > should
> > > > continue to work after this series.  CPR would be blocked.
> > > 
> > > Yes, we should keep Windows working in the weird way it is working right 
> > > now.
> > > 
> > >   > > More generally for backwards compatibility for share=on for no 
> > > particular reason,
> > > > should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted 
> > > > with default
> > > > options and more than half of ram is requested, it will fail, whereas 
> > > > current qemu
> > > > succeeds using MAP_SHARED|MAP_ANON.
> > > 
> > > Only on Linux without memfd, of course. Maybe we should just warn when 
> > > qemu_shm_alloc() fails (and comment that we continue for compat reasons 
> > > only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We 
> > > could implement a fallback to shmget() but ... let's not go down that 
> > > path.
> > > 
> > > But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if 
> > > memfd is available and that allocating the memfd failed. Failing to 
> > > allocate a memfd might highlight a bigger problem.
> > 
> > Agreed on all.
> > 
> > One more opinion from you please, if you will.
> > 
> > RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
> > set in
> >     ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
> > 
> > None of the other backends reach qemu_ram_alloc_internal.
> > 
> > To be future proof, do you prefer I also set MAP_PRIVATE in the other 
> > backends,
> > everywhere MAP_SHARED may be set, eg:
> 
> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want
> it and relied on !RAM_SHARED doing the right thing.
> 
> Alternatively, we make our life easier and do something like
> 
> /*
>  * This flag is only used while creating+allocating RAM, and
>  * prevents RAM_SHARED getting set for anonymous RAM automatically in
>  * some configurations.
>  *
>  * By default, not setting RAM_SHARED on anonymous RAM implies
>  * "private anonymous RAM"; however, in some configuration we want to
>  * have most of this RAM automatically be "sharable anonymous RAM",
>  * except for some cases that really want "private anonymous RAM".
>  *
>  * This anonymous RAM *must* be private. This flag only applies to
>  * "anonymous" RAM, not fd/file-backed/preallocated one.
>  */
> RAM_FORCE_ANON_PRIVATE        (1 << 13)
> 
> 
> BUT maybe an even better alternative now that we have the "aux-ram-share"
> parameter, could we use
> 
> /*
>  * Auxiliary RAM that was created automatically internally, instead of
>  * explicitly like using memory-backend-ram or some other device on the
>  * QEMU cmdline.
>  */
> RAM_AUX       (1 << 13)
> 
> 
> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX
> RAMBlocks.
> 
> That actually looks quite compelling to me :)


Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
place?

IMHO RAM_AUX is too hard for any new callers to know how to set.  It's much
easier when we already have SHARED, adding PRIVATE could be mostly natural,
then we can already avoid AUX due to checking !SHARED & !PRIVATE.

Basically, SHARED|PRIVATE then must come from an user request (QMP or
cmdline), otherwise the caller should always set none of them, implying
aux.

It still looks the best to me.

Thanks,

-- 
Peter Xu

Re: [PATCH V3 01/16] machine: anon-alloc option

Reply via email to