Re: [PATCH V3 01/16] machine: anon-alloc option

David Hildenbrand Thu, 07 Nov 2024 08:45:18 -0800

On 07.11.24 15:04, Steven Sistare wrote:

On 11/7/2024 8:05 AM, David Hildenbrand wrote:

On 06.11.24 21:59, Steven Sistare wrote:

On 11/6/2024 3:41 PM, Peter Xu wrote:

On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:

On 11/4/2024 4:36 PM, David Hildenbrand wrote:

On 04.11.24 21:56, Steven Sistare wrote:

On 11/4/2024 3:15 PM, David Hildenbrand wrote:

On 04.11.24 20:51, David Hildenbrand wrote:

On 04.11.24 18:38, Steven Sistare wrote:

On 11/4/2024 5:39 AM, David Hildenbrand wrote:

On 01.11.24 14:47, Steve Sistare wrote:

Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This option applies to
memory allocated as a side effect of creating various devices. It does
not apply to memory-backend-objects, whether explicitly specified on
the command line, or implicitly created by the -m command line option.


The memfd option is intended to support new migration modes, in which the
memory region can be transferred in place to a new QEMU process, by sending
the memfd file descriptor to the process.  Memory contents are preserved,
and if the mode also transfers device descriptors, then pages that are
locked in memory for DMA remain locked.  This behavior is a pre-requisite
for supporting vfio, vdpa, and iommufd devices with the new modes.


A more portable, non-Linux specific variant of this will be using shm,
similar to backends/hostmem-shm.c.

Likely we should be using that instead of memfd, or try hiding the
details. See below.


For this series I would prefer to use memfd and hide the details.  It's a
concise (and well tested) solution albeit linux only.  The code you supply
for posix shm would be a good follow on patch to support other unices.


Unless there is reason to use memfd we should start with the more
generic POSIX variant that is available even on systems without memfd.
Factoring stuff out as I drafted does look quite compelling.

I can help with the rework, and send it out separately, so you can focus
on the "machine toggle" as part of this series.

Of course, if we find out we need the memfd internally instead under
Linux for whatever reason later, we can use that instead.

But IIUC, the main selling point for memfd are additional features
(hugetlb, memory sealing) that you aren't even using.


FWIW, I'm looking into some details, and one difference is that shmem_open() 
under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal 
tmpfs mount. There is not a big difference, but there can be some difference 
(e.g., sizing of the /dev/shm mount).


Sizing is a non-trivial difference.  One can by default allocate all memory 
using memfd_create.
To do so using shm_open requires configuration on the mount.  One step harder 
to use.


Yes.


This is a real issue for memory-backend-ram, and becomes an issue for the 
internal RAM
if memory-backend-ram has hogged all the memory.

Regarding memory-backend-ram,share=on, I assume we can use memfd if available, 
but then fallback to shm_open().


Yes, and if that is a good idea, then the same should be done for internal RAM
-- memfd if available and fallback to shm_open.


Yes.

I'm hoping we can find a way where it just all is rather intuitive, like

"default-ram-share=on": behave for internal RAM just like 
"memory-backend-ram,share=on"

"memory-backend-ram,share=on": use whatever mechanism we have to give us 
"anonymous" memory that can be shared using an fd with another process.

Thoughts?


Agreed, though I thought I had already landed at the intuitive specification in 
my patch.
The user must explicitly configure memory-backend-* to be usable with CPR, and 
anon-alloc
controls everything else.  Now we're just riffing on the details: memfd vs 
shm_open, spelling
of options and words to describe them.


Well, yes, and making it all a bit more consistent and the "machine option" behave just 
like "memory-backend-ram,share=on".


Hi David and Peter,

I have implemented and tested the following, for both qemu_memfd_create
and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
for simplicity.


I'm ok with either shm or memfd, as this feature only applies to Linux
anyway.  I'll leave that part to you and David to decide.


Any comments before I submit a complete patch?

----
qemu-options.hx:
       ``aux-ram-share=on|off``
           Allocate auxiliary guest RAM as an anonymous file that is
           shareable with an external process.  This option applies to
           memory allocated as a side effect of creating various devices.
           It does not apply to memory-backend-objects, whether explicitly
           specified on the command line, or implicitly created by the -m
           command line option.

           Some migration modes require aux-ram-share=on.

qapi/migration.json:
       @cpr-transfer:
            ...
            Memory-backend objects must have the share=on attribute, but
            memory-backend-epc is not supported.  The VM must be started
            with the '-machine aux-ram-share=on' option.

Define RAM_PRIVATE

Define qemu_shm_alloc(), from David's tmp patch

ram_backend_memory_alloc()
       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
       memory_region_init_ram_flags_nomigrate(ram_flags)


Looks all good until here.


qemu_ram_alloc_internal()
       ...
       if (!host && !(ram_flags & RAM_PRIVATE) && 
current_machine->aux_ram_share)


Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
that's equal to RAM_PREALLOC.


IMO testing host is clearer and more future proof, regardless of how flags
are currently used.  If the caller passes host, then we should not allocate
memory here, full stop.

Meanwhile I slightly prefer we don't touch
anything if SHARED|PRIVATE is set.


OK, if SHARED is already set I will not set it again.


We only have to make sure that stuff like qemu_ram_is_shared() will continue 
working as expected.

What I think we should do:

We should probably assert that nobody passes in SHARED|PRIVATE. And we can use 
PRIVATE only as a parameter to the function, but never actually set it on the 
ramblock.

If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED 
remains cleared)

If someone passes in SHARED, we do set it in block->flags.
If someone passes PRIVATE|SHARED, we assert.

If someone passes in nothing: we set block->flags to SHARED with 
aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)

If that's also what you had in mind, great.


Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
I will undo the latter.

Do you plan to submit the part of your "tmp" patch that refactors
shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
I could include it in my series, with your Signed-off-by.


My patch went a bit too far I think. And would not work on win32 :)

We should probably start with this:


From 124920aeda2756faa104bfa6e934c7c20b1fbbe9 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <da...@redhat.com>
Date: Mon, 4 Nov 2024 11:29:22 +0100
Subject: [PATCH] backends/hostmem-shm: factor out allocation of "anonymous
 shared memory with an fd"

Let's factor it out so we can reuse it.

Signed-off-by: David Hildenbrand <da...@redhat.com>
---
 backends/hostmem-shm.c | 45 ++++-------------------------------
 include/qemu/osdep.h   |  1 +
 system/physmem.c       |  2 +-
 util/oslib-posix.c     | 53 ++++++++++++++++++++++++++++++++++++++++++
 util/oslib-win32.c     |  6 +++++
 5 files changed, 65 insertions(+), 42 deletions(-)

diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c
index 374edc3db8..837b9f1dd4 100644
--- a/backends/hostmem-shm.c
+++ b/backends/hostmem-shm.c
@@ -25,11 +25,9 @@ struct HostMemoryBackendShm {
 static bool
 shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
-    g_autoptr(GString) shm_name = g_string_new(NULL);
     g_autofree char *backend_name = NULL;
     uint32_t ram_flags;
-    int fd, oflag;
-    mode_t mode;
+    int fd;

if (!backend->size) {

         error_setg(errp, "can't create shm backend with size 0");
@@ -41,48 +39,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
         return false;
     }

- /*

-     * Let's use `mode = 0` because we don't want other processes to open our
-     * memory unless we share the file descriptor with them.
-     */
-    mode = 0;
-    oflag = O_RDWR | O_CREAT | O_EXCL;
-    backend_name = host_memory_backend_get_name(backend);
-
-    /*
-     * Some operating systems allow creating anonymous POSIX shared memory
-     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
-     * defined by POSIX, so let's create a unique name.
-     *
-     * From Linux's shm_open(3) man-page:
-     *   For  portable  use,  a shared  memory  object should be identified
-     *   by a name of the form /somename;"
-     */
-    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
-                    backend_name);
-
-    fd = shm_open(shm_name->str, oflag, mode);
+    fd = qemu_shm_alloc(backend->size, errp);
     if (fd < 0) {
-        error_setg_errno(errp, errno,
-                         "failed to create POSIX shared memory");
-        return false;
-    }
-
-    /*
-     * We have the file descriptor, so we no longer need to expose the
-     * POSIX shared memory object. However it will remain allocated as long as
-     * there are file descriptors pointing to it.
-     */
-    shm_unlink(shm_name->str);
-
-    if (ftruncate(fd, backend->size) == -1) {
-        error_setg_errno(errp, errno,
-                         "failed to resize POSIX shared memory to %" PRIu64,
-                         backend->size);
-        close(fd);
         return false;
     }

+ /* Let's do the same as memory-backend-ram,share=on would do. */

+    backend_name = host_memory_backend_get_name(backend);
     ram_flags = RAM_SHARED;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h

index fe7c3c5f67..4a24f11174 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -505,6 +505,7 @@ int qemu_daemon(int nochdir, int noclose);
 void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
                           bool noreserve);
 void qemu_anon_ram_free(void *ptr, size_t size);
+int qemu_shm_alloc(size_t size, Error **errp);

#ifdef _WIN32

 #define HAVE_CHARDEV_SERIAL 1
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..1b477fec44 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2089,7 +2089,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, 
ram_addr_t max_size,
     new_block->page_size = qemu_real_host_page_size();
     new_block->host = host;
     new_block->flags = ram_flags;
-    ram_block_add(new_block, &local_err);
+
     if (local_err) {
         g_free(new_block);
         error_propagate(errp, local_err);
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 11b35e48fb..bc5c28b162 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -931,3 +931,56 @@ void qemu_close_all_open_fd(const int *skip, unsigned int 
nskip)
         qemu_close_all_open_fd_fallback(skip, nskip, open_max);
     }
 }
+
+int qemu_shm_alloc(size_t size, Error **errp)
+{
+    g_autoptr(GString) shm_name = g_string_new(NULL);
+    int fd, oflag, cur_sequence;
+    static int sequence;
+    mode_t mode;
+
+    cur_sequence = qatomic_fetch_inc(&sequence);
+
+    /*
+     * Let's use `mode = 0` because we don't want other processes to open our
+     * memory unless we share the file descriptor with them.
+     */
+    mode = 0;
+    oflag = O_RDWR | O_CREAT | O_EXCL;
+
+    /*
+     * Some operating systems allow creating anonymous POSIX shared memory
+     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+     * defined by POSIX, so let's create a unique name.
+     *
+     * From Linux's shm_open(3) man-page:
+     *   For  portable  use,  a shared  memory  object should be identified
+     *   by a name of the form /somename;"
+     */
+    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(),
+                    cur_sequence);
+
+    fd = shm_open(shm_name->str, oflag, mode);
+    if (fd < 0) {
+        error_setg_errno(errp, errno,
+                         "failed to create POSIX shared memory");
+        return -1;
+    }
+
+    /*
+     * We have the file descriptor, so we no longer need to expose the
+     * POSIX shared memory object. However it will remain allocated as long as
+     * there are file descriptors pointing to it.
+     */
+    shm_unlink(shm_name->str);
+
+    if (ftruncate(fd, size) == -1) {
+        error_setg_errno(errp, errno,
+                         "failed to resize POSIX shared memory to %" PRIu64,
+                         size);
+        close(fd);
+        return -1;
+    }
+
+    return fd;
+}
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index b623830d62..f79a190b78 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -877,3 +877,9 @@ void qemu_win32_map_free(void *ptr, HANDLE h, Error **errp)
     }
     CloseHandle(h);
 }
+
+int qemu_shm_alloc(size_t size, Error **errp)
+{
+    error_setg("Shared memory is not supported.");
+    return -1;
+}
--
2.47.0


So we can reuse it for the !host && RAM_SHARED case.


Do you have any comments on my proposed name aux-ram-share, or my proposed text


aux-ram-share works for me, I prefer "aux" over the "default" I had in mind.

for qemu-options.hx and migration.json?  Speaking now would prevent more version
churn later.


Both sounds good to me after a quick scan.


--
Cheers,

David / dhildenb

Re: [PATCH V3 01/16] machine: anon-alloc option

Reply via email to