On 13.05.25 22:11, Michael S. Tsirkin wrote:
On Tue, May 13, 2025 at 07:21:36PM +0200, David Hildenbrand wrote:
On 12.05.25 17:16, Chaney, Ben wrote:
Hello,
When live migrating to a destination host with pmem, there is a very
long downtime where the guest is paused. In some cases this can be as long as
5 minutes, compared to less than one second in the good case.
Profiling suggests very high activity in this code path:
ffffffffa2956de6 clean_cache_range+0x26 ([kernel.kallsyms])
ffffffffa2359b0f dax_writeback_mapping_range+0x1ef ([kernel.kallsyms])
ffffffffc0c6336d ext4_dax_writepages+0x7d ([kernel.kallsyms])
ffffffffa2242dac do_writepages+0xbc ([kernel.kallsyms])
ffffffffa2235ea6 filemap_fdatawrite_wbc+0x66 ([kernel.kallsyms])
ffffffffa223a896 __filemap_fdatawrite_range+0x46 ([kernel.kallsyms])
ffffffffa223af73 file_write_and_wait_range+0x43 ([kernel.kallsyms])
ffffffffc0c57ecb ext4_sync_file+0xfb ([kernel.kallsyms])
ffffffffa228a331 __do_sys_msync+0x1c1 ([kernel.kallsyms])
ffffffffa2997fe6 do_syscall_64+0x56 ([kernel.kallsyms])
ffffffffa2a00126 entry_SYSCALL_64_after_hwframe+0x6e ([kernel.kallsyms])
11ec5f msync+0x4f (/usr/lib/x86_64-linux-gnu/libc.so.6)
675ada qemu_ram_msync+0x8a (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
6873c7 xbzrle_load_cleanup+0x37 (inlined)
6873c7 ram_load_cleanup+0x37 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
4ff375 qemu_loadvm_state_cleanup+0x55 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
500f0b qemu_loadvm_state+0x15b (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
4ecf85 process_incoming_migration_co+0x95 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
8b6412 qemu_coroutine_self+0x2 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
ffffffffffffffff [unknown] ([unknown])
I was able to resolve the performance issue by removing the call to
qemu_ram_block_writeback() in ram_load_cleanup(), which brings the downtime
back to normal. It looks like this code path was originally added to ensure
the memory is synchronized when the persistent memory region is backed by an
NVDIMM device. Does it serve any purpose if pmem is instead backed by standard
DRAM?
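For reference, the writeback in question is the per-RAMBlock loop run once all
migration data has been loaded; roughly (paraphrased, details may differ
between QEMU versions):

static int ram_load_cleanup(void *opaque)
{
    RAMBlock *rb;

    /* Flush every RAM block to its backing store after loading */
    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
        qemu_ram_block_writeback(rb);
    }
    ...
}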
Are you using a read-only NVDIMM?
In that case, I assume we would never need msync.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 94bb3ccbe4..819b8ef829 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -153,7 +153,8 @@ void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
 /* Clear whole block of mem */
 static inline void qemu_ram_block_writeback(RAMBlock *block)
 {
-    qemu_ram_msync(block, 0, block->used_length);
+    if (!(block->flags & RAM_READONLY))
+        qemu_ram_msync(block, 0, block->used_length);
 }
--
Cheers,
David / dhildenb
I acked the original change, but now I don't understand why it is
critical to preserve memory at a random time that has nothing
to do with guest state.
David, maybe you understand?
Let me dig ...
As you said, we originally added pmem_persist() in:
commit 56eb90af39abf66c0e80588a9f50c31e7df7320b (mst/mst-next)
Author: Junyan He <junyan...@intel.com>
Date: Wed Jul 18 15:48:03 2018 +0800
migration/ram: ensure write persistence on loading all data to PMEM.
Because we need to make sure the pmem kind memory data is synced
after migration, we choose to call pmem_persist() when the migration
finish. This will make sure the data of pmem is safe and will not
lose if power is off.
Signed-off-by: Junyan He <junyan...@intel.com>
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Igor Mammedov <imamm...@redhat.com>
Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Then, we generalized it: instead of calling pmem_persist() directly, we do a
qemu_ram_block_writeback(), which includes a conditional pmem_persist(), in:
commit bd108a44bc29cb648dd930564996b0128e66ac01
Author: Beata Michalska <beata.michal...@linaro.org>
Date: Thu Nov 21 00:08:42 2019 +0000
migration: ram: Switch to ram block writeback
Switch to ram block writeback for pmem migration.
Signed-off-by: Beata Michalska <beata.michal...@linaro.org>
Reviewed-by: Richard Henderson <richard.hender...@linaro.org>
Reviewed-by: Alex Bennée <alex.ben...@linaro.org>
Acked-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
Message-id: 20191121000843.24844-4-beata.michal...@linaro.org
Signed-off-by: Peter Maydell <peter.mayd...@linaro.org>
That was part of the patch series "[PATCH 0/4] target/arm: Support for
Data Cache Clean up to PoP" [1].
At first glance it looks like a mere cleanup, but it has the side effect of
applying the writeback to non-pmem memory backends as well.
A discussion [2] includes some reasoning around libpmem not always being
available, and msync being a suitable replacement in that case [3]:
"According to the PMDK man page, pmem_persist is supposed to be equivalent
to msync. It's just more performant. So in case of real pmem hardware it
should be all good."
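For completeness, qemu_ram_msync() still reflects that reasoning; it is roughly
(simplified, from memory):

void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length)
{
#ifdef CONFIG_LIBPMEM
    /* Real pmem: flushing CPU caches is sufficient, no syscall needed */
    if (ramblock_is_pmem(block)) {
        pmem_persist(ramblock_ptr(block, start), length);
        return;
    }
#endif
    /* No libpmem, or not pmem: fall back to msync() on the backing file */
    if (block->fd >= 0) {
        qemu_msync(ramblock_ptr(block, start), length, block->fd);
    }
}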
So, the real question is: why do we have to sync *after* migration on the
migration *destination*?
I think the reason is simple if you assume that the pmem device will
differ between source and destination, and that we actually migrated
that data in the migration stream.
On the migration destination, we will fill pmem with data we obtained
from the src via the migration stream: writing the data to pmem using
ordinary memory writes.
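For context, the precopy load path on the destination receives each page
straight into the mapped host address; from memory, it is essentially

    case RAM_SAVE_FLAG_PAGE:
        /* plain CPU stores into the (possibly file-backed) mapping */
        qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
        break;

in migration/ram.c. Nothing about these stores is persistent by itself.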
pmem requires a sync to make sure that the data is *actually* persisted.
The VM will certainly not issue a sync, because it didn't modify any
pages. So we have to issue a sync such that pmem is guaranteed to be
persisted.
In the case of ordinary files, this means writing the data back to disk
("persist on disk"). I'll note that NVDIMMs backed by ordinary files are not
suitable in general, because we cannot easily implement guest-triggered pmem
syncs using the basic instruction set. For R/O NVDIMMs it's fine.
For the R/W use case, virtio-pmem was invented: the VM requests the sync,
which turns into an msync on the host, via an explicit guest->host call. So
once the guest has synced, the data is actually persisted.
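To make the distinction concrete: the host side of such a guest-triggered
flush essentially boils down to syncing the backing file. Purely as an
illustration (not the actual virtio-pmem code, which goes through a virtqueue
request and a thread pool):

/* Illustrative only: handle a guest flush request for a file-backed
 * virtio-pmem device by syncing the backing fd. */
static int handle_guest_flush(int backing_fd)
{
    return fsync(backing_fd) == 0 ? 0 : -errno;
}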
Now, NVDIMMs could be safely used in R/O mode backed by ordinary files.
Here, we would *still* want to do this msync: the destination wrote the
migrated contents into the file-backed mapping, and they still have to reach
the backing file.
So, we can really only safely skip the msync if we know that the mmap() is
R/O (in which case, migration probably would fail either way? unless the
RAMBlock is ignored).
While we could skip the msync if we detect that we are dealing with an
ordinary file, there might still be the case where we have a R/W NVDIMM that
simply nobody ever writes to ... so it's tricky. Certainly
worth exploring. But then there would be a chance of data loss for R/O
NVDIMMs after migration if the hypervisor crashes ...
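If we wanted to explore that, a first stab might look something like this
(untested sketch; whether to key off ramblock_is_pmem(), RAM_READONLY, or
both is exactly the open question):

static inline void qemu_ram_block_writeback(RAMBlock *block)
{
    /*
     * Untested idea: only write back actual pmem, skipping ordinary
     * file-backed RAM. That avoids the long msync() in ram_load_cleanup(),
     * but gives up the post-migration persistence guarantee for
     * file-backed NVDIMMs.
     */
    if (ramblock_is_pmem(block)) {
        qemu_ram_msync(block, 0, block->used_length);
    }
}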
[1]
https://patchew.org/QEMU/20191121000843.24844-1-beata.michal...@linaro.org/
[2]
https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01750.html
[3]
https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01772.html
--
Cheers,
David / dhildenb