On 13.05.25 22:11, Michael S. Tsirkin wrote:
On Tue, May 13, 2025 at 07:21:36PM +0200, David Hildenbrand wrote:
On 12.05.25 17:16, Chaney, Ben wrote:
Hello,
When live migrating to a destination host with pmem, there is a very
long downtime where the guest is paused. In some cases this can be as long as
5 minutes, compared to less than one second in the good case.
Profiling suggests very high activity in this code path:
ffffffffa2956de6 clean_cache_range+0x26 ([kernel.kallsyms])
ffffffffa2359b0f dax_writeback_mapping_range+0x1ef ([kernel.kallsyms])
ffffffffc0c6336d ext4_dax_writepages+0x7d ([kernel.kallsyms])
ffffffffa2242dac do_writepages+0xbc ([kernel.kallsyms])
ffffffffa2235ea6 filemap_fdatawrite_wbc+0x66 ([kernel.kallsyms])
ffffffffa223a896 __filemap_fdatawrite_range+0x46 ([kernel.kallsyms])
ffffffffa223af73 file_write_and_wait_range+0x43 ([kernel.kallsyms])
ffffffffc0c57ecb ext4_sync_file+0xfb ([kernel.kallsyms])
ffffffffa228a331 __do_sys_msync+0x1c1 ([kernel.kallsyms])
ffffffffa2997fe6 do_syscall_64+0x56 ([kernel.kallsyms])
ffffffffa2a00126 entry_SYSCALL_64_after_hwframe+0x6e ([kernel.kallsyms])
11ec5f msync+0x4f (/usr/lib/x86_64-linux-gnu/libc.so.6)
675ada qemu_ram_msync+0x8a (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
6873c7 xbzrle_load_cleanup+0x37 (inlined)
6873c7 ram_load_cleanup+0x37 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
4ff375 qemu_loadvm_state_cleanup+0x55 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
500f0b qemu_loadvm_state+0x15b (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
4ecf85 process_incoming_migration_co+0x95 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
8b6412 qemu_coroutine_self+0x2 (/usr/local/akamai/qemu/bin/qemu-system-x86_64)
ffffffffffffffff [unknown] ([unknown])
I was able to resolve the performance issue by removing the call to
qemu_ram_block_writeback() in ram_load_cleanup(), which brings the downtime
back to normal. It looks like this code path was originally added to ensure
the memory is synchronized when the persistent memory region is backed by an
NVDIMM device. Does it serve any purpose if pmem is instead backed by standard
DRAM?
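For reference, the writeback in question is the per-RAMBlock loop run once all
migration data has been loaded; roughly (paraphrased, details may differ
between QEMU versions):

static int ram_load_cleanup(void *opaque)
{
    RAMBlock *rb;

    /* Flush every RAM block to its backing store after loading */
    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
        qemu_ram_block_writeback(rb);
    }
    ...
}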
Are you using a read-only NVDIMM?
In that case, I assume we would never need msync.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 94bb3ccbe4..819b8ef829 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -153,7 +153,8 @@ void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length);
 /* Clear whole block of mem */
 static inline void qemu_ram_block_writeback(RAMBlock *block)
 {
-    qemu_ram_msync(block, 0, block->used_length);
+    if (!(block->flags & RAM_READONLY))
+        qemu_ram_msync(block, 0, block->used_length);
 }
--
Cheers,
David / dhildenb
I acked the original change, but now I don't understand why it is
critical to preserve memory at a random time that has nothing
to do with guest state.
David, maybe you understand?
Let me dig ...
As you said, we originally added pmem_persist() in:
commit 56eb90af39abf66c0e80588a9f50c31e7df7320b (mst/mst-next)
Author: Junyan He <junyan...@intel.com>
Date: Wed Jul 18 15:48:03 2018 +0800
migration/ram: ensure write persistence on loading all data to PMEM.
Because we need to make sure the pmem kind memory data is synced
after migration, we choose to call pmem_persist() when the migration
finish. This will make sure the data of pmem is safe and will not
lose if power is off.
Signed-off-by: Junyan He <junyan...@intel.com>
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Igor Mammedov <imamm...@redhat.com>
Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Then, we generalized it: instead of calling pmem_persist() directly, we do a
qemu_ram_block_writeback(), which includes a conditional pmem_persist(), in:
commit bd108a44bc29cb648dd930564996b0128e66ac01
Author: Beata Michalska <beata.michal...@linaro.org>
Date: Thu Nov 21 00:08:42 2019 +0000
migration: ram: Switch to ram block writeback
Switch to ram block writeback for pmem migration.
Signed-off-by: Beata Michalska <beata.michal...@linaro.org>
Reviewed-by: Richard Henderson <richard.hender...@linaro.org>
Reviewed-by: Alex Bennée <alex.ben...@linaro.org>
Acked-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
Message-id: 20191121000843.24844-4-beata.michal...@linaro.org
Signed-off-by: Peter Maydell <peter.mayd...@linaro.org>
That was part of the patch series "[PATCH 0/4] target/arm: Support for
Data Cache Clean up to PoP" [1].
At first glance it looks like a mere cleanup, but it has the side effect of
applying the writeback to non-pmem memory backends as well.
A discussion [2] includes some reasoning around libpmem not always being
available, and msync being a suitable replacement in that case [3]:
"According to the PMDK man page, pmem_persist is supposed to be equivalent
to msync. It's just more performant. So in case of real pmem hardware it
should be all good."
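For completeness, qemu_ram_msync() still reflects that reasoning; it is roughly
(simplified, from memory):

void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length)
{
#ifdef CONFIG_LIBPMEM
    /* Real pmem: flushing CPU caches is sufficient, no syscall needed */
    if (ramblock_is_pmem(block)) {
        pmem_persist(ramblock_ptr(block, start), length);
        return;
    }
#endif
    /* No libpmem, or not pmem: fall back to msync() on the backing file */
    if (block->fd >= 0) {
        qemu_msync(ramblock_ptr(block, start), length, block->fd);
    }
}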
So, the real question is: why do we have to sync *after* migration on the
migration *destination*?
I think the reason is simple if you assume that the pmem device will
differ between source and destination, and that we actually migrated
that data in the migration stream.
On the migration destination, we will fill pmem with data we obtained
from the src via the migration stream: writing the data to pmem using
ordinary memory writes.
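For context, the precopy load path on the destination receives each page
straight into the mapped host address; from memory, it is essentially

    case RAM_SAVE_FLAG_PAGE:
        /* plain CPU stores into the (possibly file-backed) mapping */
        qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
        break;

in migration/ram.c. Nothing about these stores is persistent by itself.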
pmem requires a sync to make sure that the data is *actually* persisted.
The VM will certainly not issue a sync, because it didn't modify any
pages. So we have to issue a sync such that pmem is guaranteed to be
persisted.
In the case of ordinary files, this means writing the data back to disk
("persist on disk"). I'll note that NVDIMMs backed by ordinary files are not
suitable in general, because we cannot easily implement guest-triggered pmem
syncs using the basic instruction set. For R/O NVDIMMs it's fine.
For the R/W use case, virtio-pmem was invented: the VM requests the sync,
which turns into an msync on the host, via an explicit guest->host call. So
once the guest has synced, the data is actually persisted.
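To make the distinction concrete: the host side of such a guest-triggered
flush essentially boils down to syncing the backing file. Purely as an
illustration (not the actual virtio-pmem code, which goes through a virtqueue
request and a thread pool):

/* Illustrative only: handle a guest flush request for a file-backed
 * virtio-pmem device by syncing the backing fd. */
static int handle_guest_flush(int backing_fd)
{
    return fsync(backing_fd) == 0 ? 0 : -errno;
}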
Now, NVDIMMs could be safely used in R/O mode backed by ordinary files.
Here, we would *still* want to do this msync: the destination wrote the
migrated contents into the file-backed mapping, and they still have to reach
the backing file.
So, we can really only safely skip the msync if we know that the mmap() is
R/O (in which case, migration probably would fail either way? unless the
RAMBlock is ignored).
While we could skip the msync if we detect that we are dealing with an
ordinary file, there might still be the case where we have a R/W NVDIMM that
simply nobody ever writes to ... so it's tricky. Certainly
worth exploring. But then there would be a chance of data loss for R/O
NVDIMMs after migration if the hypervisor crashes ...
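If we wanted to explore that, a first stab might look something like this
(untested sketch; whether to key off ramblock_is_pmem(), RAM_READONLY, or
both is exactly the open question):

static inline void qemu_ram_block_writeback(RAMBlock *block)
{
    /*
     * Untested idea: only write back actual pmem, skipping ordinary
     * file-backed RAM. That avoids the long msync() in ram_load_cleanup(),
     * but gives up the post-migration persistence guarantee for
     * file-backed NVDIMMs.
     */
    if (ramblock_is_pmem(block)) {
        qemu_ram_msync(block, 0, block->used_length);
    }
}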
[1]
https://patchew.org/QEMU/20191121000843.24844-1-beata.michal...@linaro.org/
[2]
https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01750.html
[3]
https://lists.libreplanet.org/archive/html/qemu-devel/2019-09/msg01772.html
--
Cheers,
David / dhildenb