Hi,

> On Thu, Apr 17, 2025 at 01:05:37PM -0300, Fabiano Rosas wrote:
> > It's not that page faults happen during multifd. The page was already
> > sent during precopy, but multifd-recv didn't write to it, it just marked
> > the receivedmap. When postcopy starts, the page gets accessed and
> > faults. Since postcopy is on, the migration wants to request the page
> > from the source, but it's present in the receivedmap, so it doesn't
> > ask. No page ever comes and the code hangs waiting for the page fault to
> > be serviced (or potentially faults continuously? I'm not sure on the
> > details).
>
> I think your previous analysis is correct on the zero pages.  I am not 100%
> sure if that's the issue but very likely.  I tend to also agree with you
> that we could skip zero page optimization in multifd code when postcopy is
> enabled (maybe plus some comment right above..).

   migration/multifd: solve zero page causing multiple page faults
     -> https://gitlab.com/qemu-project/qemu/-/commit/5ef7e26bdb7eda10d6d5e1b77121be9945e5e550

* Is this the optimization that is causing the migration hang issue?

===
diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
index dbc1184921..00f69ff965 100644
--- a/migration/multifd-zero-page.c
+++ b/migration/multifd-zero-page.c
@@ -85,7 +85,8 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
 {
     for (int i = 0; i < p->zero_num; i++) {
         void *page = p->host + p->zero[i];
-        if (ramblock_recv_bitmap_test_byte_offset(p->block, p->zero[i])) {
+        if (!migrate_postcopy() &&
+            ramblock_recv_bitmap_test_byte_offset(p->block, p->zero[i])) {
             memset(page, 0, multifd_ram_page_size());
         } else {
             ramblock_recv_bitmap_set_offset(p->block, p->zero[i]);
===

* Would the above patch help to resolve it?

* Alternatively, when a page fault occurs during the postcopy phase
and we know (from receivedmap) that the faulted page is a zero page,
maybe we could write it locally on the destination to service the
page fault?
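
To make the second idea concrete, here is a rough toy model of it, not
QEMU code. All names (toy_dest, toy_recv_zero_page, toy_handle_fault)
are made up for illustration. It also assumes the destination keeps a
separate record of which received pages were zero pages; as far as I
can tell the receivedmap alone only says "received", so that extra
record is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NPAGES    4
#define TOY_PAGE_SIZE 16   /* toy value, not the real page size */

/* Hypothetical toy model of the destination; not QEMU data structures. */
struct toy_dest {
    bool received[TOY_NPAGES];   /* stands in for the receivedmap */
    bool is_zero[TOY_NPAGES];    /* ASSUMPTION: zero pages are remembered */
    bool populated[TOY_NPAGES];  /* page was really written locally */
    unsigned char mem[TOY_NPAGES][TOY_PAGE_SIZE];
};

enum toy_fault_result {
    TOY_ALREADY_PRESENT,  /* nothing to do */
    TOY_REQUESTED,        /* page would be requested from the source */
    TOY_FILLED_LOCALLY,   /* zero page written on the destination */
    TOY_STUCK             /* marked received, never written, never requested */
};

/* Precopy over multifd: the zero page is only marked, not written. */
static void toy_recv_zero_page(struct toy_dest *d, int page)
{
    assert(page >= 0 && page < TOY_NPAGES);
    d->received[page] = true;
    d->is_zero[page] = true;
}

/* Postcopy fault on 'page'; fill_zero_locally enables the proposed idea. */
static enum toy_fault_result toy_handle_fault(struct toy_dest *d, int page,
                                              bool fill_zero_locally)
{
    assert(page >= 0 && page < TOY_NPAGES);
    if (d->populated[page]) {
        return TOY_ALREADY_PRESENT;
    }
    if (!d->received[page]) {
        return TOY_REQUESTED;
    }
    if (fill_zero_locally && d->is_zero[page]) {
        memset(d->mem[page], 0, TOY_PAGE_SIZE);
        d->populated[page] = true;
        return TOY_FILLED_LOCALLY;
    }
    /* The hang described above: the page looks received, so nobody asks. */
    return TOY_STUCK;
}
```

With fill_zero_locally off, a fault on a marked-but-unwritten zero page
ends in TOY_STUCK, which is the hang scenario; with it on, the fault is
serviced without involving the source.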

On Thu, 17 Apr 2025 at 21:35, Fabiano Rosas <faro...@suse.de> wrote:
> Maybe there's a bug in the userfaultfd detection? I'll leave it to you, 
> here's the error:
>
> # Running /ppc64/migration/multifd+postcopy/tcp/plain/cancel
> # Using machine type: pseries-10.0
> # starting QEMU: exec ./qemu-system-ppc64 -qtest
> # {
> #     "error": {
> #         "class": "GenericError",
> #         "desc": "Postcopy is not supported: Userfaultfd not available: Function not implemented"
> #     }
> # }

* The error says "Function not implemented". Does the pseries machine
not support userfaultfd?
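
On Linux, "Function not implemented" is the strerror() text for ENOSYS,
so one quick check is whether the userfaultfd syscall is available on
the host where the test runs. Below is a minimal standalone probe,
assuming a Linux host; probe_userfaultfd is a made-up helper name and
this is not how QEMU itself does the detection.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if a userfaultfd descriptor could be
 * created, otherwise the errno value of the failed attempt
 * (ENOSYS means the kernel does not implement the syscall).
 */
static int probe_userfaultfd(void)
{
#ifdef __NR_userfaultfd
    long fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd < 0) {
        return errno;
    }
    close((int)fd);
    return 0;
#else
    /* The build headers do not even define the syscall number. */
    return ENOSYS;
#endif
}
```

A result of ENOSYS would point at the host kernel or its configuration
rather than at the guest machine type; EPERM would instead suggest a
permission restriction on unprivileged use.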

Thank you.
---
  - Prasad

