Hi,

> On Thu, Apr 17, 2025 at 01:05:37PM -0300, Fabiano Rosas wrote:
> > It's not that page faults happen during multifd. The page was already
> > sent during precopy, but multifd-recv didn't write to it, it just marked
> > the receivedmap. When postcopy starts, the page gets accessed and
> > faults. Since postcopy is on, the migration wants to request the page
> > from the source, but it's present in the receivedmap, so it doesn't
> > ask. No page ever comes and the code hangs waiting for the page fault to
> > be serviced (or potentially faults continuously? I'm not sure on the
> > details).
>
> I think your previous analysis is correct on the zero pages. I am not 100%
> sure if that's the issue but very likely. I tend to also agree with you
> that we could skip zero page optimization in multifd code when postcopy is
> enabled (maybe plus some comment right above..).

    migration/multifd: solve zero page causing multiple page faults
     -> https://gitlab.com/qemu-project/qemu/-/commit/5ef7e26bdb7eda10d6d5e1b77121be9945e5e550

* Is this the optimization that is causing the migration hang issue?

===
diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
index dbc1184921..00f69ff965 100644
--- a/migration/multifd-zero-page.c
+++ b/migration/multifd-zero-page.c
@@ -85,7 +85,8 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
 {
     for (int i = 0; i < p->zero_num; i++) {
         void *page = p->host + p->zero[i];
-        if (ramblock_recv_bitmap_test_byte_offset(p->block, p->zero[i])) {
+        if (!migrate_postcopy() &&
+            ramblock_recv_bitmap_test_byte_offset(p->block, p->zero[i])) {
             memset(page, 0, multifd_ram_page_size());
         } else {
             ramblock_recv_bitmap_set_offset(p->block, p->zero[i]);
===

* Would the above patch help to resolve it?

* Another way could be when the page fault occurs during postcopy
  phase, if we know (from receivedmap) that the faulted page is a
  zero-page, maybe we could write it locally on the destination to
  service the page-fault?

On Thu, 17 Apr 2025 at 21:35, Fabiano Rosas <faro...@suse.de> wrote:
> Maybe there's a bug in the userfaultfd detection? I'll leave it to you,
> here's the error:
>
> # Running /ppc64/migration/multifd+postcopy/tcp/plain/cancel
> # Using machine type: pseries-10.0
> # starting QEMU: exec ./qemu-system-ppc64 -qtest
> # {
> #     "error": {
> #         "class": "GenericError",
> #         "desc": "Postcopy is not supported: Userfaultfd not available: Function not implemented"
> #     }
> # }

* It is saying - function not implemented - does the Pseries machine
  not support userfaultfd?

Thank you.
---
  - Prasad