Prasad Pandit <ppan...@redhat.com> writes:

> Hi,
>
>> On Thu, Apr 17, 2025 at 01:05:37PM -0300, Fabiano Rosas wrote:
>> > It's not that page faults happen during multifd. The page was already
>> > sent during precopy, but multifd-recv didn't write to it, it just marked
>> > the receivedmap. When postcopy starts, the page gets accessed and
>> > faults. Since postcopy is on, the migration wants to request the page
>> > from the source, but it's present in the receivedmap, so it doesn't
>> > ask. No page ever comes and the code hangs waiting for the page fault to
>> > be serviced (or potentially faults continuously? I'm not sure on the
>> > details).
>>
>> I think your previous analysis is correct on the zero pages. I am not 100%
>> sure if that's the issue but very likely. I tend to also agree with you
>> that we could skip zero page optimization in multifd code when postcopy is
>> enabled (maybe plus some comment right above..).
>
>   migration/multifd: solve zero page causing multiple page faults
>   -> https://gitlab.com/qemu-project/qemu/-/commit/5ef7e26bdb7eda10d6d5e1b77121be9945e5e550
>
> * Is this the optimization that is causing the migration hang issue?
>
> ===
> diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
> index dbc1184921..00f69ff965 100644
> --- a/migration/multifd-zero-page.c
> +++ b/migration/multifd-zero-page.c
> @@ -85,7 +85,8 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
>  {
>      for (int i = 0; i < p->zero_num; i++) {
>          void *page = p->host + p->zero[i];
> -        if (ramblock_recv_bitmap_test_byte_offset(p->block, p->zero[i])) {
> +        if (!migrate_postcopy() &&
> +            ramblock_recv_bitmap_test_byte_offset(p->block, p->zero[i])) {
>              memset(page, 0, multifd_ram_page_size());
>          } else {
>              ramblock_recv_bitmap_set_offset(p->block, p->zero[i]);
> ===
>
> * Would the above patch help to resolve it?
>
> * Another way could be when the page fault occurs during postcopy
> phase, if we know (from receivedmap) that the faulted page is a
> zero-page, maybe we could write it locally on the destination to
> service the page-fault?
>
> On Thu, 17 Apr 2025 at 21:35, Fabiano Rosas <faro...@suse.de> wrote:
>> Maybe there's a bug in the userfaultfd detection? I'll leave it to you,
>> here's the error:
>>
>> # Running /ppc64/migration/multifd+postcopy/tcp/plain/cancel
>> # Using machine type: pseries-10.0
>> # starting QEMU: exec ./qemu-system-ppc64 -qtest
>> # {
>> #     "error": {
>> #         "class": "GenericError",
>> #         "desc": "Postcopy is not supported: Userfaultfd not available: Function not implemented"
>> #     }
>> # }
>
> * It is saying - function not implemented - does the Pseries machine
> not support userfaultfd?
>

We're missing a check on has_uffd for the multifd+postcopy tests.

> Thank you.
> ---
>   - Prasad