Prasad Pandit <ppan...@redhat.com> writes:

> Hi,
>
>> On Thu, Apr 17, 2025 at 01:05:37PM -0300, Fabiano Rosas wrote:
>> > It's not that page faults happen during multifd. The page was already
>> > sent during precopy, but multifd-recv didn't write to it, it just marked
>> > the receivedmap. When postcopy starts, the page gets accessed and
>> > faults. Since postcopy is on, the migration wants to request the page
>> > from the source, but it's present in the receivedmap, so it doesn't
>> > ask. No page ever comes and the code hangs waiting for the page fault to
>> > be serviced (or potentially faults continuously? I'm not sure on the
>> > details).
>>
>> I think your previous analysis is correct on the zero pages.  I am not 100%
>> sure if that's the issue but very likely.  I tend to also agree with you
>> that we could skip zero page optimization in multifd code when postcopy is
>> enabled (maybe plus some comment right above..).
>
>    migration/multifd: solve zero page causing multiple page faults
>      -> https://gitlab.com/qemu-project/qemu/-/commit/5ef7e26bdb7eda10d6d5e1b77121be9945e5e550
>
> * Is this the optimization that is causing the migration hang issue?
>
> ===
> diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
> index dbc1184921..00f69ff965 100644
> --- a/migration/multifd-zero-page.c
> +++ b/migration/multifd-zero-page.c
> @@ -85,7 +85,8 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
>  {
>      for (int i = 0; i < p->zero_num; i++) {
>          void *page = p->host + p->zero[i];
> -        if (ramblock_recv_bitmap_test_byte_offset(p->block, p->zero[i])) {
> +        if (!migrate_postcopy() &&
> +            ramblock_recv_bitmap_test_byte_offset(p->block, p->zero[i])) {
>              memset(page, 0, multifd_ram_page_size());
>          } else {
>              ramblock_recv_bitmap_set_offset(p->block, p->zero[i]);
> ===
>
> * Would the above patch help to resolve it?
>
> * Another way could be: when a page fault occurs during the postcopy
> phase, if we know (from receivedmap) that the faulted page is a
> zero page, maybe we could write it locally on the destination to
> service the page fault?
>
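
For reference, the failure mode described above can be modelled with a
small standalone sketch. This is not QEMU code: struct toy_dest and all
toy_* names are hypothetical, and the two bool arrays are only stand-ins
for receivedmap and for whether the destination page was ever written.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NPAGES 4

/* Hypothetical model of the destination side, not a QEMU structure. */
struct toy_dest {
    bool received[TOY_NPAGES];  /* stand-in for receivedmap */
    bool populated[TOY_NPAGES]; /* page was actually written locally */
};

/* A normal (non-zero) page arrives during precopy: written and marked. */
static void toy_recv_data_page(struct toy_dest *d, int page)
{
    assert(page >= 0 && page < TOY_NPAGES);
    d->populated[page] = true;
    d->received[page] = true;
}

/*
 * A zero page arrives during precopy, mirroring the pre-patch logic in
 * the quoted diff: write zeros only if the page was received before,
 * otherwise just mark it as received without touching the memory.
 */
static void toy_recv_zero_page(struct toy_dest *d, int page)
{
    assert(page >= 0 && page < TOY_NPAGES);
    if (d->received[page]) {
        d->populated[page] = true;  /* models the memset() */
    } else {
        d->received[page] = true;   /* mark only, no write */
    }
}

/* Postcopy fault path: a page is requested only if not marked received. */
static bool toy_fault_requests_page(const struct toy_dest *d, int page)
{
    assert(page >= 0 && page < TOY_NPAGES);
    return !d->received[page];
}

/* The fault can never be serviced: page absent, yet nothing is requested. */
static bool toy_fault_would_hang(const struct toy_dest *d, int page)
{
    return !d->populated[page] && !toy_fault_requests_page(d, page);
}
```

In this model only a page that was first seen as a zero page ends up
marked received but never written, which is the case where a later
postcopy fault has nobody to service it.
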
> On Thu, 17 Apr 2025 at 21:35, Fabiano Rosas <faro...@suse.de> wrote:
>> Maybe there's a bug in the userfaultfd detection? I'll leave it to you, 
>> here's the error:
>>
>> # Running /ppc64/migration/multifd+postcopy/tcp/plain/cancel
>> # Using machine type: pseries-10.0
>> # starting QEMU: exec ./qemu-system-ppc64 -qtest
>> # {
>> #     "error": {
>> #         "class": "GenericError",
>> #         "desc": "Postcopy is not supported: Userfaultfd not available: Function not implemented"
>> #     }
>> # }
>
> * It says "Function not implemented". Does the pseries machine
> not support userfaultfd?
>

We're missing a check on has_uffd for the multifd+postcopy tests.
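
As a rough illustration of the kind of probe such a check relies on,
here is a standalone sketch (not the actual qtest helper; the function
name probe_userfaultfd is hypothetical). It only tries to open a
userfaultfd via the raw Linux syscall and reports errno on failure. A
kernel without the syscall fails with ENOSYS, which strerror() renders
as "Function not implemented", matching the message in the log above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns true if a userfaultfd can be opened.
 * On failure, *err holds the errno value (ENOSYS when the syscall is
 * not implemented, EPERM when it is restricted, and so on).
 */
static bool probe_userfaultfd(int *err)
{
    assert(err != NULL);
#ifdef __NR_userfaultfd
    int fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd >= 0) {
        close(fd);
        *err = 0;
        return true;
    }
    *err = errno;
#else
    /* Built against headers that do not define the syscall at all. */
    *err = ENOSYS;
#endif
    return false;
}
```

A test could call something like this up front and skip the
multifd+postcopy cases when it returns false, instead of failing when
the postcopy capability cannot be set.
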

> Thank you.
> ---
>   - Prasad
