Prasad Pandit <ppan...@redhat.com> writes:

> From: Prasad Pandit <p...@fedoraproject.org>
>
>  Hello,
>
>
> * This series (v9) does minor refactoring and reordering changes as
>   suggested in the review of earlier series (v8). Also tried to
>   reproduce/debug a qtest hang issue, but it could not be reproduced.
>   From the shared stack traces it looked like Postcopy thread was
>   preparing to finish before migrating all the pages.

The issue is that a zero page is being migrated by multifd but there's
an optimization in place that skips faulting the page in on the
destination. Later, during postcopy, when the page is found to be
missing, postcopy (@migrate_send_rp_req_pages) believes the page is
already present because the receivedmap bit for that pfn is set. No
request goes out, and the code accessing the guest memory just sits
there waiting for the page.

It seems your series has a logical conflict with this work that was done
a while back:

https://lore.kernel.org/all/20240401154110.2028453-1-yuan1....@intel.com/

The usage of receivedmap for multifd was supposed to be mutually
exclusive with postcopy. Take a look at the description of that series
and at postcopy_place_page_zero(). We need to figure out what needs to
change and how to do that compatibly. It might just be the case of
memsetting the zero page always for postcopy, but I haven't thought too
much about it.
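
To make the conflict concrete, here is a toy model of the interaction,
not QEMU code. ToyDest, toy_multifd_recv_zero_page(), toy_postcopy_fault()
and the always_place flag are all made up for illustration; the only
things taken from the discussion above are the receivedmap idea and the
"always place the zero page for postcopy" option, which is still just a
guess at a fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Toy destination state; these are NOT QEMU's real data structures. */
typedef struct {
    bool receivedmap[NPAGES]; /* "page already received" bitmap */
    bool populated[NPAGES];   /* page is actually backed by memory */
    bool requested[NPAGES];   /* page was requested from the source */
} ToyDest;

/*
 * Multifd receives a zero page. With the optimization, only the
 * receivedmap bit is set and the page is not faulted in. With
 * always_place set (the hypothetical fix), the page is populated too.
 */
static void toy_multifd_recv_zero_page(ToyDest *d, size_t pfn,
                                       bool always_place)
{
    assert(pfn < NPAGES);
    d->receivedmap[pfn] = true;
    if (always_place) {
        d->populated[pfn] = true;
    }
}

/*
 * Postcopy-time access to a page. Returns true if the access can make
 * progress, false if it would wait forever.
 */
static bool toy_postcopy_fault(ToyDest *d, size_t pfn)
{
    assert(pfn < NPAGES);
    if (d->populated[pfn]) {
        return true; /* page is there, no fault to resolve */
    }
    if (d->receivedmap[pfn]) {
        /* Believed present: no request is sent, nobody places it. */
        return false;
    }
    d->requested[pfn] = true; /* ask the source; it will send the page */
    return true;
}
```

In this model, a zero page received with always_place == false makes a
later toy_postcopy_fault() on the same pfn return false, which is the
hang. With always_place == true, or for a page multifd never touched,
the access makes progress.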

There are also other issues with the series:

https://gitlab.com/farosas/qemu/-/pipelines/1770488059

The CI workers don't support userfaultfd so the tests need to check for
that properly. We have MigrationTestEnv::has_uffd for that.
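
Roughly the shape of the check I mean, as a standalone sketch. ToyTestEnv,
ToyTest and toy_register_tests() are stand-ins invented here, not the
qtest API; only the has_uffd field name comes from MigrationTestEnv.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for MigrationTestEnv; only has_uffd is taken from the text. */
typedef struct {
    bool has_uffd;
} ToyTestEnv;

typedef void (*ToyTestFn)(void);

/* Hypothetical test descriptor. */
typedef struct {
    const char *path;
    ToyTestFn fn;
    bool needs_uffd; /* test relies on userfaultfd (i.e. postcopy) */
} ToyTest;

/*
 * Walk the test list and skip anything that needs userfaultfd when the
 * host does not provide it. Returns how many tests would be registered.
 */
static size_t toy_register_tests(const ToyTestEnv *env,
                                 const ToyTest *tests, size_t n)
{
    size_t registered = 0;

    assert(env != NULL);
    for (size_t i = 0; i < n; i++) {
        if (tests[i].needs_uffd && !env->has_uffd) {
            continue; /* e.g. CI workers without userfaultfd */
        }
        registered++; /* a real harness would add the test here */
    }
    return registered;
}
```

The point is only that any test combining multifd with postcopy gets
gated on has_uffd the same way the existing postcopy tests are.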

Lastly, I have seen some weirdness with TLS channel disconnections
leading to asserts in qio_channel_shutdown() in my testing. I'll get a
better look at those tomorrow.
