Fabiano Rosas <faro...@suse.de> writes:

> Prasad Pandit <ppan...@redhat.com> writes:
>
>> From: Prasad Pandit <p...@fedoraproject.org>
>>
>>  Hello,
>>
>>
>> * This series (v9) does minor refactoring and reordering changes as
>>   suggested in the review of earlier series (v8). Also tried to
>>   reproduce/debug a qtest hang issue, but it could not be reproduced.
>>   From the shared stack traces it looked like Postcopy thread was
>>   preparing to finish before migrating all the pages.
>
> The issue is that a zero page is being migrated by multifd but there's
> an optimization in place that skips faulting the page in on the
> destination. Later during postcopy when the page is found to be missing,
> postcopy (@migrate_send_rp_req_pages) believes the page is already
> present due to the receivedmap for that pfn being set and thus the code
> accessing the guest memory just sits there waiting for the page.
>
> It seems your series has a logical conflict with this work that was done
> a while back:
>
> https://lore.kernel.org/all/20240401154110.2028453-1-yuan1....@intel.com/
>
> The usage of receivedmap for multifd was supposed to be mutually
> exclusive with postcopy. Take a look at the description of that series
> and at postcopy_place_page_zero(). We need to figure out what needs to
> change and how to do that compatibly. It might just be the case of
> memsetting the zero page always for postcopy, but I haven't thought too
> much about it.
>
> There are also other issues with the series:
>
> https://gitlab.com/farosas/qemu/-/pipelines/1770488059
>
> The CI workers don't support userfaultfd so the tests need to check for
> that properly. We have MigrationTestEnv::has_uffd for that.
>
> Lastly, I have seen some weirdness with TLS channel disconnections
> leading to asserts in qio_channel_shutdown() in my testing. I'll get a
> better look at those tomorrow.

Ok, you can ignore this last paragraph. I was seeing the postcopy
recovery test disconnect messages, those are benign.
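
For reference, the receivedmap conflict described in the quoted text can
be modelled with a small toy sketch. This is not QEMU code: the struct
and every toy_* name are hypothetical, and the "always populate zero
pages when postcopy is enabled" branch is only the possible fix floated
above, not a tested change.

```c
#include <stdbool.h>

#define TOY_NPAGES 8

/* Toy model of the destination side; all names here are hypothetical. */
struct toy_dest {
    bool receivedmap[TOY_NPAGES]; /* page is marked as received */
    bool populated[TOY_NPAGES];   /* page was actually faulted in */
};

/* multifd receives a zero page during the precopy phase. */
static void toy_multifd_recv_zero_page(struct toy_dest *d, int pfn,
                                       bool postcopy_enabled)
{
    d->receivedmap[pfn] = true;
    if (postcopy_enabled) {
        /* Candidate fix (assumption): always write the zeros. */
        d->populated[pfn] = true;
    }
    /* Otherwise the optimization leaves the page untouched. */
}

/*
 * A guest access faults on pfn during postcopy.
 * Returns true if the faulting thread can make progress,
 * false if it would wait forever for a page nobody will send.
 */
static bool toy_postcopy_fault(struct toy_dest *d, int pfn)
{
    if (d->populated[pfn]) {
        return true;
    }
    if (d->receivedmap[pfn]) {
        /* Page is believed present, so no request goes to the source. */
        return false;
    }
    /* Normal postcopy path: request the page and place it. */
    d->receivedmap[pfn] = true;
    d->populated[pfn] = true;
    return true;
}
```

In this model the hang corresponds to toy_postcopy_fault() returning
false: the bit is set but the page was never placed.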
