Fabiano Rosas <faro...@suse.de> writes:

> Prasad Pandit <ppan...@redhat.com> writes:
>
>> From: Prasad Pandit <p...@fedoraproject.org>
>>
>> Hello,
>>
>>
>> * This series (v9) does minor refactoring and reordering changes as
>>   suggested in the review of earlier series (v8). Also tried to
>>   reproduce/debug a qtest hang issue, but it could not be reproduced.
>>   From the shared stack traces it looked like Postcopy thread was
>>   preparing to finish before migrating all the pages.
>
> The issue is that a zero page is being migrated by multifd but there's
> an optimization in place that skips faulting the page in on the
> destination. Later during postcopy when the page is found to be missing,
> postcopy (@migrate_send_rp_req_pages) believes the page is already
> present due to the receivedmap for that pfn being set and thus the code
> accessing the guest memory just sits there waiting for the page.
>
> It seems your series has a logical conflict with this work that was done
> a while back:
>
> https://lore.kernel.org/all/20240401154110.2028453-1-yuan1....@intel.com/
>
> The usage of receivedmap for multifd was supposed to be mutually
> exclusive with postcopy. Take a look at the description of that series
> and at postcopy_place_page_zero(). We need to figure out what needs to
> change and how to do that compatibly. It might just be the case of
> memsetting the zero page always for postcopy, but I haven't thought too
> much about it.
>
> There are also other issues with the series:
>
> https://gitlab.com/farosas/qemu/-/pipelines/1770488059
>
> The CI workers don't support userfaultfd so the tests need to check for
> that properly. We have MigrationTestEnv::has_uffd for that.
>
> Lastly, I have seen some weirdness with TLS channel disconnections
> leading to asserts in qio_channel_shutdown() in my testing. I'll get a
> better look at those tomorrow.

OK, you can ignore this last paragraph. I was seeing the disconnect
messages from the postcopy recovery test; those are benign.
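
For illustration only, the receivedmap conflict described in the quoted
text can be modelled with a toy sketch. Nothing below is QEMU code: the
struct toy_dest and the toy_* helpers are hypothetical names, and the
always_memset flag merely stands in for the "memset the zero page always
for postcopy" idea, which has not been evaluated against the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NPAGES    8
#define TOY_PAGE_SIZE 4096

/* Toy destination state (hypothetical, not QEMU's data structures). */
struct toy_dest {
    bool receivedmap[TOY_NPAGES];  /* "this pfn has arrived" bitmap */
    bool populated[TOY_NPAGES];    /* page is really present in guest RAM */
    unsigned char ram[TOY_NPAGES][TOY_PAGE_SIZE];
};

/*
 * Model of multifd receiving a zero page. With always_memset == false
 * the page is only marked in receivedmap and never touched, which
 * models the optimization that skips faulting the page in.
 */
static void toy_multifd_recv_zero(struct toy_dest *d, size_t pfn,
                                  bool always_memset)
{
    assert(pfn < TOY_NPAGES);
    if (always_memset) {
        memset(d->ram[pfn], 0, TOY_PAGE_SIZE);
        d->populated[pfn] = true;
    }
    d->receivedmap[pfn] = true;
}

/* Model of the postcopy side: request a page only if not yet received. */
static bool toy_postcopy_would_request(const struct toy_dest *d, size_t pfn)
{
    assert(pfn < TOY_NPAGES);
    return !d->receivedmap[pfn];
}

/*
 * An access hangs when the page is missing and postcopy will not ask
 * the source for it because receivedmap says it is already there.
 */
static bool toy_fault_hangs(const struct toy_dest *d, size_t pfn)
{
    return !d->populated[pfn] && !toy_postcopy_would_request(d, pfn);
}
```

In this model, receiving pfn 3 with always_memset == false makes
toy_fault_hangs() return true for pfn 3, while receiving it with
always_memset == true does not. A pfn that was never received does not
hang either, because postcopy would still request it from the source.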