Hi,

On Wed, 16 Apr 2025 at 18:29, Fabiano Rosas <faro...@suse.de> wrote:
> > The issue is that a zero page is being migrated by multifd but there's
> > an optimization in place that skips faulting the page in on the
> > destination. Later during postcopy when the page is found to be missing,
> > postcopy (@migrate_send_rp_req_pages) believes the page is already
> > present due to the receivedmap for that pfn being set and thus the code
> > accessing the guest memory just sits there waiting for the page.
> >
> > It seems your series has a logical conflict with this work that was done
> > a while back:
> >
> > https://lore.kernel.org/all/20240401154110.2028453-1-yuan1....@intel.com/
> >
> > The usage of receivedmap for multifd was supposed to be mutually
> > exclusive with postcopy. Take a look at the description of that series
> > and at postcopy_place_page_zero(). We need to figure out what needs to
> > change and how to do that compatibly. It might just be the case of
> > memsetting the zero page always for postcopy, but I havent't thought too
> > much about it.
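* To check my understanding of the sequence described above, here is a toy model of it in C. This is only a sketch, not QEMU code: every name in it (toy_ram, toy_recv_zero_page, toy_fault_resolves, the TOY_* constants) is hypothetical, and the 'always_memset' flag stands in for the "memset the zero page always for postcopy" idea, which is an untested assumption on my part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names exist in QEMU. */
enum { TOY_NPAGES = 8, TOY_PAGE_SIZE = 16 };

struct toy_ram {
    bool received[TOY_NPAGES];   /* stands in for the receivedmap bit per pfn */
    bool populated[TOY_NPAGES];  /* page is really backed on the destination */
    unsigned char data[TOY_NPAGES][TOY_PAGE_SIZE];
};

/*
 * Multifd-style zero page receive. With always_memset == false the page is
 * only marked as received (the optimization that skips faulting it in).
 * With always_memset == true the page is also written, so it becomes backed.
 */
static void toy_recv_zero_page(struct toy_ram *r, size_t pfn, bool always_memset)
{
    assert(pfn < TOY_NPAGES);
    if (always_memset) {
        memset(r->data[pfn], 0, TOY_PAGE_SIZE);
        r->populated[pfn] = true;
    }
    r->received[pfn] = true;
}

/*
 * Postcopy-style access to a page. Returns true if the access can complete,
 * false if it would wait forever.
 */
static bool toy_fault_resolves(const struct toy_ram *r, size_t pfn)
{
    assert(pfn < TOY_NPAGES);
    if (r->populated[pfn]) {
        return true;   /* page is backed, no fault happens */
    }
    if (!r->received[pfn]) {
        return true;   /* page is missing and gets requested from the source */
    }
    return false;      /* missing, but believed present: never requested */
}
```

In this model, a zero-initialized toy_ram with toy_recv_zero_page(&r, 3, false) makes toy_fault_resolves(&r, 3) return false, which is the hang as I understand it; passing true instead makes the access complete.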
===
$ grep -i avx /proc/cpuinfo
flags : avx avx2 avx512f avx512dq avx512ifma avx512cd avx512bw avx512vl avx512vbmi avx512_vbmi2 avx512_vnni avx512_bitalg avx512_vpopcntdq avx512_vp2intersect
$
$ ./configure --enable-kvm --enable-avx512bw --enable-avx2 --disable-docs --target-list='x86_64-softmmu'
$ make -sj10 check-qtest
  67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test    OK    193.80s    81 subtests passed
===

* One of my machines does support 'avx*' instructions. QEMU is configured and built with 'avx2' and 'avx512bw' support. Still, the migration tests run fine, with no hang observed. I'm not sure why the hang does not reproduce on my side. How do you generally build QEMU to run these tests? Does this issue require some specific h/w setup/support?

* I'm not sure how/why page faults happen during the Multifd phase, when the guest on the destination is not running. If 'receivedmap' says that a page is present, code accessing guest memory should just access whatever is available/present in that space, without waiting. I'll try to see what zero pages do, how page faults occur during postcopy and how they are serviced. Let's see.

* Another suggestion: maybe we should review and pull at least the refactoring patches, so that we don't have to redo them in the next revisions. We can hold back the "enable multifd and postcopy together" patch that causes this guest hang issue to surface.

> > There's also other issues with the series:
> >
> > https://gitlab.com/farosas/qemu/-/pipelines/1770488059
> >
> > The CI workers don't support userfaultfd so the tests need to check for
> > that properly. We have MigrationTestEnv::has_uffd for that.
> >
> > Lastly, I have seem some weirdness with TLS channels disconnections
> > leading to asserts in qio_channel_shutdown() in my testing. I'll get a
> > better look at those tomorrow.
>
> Ok, you can ignore this last paragraph. I was seeing the postcopy
> recovery test disconnect messages, those are benign.

* i.e. ignore everything after "There's also other issues with the series:"? Or just the last one, "...with TLS channels..."? Postcopy tests are added only if the (env->has_uffd) check returns true.

Thank you.
---
  - Prasad