Prasad Pandit <ppan...@redhat.com> writes:

> Hi,
>
> On Wed, 16 Apr 2025 at 18:29, Fabiano Rosas <faro...@suse.de> wrote:
>> > The issue is that a zero page is being migrated by multifd but there's
>> > an optimization in place that skips faulting the page in on the
>> > destination. Later during postcopy when the page is found to be missing,
>> > postcopy (@migrate_send_rp_req_pages) believes the page is already
>> > present due to the receivedmap for that pfn being set and thus the code
>> > accessing the guest memory just sits there waiting for the page.
>> >
>> > It seems your series has a logical conflict with this work that was done
>> > a while back:
>> >
>> > https://lore.kernel.org/all/20240401154110.2028453-1-yuan1....@intel.com/
>> >
>> > The usage of receivedmap for multifd was supposed to be mutually
>> > exclusive with postcopy. Take a look at the description of that series
>> > and at postcopy_place_page_zero(). We need to figure out what needs to
>> > change and how to do that compatibly. It might just be the case of
>> > memsetting the zero page always for postcopy, but I haven't thought too
>> > much about it.
>
> ===
> $ grep -i avx /proc/cpuinfo
> flags : avx avx2 avx512f avx512dq avx512ifma avx512cd avx512bw
> avx512vl avx512vbmi avx512_vbmi2 avx512_vnni avx512_bitalg
> avx512_vpopcntdq avx512_vp2intersect
> $
> $ ./configure --enable-kvm --enable-avx512bw --enable-avx2
> --disable-docs --target-list='x86_64-softmmu'
> $ make -sj10 check-qtest
> 67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test
> OK 193.80s 81 subtests passed
> ===
>
> * One of my machines does seem to support 'avx*' instructions. QEMU is
> configured and built with the 'avx2' and 'avx512bw' support. Still
> migration-tests run fine, without any hang issue observed. Not sure
> why the hang issue is not reproducing on my side. How do you generally
> build QEMU to run these tests? Does this issue require some specific
> h/w setup/support?
>

There's nothing unusual here that I know of. Configure line is just
--target-list=x86_64-softmmu --enable-debug --disable-docs
--disable-plugins.

> * Not sure how/why page faults happen during the Multifd phase when
> the guest on the destination is not running. If 'receivedmap' says
> that page is present, code accessing guest memory should just access
> whatever is available/present in that space, without waiting. I'll try
> to see what zero pages do, how page-faults occur during postcopy and
> how they are serviced. Let's see..

It's not that page faults happen during multifd. The page was already
sent during precopy, but multifd-recv didn't write to it; it just marked
the receivedmap. When postcopy starts, the page gets accessed and
faults. Since postcopy is on, the migration wants to request the page
from the source, but it's present in the receivedmap, so it doesn't
ask. No page ever comes and the code hangs waiting for the page fault to
be serviced (or potentially faults continuously? I'm not sure on the
details).

>
> * Another suggestion is, maybe we should review and pull at least the
> refactoring patches so that in the next revisions we don't have to
> redo them. We can hold back the "enable multifd and postcopy together"
> patch that causes this guest hang issue to surface.
>

That's reasonable. But I won't be available for the next two weeks.
Peter is going to be back in the meantime, let's hear what he has to say
about this postcopy issue. I'll provide my r-bs.

>> > There's also other issues with the series:
>> >
>> > https://gitlab.com/farosas/qemu/-/pipelines/1770488059
>> >
>> > The CI workers don't support userfaultfd so the tests need to check for
>> > that properly. We have MigrationTestEnv::has_uffd for that.
>> >
>> > Lastly, I have seen some weirdness with TLS channels disconnections
>> > leading to asserts in qio_channel_shutdown() in my testing. I'll get a
>> > better look at those tomorrow.
>>
>> Ok, you can ignore this last paragraph.
>> I was seeing the postcopy
>> recovery test disconnect messages, those are benign.
>
> * i.e. ignore everything after - "There's also other issues with this
> series: " ? OR just the last one " ...with TLS channels..." ??
> Postcopy tests are added only if (env->has_uffd) check returns true.
>

Only the TLS part. The CI is failing with just this series. I didn't
change anything there. Maybe there's a bug in the userfaultfd detection?
I'll leave it to you, here's the error:

# Running /ppc64/migration/multifd+postcopy/tcp/plain/cancel
# Using machine type: pseries-10.0
# starting QEMU: exec ./qemu-system-ppc64 -qtest
# unix:/tmp/qtest-1305.sock -qtest-log /dev/null -chardev
# socket,path=/tmp/qtest-1305.qmp,id=char0 -mon
# chardev=char0,mode=control -display none -audio none -accel kvm -accel
# tcg -machine pseries-10.0,vsmt=8 -name source,debug-threads=on -m 256M
# -serial file:/tmp/migration-test-X0SO42/src_serial -nodefaults
# -machine
# cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off,
# -bios /tmp/migration-test-X0SO42/bootsect 2>/dev/null -accel qtest
# starting QEMU: exec ./qemu-system-ppc64 -qtest
# unix:/tmp/qtest-1305.sock -qtest-log /dev/null -chardev
# socket,path=/tmp/qtest-1305.qmp,id=char0 -mon
# chardev=char0,mode=control -display none -audio none -accel kvm -accel
# tcg -machine pseries-10.0,vsmt=8 -name target,debug-threads=on -m 256M
# -serial file:/tmp/migration-test-X0SO42/dest_serial -incoming defer
# -nodefaults -machine
# cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off,
# -bios /tmp/migration-test-X0SO42/bootsect 2>/dev/null -accel qtest
# {
#     "error": {
#         "class": "GenericError",
#         "desc": "Postcopy is not supported: Userfaultfd not available: Function not implemented"
#     }
# }
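
For reference, the zero-page hang discussed earlier in this message can be
written down as a toy model. This is only a sketch of the sequence as I
understand it, not QEMU code: the class, its method names and the
populate_zero_pages switch are all made up for illustration, and the
"hang" outcome stands in for whatever the faulting thread really does
(I'm not sure of the details). The switch corresponds to the untested
idea of always memsetting zero pages when postcopy is enabled.

```python
from dataclasses import dataclass, field


@dataclass
class DestinationModel:
    """Toy model of the destination side. Hypothetical, not QEMU code."""

    # Hypothetical switch: write (memset) zero pages on receive instead
    # of skipping them.
    populate_zero_pages: bool = False
    receivedmap: set = field(default_factory=set)
    populated: set = field(default_factory=set)

    def multifd_recv(self, pfn: int, is_zero: bool) -> None:
        # Precopy: every page received is recorded in the receivedmap.
        self.receivedmap.add(pfn)
        # The optimization skips touching zero pages, so they are never
        # faulted in on the destination unless the switch is on.
        if not is_zero or self.populate_zero_pages:
            self.populated.add(pfn)

    def postcopy_access(self, pfn: int) -> str:
        """Describe what happens when pfn is accessed during postcopy."""
        if pfn in self.populated:
            return "no fault"
        if pfn in self.receivedmap:
            # The page is believed to be present, so no request is sent
            # to the source and the fault is never serviced.
            return "hang"
        return "request page from source"


dst = DestinationModel()
dst.multifd_recv(0, is_zero=False)  # normal page, written during precopy
dst.multifd_recv(1, is_zero=True)   # zero page, only marked as received
for pfn in range(3):
    print(pfn, dst.postcopy_access(pfn))
```

With the switch off, pfn 1 is the problematic case: marked as received
but never populated. With populate_zero_pages=True the same page is
already present when postcopy starts, at the cost of giving up the
zero-page optimization in that configuration.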