Hi,

On Wed, 16 Apr 2025 at 18:29, Fabiano Rosas <faro...@suse.de> wrote:
> > The issue is that a zero page is being migrated by multifd but there's
> > an optimization in place that skips faulting the page in on the
> > destination. Later during postcopy when the page is found to be missing,
> > postcopy (@migrate_send_rp_req_pages) believes the page is already
> > present due to the receivedmap for that pfn being set and thus the code
> > accessing the guest memory just sits there waiting for the page.
> >
> > It seems your series has a logical conflict with this work that was done
> > a while back:
> >
> > https://lore.kernel.org/all/20240401154110.2028453-1-yuan1....@intel.com/
> >
> > The usage of receivedmap for multifd was supposed to be mutually
> > exclusive with postcopy. Take a look at the description of that series
> > and at postcopy_place_page_zero(). We need to figure out what needs to
> > change and how to do that compatibly. It might just be the case of
> > memsetting the zero page always for postcopy, but I haven't thought too
> > much about it.

===
$ grep -i avx /proc/cpuinfo
flags        : avx avx2 avx512f avx512dq avx512ifma avx512cd avx512bw
avx512vl avx512vbmi avx512_vbmi2 avx512_vnni avx512_bitalg
avx512_vpopcntdq avx512_vp2intersect
$
$ ./configure --enable-kvm --enable-avx512bw --enable-avx2 \
    --disable-docs --target-list='x86_64-softmmu'
$ make -sj10 check-qtest
67/67 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test
     OK             193.80s   81 subtests passed
===

* One of my machines does support 'avx*' instructions, and QEMU is
configured and built with 'avx2' and 'avx512bw' support. Still, the
migration tests run fine and I observe no hang. I'm not sure why the
hang does not reproduce on my side. How do you generally build QEMU to
run these tests? Does this issue require some specific h/w
setup/support?

* I'm not sure how/why page faults happen during the Multifd phase,
when the guest on the destination is not running. If 'receivedmap'
says that a page is present, code accessing guest memory should just
access whatever is available/present in that space, without waiting.
I'll try to see what zero pages do, how page faults occur during
postcopy and how they are serviced. Let's see.
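
To make the scenario from the quoted description concrete, here is a
toy model in C of how I read it. It is only a sketch under stated
assumptions: the names (toy_ram, toy_multifd_recv_zero_page,
toy_postcopy_fault) and the 'always_fill' switch are hypothetical and
are not QEMU code. It only models the "bit set in receivedmap but page
never faulted in" state and the "memset the zero page always" idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical toy state; not QEMU's real data structures. */
struct toy_ram {
    bool received[NPAGES];   /* stands in for the receivedmap bits */
    bool populated[NPAGES];  /* true once the page is actually mapped */
};

/*
 * Multifd receives a zero page. With the skip optimization only the
 * bitmap is updated; the page itself is never touched. 'always_fill'
 * models the suggested "memset the zero page always" behaviour.
 */
static void toy_multifd_recv_zero_page(struct toy_ram *ram, size_t pfn,
                                       bool always_fill)
{
    assert(pfn < NPAGES);
    ram->received[pfn] = true;
    if (always_fill) {
        ram->populated[pfn] = true;
    }
}

/*
 * Access to a page during postcopy. Returns true if the access can
 * complete, false if it would wait forever.
 */
static bool toy_postcopy_fault(struct toy_ram *ram, size_t pfn)
{
    assert(pfn < NPAGES);
    if (ram->populated[pfn]) {
        return true;            /* page is mapped, no fault */
    }
    if (ram->received[pfn]) {
        /* Believed present: no request goes to the source. */
        return false;
    }
    /* Normal path: request the page; model it arriving. */
    ram->received[pfn] = true;
    ram->populated[pfn] = true;
    return true;
}
```

In this model an access to a skipped zero page blocks, while it
completes if the page was filled on receive or was never marked as
received in the first place.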

* Another suggestion is, maybe we should review and pull at least the
refactoring patches so that in the next revisions we don't have to
redo them. We can hold back the "enable multifd and postcopy together"
patch that causes this guest hang issue to surface.

> > There's also other issues with the series:
> >
> > https://gitlab.com/farosas/qemu/-/pipelines/1770488059
> >
> > The CI workers don't support userfaultfd so the tests need to check for
> > that properly. We have MigrationTestEnv::has_uffd for that.
> >
> > Lastly, I have seen some weirdness with TLS channels disconnections
> > leading to asserts in qio_channel_shutdown() in my testing. I'll get a
> > better look at those tomorrow.
>
> Ok, you can ignore this last paragraph. I was seeing the postcopy
> recovery test disconnect messages, those are benign.

* i.e. should I ignore everything after "There's also other issues
with the series:", or just the last paragraph, "...with TLS
channels..."? Postcopy tests are added only if the (env->has_uffd)
check returns true.

Thank you.
---
  - Prasad

