Hello Fabiano,

On Thu, 3 Apr 2025 at 18:41, Fabiano Rosas <faro...@suse.de> wrote:
> Prasad Pandit <ppan...@redhat.com> writes:
> > * Thank you for the reproducer and traces. I'll try to check more and
> > see if I'm able to reproduce it on my side.
>
> Thanks. I cannot merge this series until that issue is resolved. If it
> reproduces on my machine there's a high chance that it will break CI at
> some point and then it'll be a nightmare to debug. This has happened
> many times before with multifd.

===
qemu/build)$ for i in $(seq 1 9999); do echo "$i ====";
QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test
--full -r '/x86_64/migration/postcopy/multifd/plain' || break; done |
tee /tmp/migration-test.out | awk -e '/====/ { printf ("%s ", $_) };
/slow test/ { printf("%s\n", $_); }'

Host-1]
...
9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.51 secs
9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.47 secs
9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.42 secs
9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.56 secs
9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.44 secs
9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.43 secs
9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.45 secs
9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.53 secs
9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.46 secs
9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.49 secs
9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.48 secs
9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.47 secs
9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.45 secs
9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.47 secs
9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.41 secs
9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.42 secs
9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.58 secs
9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.45 secs
9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.51 secs
9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.51 secs
--------
Iter: 9999, low: 1.35, high: 1.73, avg: 1.47 secs


Host-2]
...
9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.45 secs
9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.69 secs
9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.41 secs
9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.54 secs
9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.45 secs
9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.44 secs
9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.48 secs
9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.48 secs
9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.44 secs
9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.51 secs
9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.37 secs
9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.48 secs
9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.51 secs
9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.47 secs
9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.47 secs
9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.45 secs
9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.53 secs
9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.48 secs
9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.47 secs
9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.48 secs
--------
Iter: 9999, low: 1.34, high: 1.82, avg: 1.48 secs


Host-3]
...
9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.50 secs
9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.55 secs
9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.54 secs
9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.49 secs
9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.49 secs
9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.52 secs
9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.48 secs
9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.52 secs
9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.54 secs
9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.51 secs
9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.51 secs
9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.50 secs
9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.53 secs
9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.50 secs
9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.53 secs
9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.49 secs
9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.48 secs
9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.54 secs
9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.44 secs
9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
executed in 1.54 secs
--------
Iter: 9999, low: 1.31, high: 2.49, avg: 1.48
===

* I tried to reproduce the hang issue with and without -traces across
3 different machines but am unable to reproduce it on my side.

* Going through the source and the back trace you provided, you said
gdb hangs in the postcopy_ram_fault_thread() function at poll()
function to wait for a missing page.
   - But by this time, postcopy_ram_listen thread is already preparing
to cleanup and exit
       - That means postcopy migration is finished/ending
   - ie. postcopy migration is ending without (or before) migrating
all the RAM pages from the source side?

In postcopy mode:
    * Is there a way to log the pages (#numers) that are sent from the
source side?
    * And log the pages (#numbers) that are received on the receive side?

* That way we might be able to check/confirm the pages which were not
received or not processed properly.

* Can we connect the faulting/missing (HVA=0x7fffd68cc000)
address/page in postcopy_ram_fault_thread() with the memcpy that the
main thread seems to be loading via vapic_post_load()? ie. the main
thread and pocyopy_ram_fault_thread() above could be doing unrelated
things.

* Other than this, I've revised the patch-set as suggested. How do we
proceed further?

Thank you.
---
  - Prasad


Reply via email to