Hello Fabiano, On Thu, 3 Apr 2025 at 18:41, Fabiano Rosas <faro...@suse.de> wrote: > Prasad Pandit <ppan...@redhat.com> writes: > > * Thank you for the reproducer and traces. I'll try to check more and > > see if I'm able to reproduce it on my side. > > Thanks. I cannot merge this series until that issue is resolved. If it > reproduces on my machine there's a high chance that it will break CI at > some point and then it'll be a nightmare to debug. This has happened > many times before with multifd.
=== qemu/build)$ for i in $(seq 1 9999); do echo "$i ===="; QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test --full -r '/x86_64/migration/postcopy/multifd/plain' || break; done | tee /tmp/migration-test.out | awk -e '/====/ { printf ("%s ", $_) }; /slow test/ { printf("%s\n", $_); }' Host-1] ... 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.51 secs 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.47 secs 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.42 secs 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.56 secs 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.44 secs 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.43 secs 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.45 secs 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.53 secs 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.46 secs 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.49 secs 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.48 secs 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.47 secs 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.45 secs 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.47 secs 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.41 secs 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.42 secs 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.58 secs 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.45 secs 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.51 secs 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.51 secs -------- Iter: 9999, low: 1.35, high: 1.73, avg: 1.47 secs Host-2] ... 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.45 secs 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.69 secs 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.41 secs 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.54 secs 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.45 secs 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.44 secs 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.48 secs 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.48 secs 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.44 secs 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.51 secs 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.37 secs 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.48 secs 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.51 secs 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.47 secs 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.47 secs 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.45 secs 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.53 secs 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.48 secs 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.47 secs 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.48 secs -------- Iter: 9999, low: 1.34, high: 1.82, avg: 1.48 secs Host-3] ... 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.50 secs 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.55 secs 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.54 secs 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.49 secs 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.49 secs 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.52 secs 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.48 secs 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.52 secs 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.54 secs 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.51 secs 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.51 secs 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.50 secs 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.53 secs 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.50 secs 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.53 secs 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.49 secs 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.48 secs 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.54 secs 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.44 secs 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.54 secs -------- Iter: 9999, low: 1.31, high: 2.49, avg: 1.48 === * I tried to reproduce the hang issue with and without -traces across 3 different machines but am unable to reproduce it on my side. * Going through the source and the back trace you provided, you said gdb hangs in the postcopy_ram_fault_thread() function at poll() function to wait for a missing page. - But by this time, postcopy_ram_listen thread is already preparing to cleanup and exit - That means postcopy migration is finished/ending - ie. postcopy migration is ending without (or before) migrating all the RAM pages from the source side? In postcopy mode: * Is there a way to log the pages (#numers) that are sent from the source side? * And log the pages (#numbers) that are received on the receive side? * That way we might be able to check/confirm the pages which were not received or not processed properly. * Can we connect the faulting/missing (HVA=0x7fffd68cc000) address/page in postcopy_ram_fault_thread() with the memcpy that the main thread seems to be loading via vapic_post_load()? ie. the main thread and pocyopy_ram_fault_thread() above could be doing unrelated things. * Other than this, I've revised the patch-set as suggested. How do we proceed further? Thank you. --- - Prasad