Fabiano Rosas <faro...@suse.de> writes: > Peter Xu <pet...@redhat.com> writes: > >> On Thu, Oct 05, 2023 at 06:10:20PM -0300, Fabiano Rosas wrote: >>> Peter Xu <pet...@redhat.com> writes: >>> >>> > On Thu, Oct 05, 2023 at 10:37:56AM -0300, Fabiano Rosas wrote: >>> >> >> + /* >>> >> >> + * Make sure both QEMU instances will go into RECOVER stage, >>> >> >> then test >>> >> >> + * kicking them out using migrate-pause. >>> >> >> + */ >>> >> >> + wait_for_postcopy_status(from, "postcopy-recover"); >>> >> >> + wait_for_postcopy_status(to, "postcopy-recover"); >>> >> > >>> >> > Is this wait out of place? I think we're trying to resume too fast >>> >> > after >>> >> > migrate_recover(): >>> >> > >>> >> > # { >>> >> > # "error": { >>> >> > # "class": "GenericError", >>> >> > # "desc": "Cannot resume if there is no paused migration" >>> >> > # } >>> >> > # } >>> >> > >>> >> >>> >> Ugh, sorry about the long lines: >>> >> >>> >> { >>> >> "error": { >>> >> "class": "GenericError", >>> >> "desc": "Cannot resume if there is no paused migration" >>> >> } >>> >> } >>> > >>> > Sorry I didn't get you here. Could you elaborate your question? >>> > >>> >>> The test is sometimes failing with the above message. >>> >>> But indeed my question doesn't make sense. I forgot migrate_recover >>> happens on the destination. Nevermind. >>> >>> The bug is still present nonetheless. We're going into migrate_prepare >>> in some state other than POSTCOPY_PAUSED. >> >> Oh I see. Interestingly I cannot reproduce on my host, just like last >> time.. >> >> What is your setup for running the test? Anything special? Here's my >> cmdline: > > The crudest oneliner: > > for i in $(seq 1 9999); do echo "$i ============="; \ > QTEST_QEMU_BINARY=./qemu-system-x86_64 \ > ./tests/qtest/migration-test -r /x86_64/migration/postcopy/recovery || break > ; done > > I suspect my system has something specific to it that affects the timing > of the tests. But I have no idea what it could be. > > $ lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Address sizes: 39 bits physical, 48 bits virtual > Byte Order: Little Endian > CPU(s): 16 > On-line CPU(s) list: 0-15 > Vendor ID: GenuineIntel > Model name: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz > CPU family: 6 > Model: 141 > Thread(s) per core: 2 > Core(s) per socket: 8 > Socket(s): 1 > Stepping: 1 > CPU max MHz: 4800.0000 > CPU min MHz: 800.0000 > BogoMIPS: 4992.00 > >> >> $ cat reproduce.sh >> index=$1 >> loop=0 >> >> while :; do >> echo "Starting loop=$loop..." >> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test >> -p /x86_64/migration/postcopy/recovery/double-failures >> if [[ $? != 0 ]]; then >> echo "index $index REPRODUCED (loop=$loop) !" >> break >> fi >> loop=$(( loop + 1 )) >> done >> >> Survives 200+ loops and kept going. >> >> However I think I saw what's wrong here, could you help try below fixup? >> > > Sure. I won't get to it until tomorrow though.
It seems to have fixed the issue. 3500 iterations and still going.