* Thomas Huth (th...@redhat.com) wrote:
> On 03/03/2023 13.05, Peter Maydell wrote:
> > On Fri, 3 Mar 2023 at 11:29, Thomas Huth <th...@redhat.com> wrote:
> > >
> > > On 03/03/2023 12.18, Peter Maydell wrote:
> > > > On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quint...@redhat.com> wrote:
> > > > >
> > > > > Daniel P. Berrangé <berra...@redhat.com> wrote:
> > > > > > On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> > > > > > > migration-test has been flaky for a long time, both in CI and
> > > > > > > otherwise:
> > > > > > >
> > > > > > > https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> > > > > > > (a FreeBSD job)
> > > > > > > 32/648
> > > > > > > ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status:
> > > > > > > assertion failed: (g_test_timer_elapsed() <
> > > > > > > MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> > > > > > >
> > > > > > > on a local macos x86 box:
> > > > > > >
> > > > >
> > > > > What is really weird with this failure is that:
> > > > > - it only happens on non-x86
> > > >
> > > > No, I have seen it on x86 macos, and x86 OpenBSD
> > > >
> > > > > - on code that is not arch dependent
> > > > > - on cancel, what we really do there is close fd's for the multifd
> > > > >   channel threads to get out of the recv, i.e. again, nothing that
> > > > >   should be arch dependent.
> > > >
> > > > I'm pretty sure that it tends to happen when the machine that's
> > > > running the test is heavily loaded. You probably have a race condition.
> > >
> > > I think I can second that. IIRC I've seen it a couple of times on my x86
> > > laptop when running "make check -j$(nproc) SPEED=slow" here.
> >
> > And another on-x86 failure case, just now, on the FreeBSD x86 CI job:
> > https://gitlab.com/qemu-project/qemu/-/jobs/3870165180
>
> And FWIW, I just saw this while doing "make vm-build-netbsd J=4":
>
> ▶ 31/645
> ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge:
> 'got_stop' should be FALSE ERROR

That one is kind of interesting; this is an auto converge test, so it
tries to set up the migration so that it won't finish, to check that
auto converge kicks in. Except in this case the migration *did* finish
without autoconverge (significantly) kicking in.

So I guess any of:
  a) The CPU thread never got much CPU time, so not much dirtying
     happened.
  b) The bandwidth calculations might be bad enough/coarse enough
     that it's passing the (very low) bandwidth limit due to a bad
     approximation of the bandwidth needed.
  c) The autoconverge jump happens fast enough for that loop to hit
     the got_stop in the loop time of that loop.

I guess we could:
  i) Reduce the usleep in test_migrate_auto_converge
     (so it is more likely to correctly drop out of that loop
     as soon as autoconverge kicks in)
  ii) Reduce inc_pct so that autoconverge kicks in slower
  iii) Reduce max-bandwidth in migrate_ensure_non_converge
     even further.

Dave

> 31/645 qemu:qtest+qtest-i386 / qtest-i386/migration-test
> ERROR 25.21s killed by signal 6 SIGABRT
> > > > QTEST_QEMU_BINARY=./qemu-system-i386 MALLOC_PERTURB_=35
> > > > G_TEST_DBUS_DAEMON=/home/qemu/qemu-test.fYHKFz/src/tests/dbus-vmstate-daemon.sh
> > > > QTEST_QEMU_IMG=./qemu-img
> > > > /home/qemu/qemu-test.fYHKFz/build/tests/qtest/migration-test --tap -k
> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― ✀
> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> stderr:
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> **
> ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge:
> 'got_stop' should be FALSE
>
> (test program exited with status code -6)
>
> Thomas
>
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
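
To illustrate case (c) and suggestion (i) above, here is a rough toy model of the timing, not the actual test code. It assumes the test loop samples the throttle percentage, sleeps, and then asserts that no stop event has arrived. The function name and all timestamps are made up for illustration; only the shape of the race is meant to carry over.

```python
def poll_until_throttle(throttle_at_us, stop_at_us, poll_interval_us):
    """Toy model of a poll/sleep/assert loop (hypothetical, not QEMU code).

    throttle_at_us:   simulated time at which autoconverge starts throttling
    stop_at_us:       simulated time at which the guest stop event arrives
    poll_interval_us: how long the loop sleeps between samples

    Returns "throttled" if the loop sees a non-zero throttle percentage
    before the stop event, or "got_stop" if the stop event is noticed
    first (which is where the real test's assertion would fire).
    """
    if poll_interval_us <= 0:
        raise ValueError("poll_interval_us must be positive")

    now = 0
    while True:
        # Sample the throttle percentage at the current simulated time.
        percentage = 1 if now >= throttle_at_us else 0
        # Model the sleep between samples.
        now += poll_interval_us
        # Model the "stop must not have happened yet" check after the sleep.
        if now >= stop_at_us:
            return "got_stop"
        if percentage != 0:
            return "throttled"


if __name__ == "__main__":
    # Throttling starts at 150us and the stop arrives at 180us (made-up values).
    print(poll_until_throttle(150, 180, 100))  # coarse polling
    print(poll_until_throttle(150, 180, 10))   # finer polling
```

With the made-up numbers above, the coarse interval steps over both events in one sleep and reports the stop, while the finer interval observes the throttle first. That is the argument for reducing the sleep, although it does nothing for causes (a) and (b).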