On Thu, 24 Mar 2022 at 11:53, Laurent Vivier <lviv...@redhat.com> wrote:
>
> On 24/03/2022 12:11, Peter Maydell wrote:
> > This is a backtrace from virtio-failover-test, which had hung
> > on the s390 gitlab CI runner. Both processes were using CPU,
> > so this is some kind of livelock, not a deadlock.
> >
> > Looking more closely at the virtio-net-failover process, in the function
> > test_migrate_off_abort() we have executed 'migrate_cancel' and then go
> > into a loop waiting for 'status' to be "cancelled", with aborts if
> > it is either "failed" or "active". But the status the QEMU process
> > returns is "completed", so we loop forever waiting for a status change
> > that will never come (I assume).
> >
>
> It means the migration has been completed before we tried to cancel it.
> The test doesn't fail but is not valid.
>
> Could you try this:
>
> diff --git a/tests/qtest/virtio-net-failover.c b/tests/qtest/virtio-net-failover.c
> index 80292eecf65f..80cda4ca28ce 100644
> --- a/tests/qtest/virtio-net-failover.c
> +++ b/tests/qtest/virtio-net-failover.c
> @@ -1425,6 +1425,11 @@ static void test_migrate_off_abort(gconstpointer opaque)
>          ret = migrate_status(qts);
>
>          status = qdict_get_str(ret, "status");
> +        if (strcmp(status, "completed") == 0) {
> +            g_test_skip("Failed to cancel the migration");
> +            qobject_unref(ret);
> +            goto out;
> +        }
>          if (strcmp(status, "cancelled") == 0) {
>              qobject_unref(ret);
>              break;
> @@ -1437,6 +1442,7 @@ static void test_migrate_off_abort(gconstpointer opaque)
>      check_one_card(qts, true, "standby0", MAC_STANDBY0);
>      check_one_card(qts, true, "primary0", MAC_PRIMARY0);
>
> +out:
>      qos_object_destroy((QOSGraphObject *)vdev);
>      machine_stop(qts);
>  }

Looks plausible, but I can't currently get this hang to reproduce
(it's probably a fairly rare intermittent) so I can't really test
a fix in any meaningful way.

It looks like there are several other loops in other tests in this
file which also need to check for "completed". I would suggest
maybe using check_migration_status() instead of hand-rolling
loops here, except that that function seems to assert on an
unexpected "completed" status whereas we want the test to skip.
It could probably be improved to be usable here, though.

thanks
-- PMM
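
For illustration only, here is a minimal standalone sketch of the kind of
shared polling helper discussed above: one loop that reports "completed" as
a skip instead of asserting or spinning forever. It is not QEMU code.
wait_for_cancel(), WaitResult, StatusFn, Script and script_next() are all
hypothetical names, the callback merely stands in for the
migrate_status() + qdict_get_str() pair used in the test, and the
max_polls bound is an extra assumption that the patch above does not have.

```c
#include <string.h>

/* Outcome of waiting for a migration we asked to cancel (hypothetical). */
typedef enum {
    WAIT_CANCELLED,   /* cancel took effect: the test can continue */
    WAIT_SKIP,        /* migration completed first: the test is not valid */
    WAIT_UNEXPECTED,  /* "failed" or "active": the test should abort */
    WAIT_TIMEOUT      /* no terminal status within max_polls (assumption) */
} WaitResult;

/* Hypothetical stand-in for migrate_status() + qdict_get_str(). */
typedef const char *(*StatusFn)(void *opaque);

/* Poll until a terminal status shows up, at most max_polls times. */
static WaitResult wait_for_cancel(StatusFn next_status, void *opaque,
                                  int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        const char *status = next_status(opaque);

        if (strcmp(status, "cancelled") == 0) {
            return WAIT_CANCELLED;
        }
        if (strcmp(status, "completed") == 0) {
            return WAIT_SKIP;       /* caller would call g_test_skip() */
        }
        if (strcmp(status, "failed") == 0 ||
            strcmp(status, "active") == 0) {
            return WAIT_UNEXPECTED; /* caller would assert here */
        }
        /* Anything else is treated as transient: poll again. */
    }
    return WAIT_TIMEOUT;
}

/* Scripted status source, used only to exercise the helper. */
typedef struct {
    const char **seq;  /* statuses to hand out, in order */
    int len;           /* number of entries in seq */
    int pos;           /* index of the next status */
} Script;

static const char *script_next(void *opaque)
{
    Script *s = opaque;
    const char *status = s->seq[s->pos];

    if (s->pos < s->len - 1) {
        s->pos++;      /* stay on the last entry once we reach it */
    }
    return status;
}
```

In a real test the caller would map WAIT_SKIP to g_test_skip() followed by
the normal cleanup path, and WAIT_UNEXPECTED to an assertion, so each
hand-rolled loop in the file would shrink to one call plus a switch.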