Thomas Huth <th...@redhat.com> writes:

> On 19/07/2023 21.56, Milan Zamazal wrote:
>> Thomas Huth <th...@redhat.com> writes:
>>
>>> On 18/07/2023 14.55, Milan Zamazal wrote:
>>>> Thomas Huth <th...@redhat.com> writes:
>>>>
>>>>> On 11/07/2023 01.02, Michael S. Tsirkin wrote:
>>>>>> From: Milan Zamazal <mzama...@redhat.com>
>>>>>>
>>>>>> We don't have a virtio-scmi implementation in QEMU and only support a
>>>>>> vhost-user backend. This is very similar to virtio-gpio and we add the
>>>>>> same set of tests, just passing some vhost-user messages over the
>>>>>> control socket.
>>>>>>
>>>>>> Signed-off-by: Milan Zamazal <mzama...@redhat.com>
>>>>>> Acked-by: Thomas Huth <th...@redhat.com>
>>>>>> Message-Id: <20230628100524.342666-4-mzama...@redhat.com>
>>>>>> Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>>>>>> ---
>>>>>>   tests/qtest/libqos/virtio-scmi.h |  34 ++++++
>>>>>>   tests/qtest/libqos/virtio-scmi.c | 174 +++++++++++++++++++++++++++++++
>>>>>>   tests/qtest/vhost-user-test.c    |  44 ++++++++
>>>>>>   MAINTAINERS                      |   1 +
>>>>>>   tests/qtest/libqos/meson.build   |   1 +
>>>>>>   5 files changed, 254 insertions(+)
>>>>>>   create mode 100644 tests/qtest/libqos/virtio-scmi.h
>>>>>>   create mode 100644 tests/qtest/libqos/virtio-scmi.c
>>>>>
>>>>> Hi!
>>>>>
>>>>> I'm seeing some random failures with this new scmi test, so far only
>>>>> on non-x86 systems, e.g.:
>>>>>
>>>>> https://app.travis-ci.com/github/huth/qemu/jobs/606246131#L4774
>>>>>
>>>>> It also reproduces on a s390x host here, but only if I run "make check
>>>>> -j$(nproc)" - if I run the tests single-threaded, the qos-test passes
>>>>> there. Seems like there is a race somewhere in this test?
>>>>
>>>> Hmm, it's basically the same as virtio-gpio.c test, so it should be OK.
>>>> Is it possible that the two tests (virtio-gpio.c & virtio-scmi.c)
>>>> interfere with each other in some way? Is there possibly a way to
>>>> serialize them to check?
>>>
>>> I think within one qos-test, the sub-tests are already run serialized.
>>
>> I see, OK.
>>
>>> But there might be multiple qos-tests running in parallel, e.g. one
>>> for the aarch64 target and one for the ppc64 target.
>>> And indeed, I can
>>> reproduce the problem on my x86 laptop by running this in one terminal
>>> window:
>>>
>>>  for ((x=0;x<1000;x++)); do \
>>>    QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon \
>>>    G_TEST_DBUS_DAEMON=.tests/dbus-vmstate-daemon.sh \
>>>    QTEST_QEMU_BINARY=./qemu-system-ppc64 \
>>>    MALLOC_PERTURB_=188 QTEST_QEMU_IMG=./qemu-img \
>>>    tests/qtest/qos-test -p \
>>>    /ppc64/pseries/spapr-pci-host-bridge/pci-bus-spapr/pci-bus/vhost-user-scmi-pci/vhost-user-scmi/vhost-user-scmi-tests/scmi/read-guest-mem/memfile \
>>>    || break ; \
>>>  done
>>>
>>> And this in another terminal window at the same time:
>>>
>>>  for ((x=0;x<1000;x++)); do \
>>>    QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon \
>>>    G_TEST_DBUS_DAEMON=.tests/dbus-vmstate-daemon.sh \
>>>    QTEST_QEMU_BINARY=./qemu-system-aarch64 \
>>>    MALLOC_PERTURB_=188 QTEST_QEMU_IMG=./qemu-img \
>>>    tests/qtest/qos-test -p \
>>>    /aarch64/virt/generic-pcihost/pci-bus-generic/pci-bus/vhost-user-scmi-pci/vhost-user-scmi/vhost-user-scmi-tests/scmi/read-guest-mem/memfile \
>>>    || break ; \
>>>  done
>>>
>>> After a while, the aarch64 test broke with:
>>>
>>> /aarch64/virt/generic-pcihost/pci-bus-generic/pci-bus/vhost-user-scmi-pci/vhost-user-scmi/vhost-user-scmi-tests/scmi/read-guest-mem/memfile:
>>> qemu-system-aarch64: Failed to set msg fds.
>>> qemu-system-aarch64: Failed to set msg fds.
>>> qemu-system-aarch64: vhost VQ 0 ring restore failed: -22: Invalid argument (22)
>>> qemu-system-aarch64: Failed to set msg fds.
>>> qemu-system-aarch64: vhost VQ 1 ring restore failed: -22: Invalid argument (22)
>>> qemu-system-aarch64: Failed to set msg fds.
>>> qemu-system-aarch64: vhost_set_vring_call failed 22
>>> qemu-system-aarch64: Failed to set msg fds.
>>> qemu-system-aarch64: vhost_set_vring_call failed 22
>>> qemu-system-aarch64: Failed to write msg. Wrote -1 instead of 20.
>>> qemu-system-aarch64: Failed to set msg fds.
>>> qemu-system-aarch64: vhost VQ 0 ring restore failed: -22: Invalid argument (22)
>>> qemu-system-aarch64: Failed to set msg fds.
>>> qemu-system-aarch64: vhost VQ 1 ring restore failed: -22: Invalid argument (22)
>>> qemu-system-aarch64: ../../devel/qemu/hw/pci/msix.c:659: msix_unset_vector_notifiers:
>>> Assertion `dev->msix_vector_use_notifier && dev->msix_vector_release_notifier' failed.
>>> ../../devel/qemu/tests/qtest/libqtest.c:200: kill_qemu() detected QEMU death
>>> from signal 6 (Aborted) (core dumped)
>>> **
>>> ERROR:../../devel/qemu/tests/qtest/qos-test.c:191:subprocess_run_one_test: child process
>>> (/aarch64/virt/generic-pcihost/pci-bus-generic/pci-bus/vhost-user-scmi-pci/vhost-user-scmi/vhost-user-scmi-tests/scmi/read-guest-mem/memfile/subprocess
>>> [488457]) failed unexpectedly
>>> Aborted (core dumped)
>>
>> Interesting, good discovery.
>>
>>> Can you also reproduce it this way?
>>
>> Unfortunately not. I ran the loops several times and everything passed.
>> I tried to compile and run it in a different distro container and it
>> passed too. I also haven't been successful in getting any idea how the
>> processes could influence each other.
>>
>> What OS and what QEMU configure flags did you use to compile and run it?
>
> I'm using RHEL 8 on an older laptop ... and maybe the latter is
> related: I just noticed that I can also reproduce the problem by just
> running one of the above two for-loops while putting a lot of load on
> the machine otherwise, e.g. by running a "make -j$(nproc)" to rebuild
> the whole QEMU sources.
> So it's definitely a race *within* one QEMU process.
Ah, great, now I can easily reproduce it by running a kernel compilation
in the background. And I could also check that the supposed fix remedies
the problem. I'll post the patch soon.

Thank you,
Milan
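
P.S. For anyone who wants to retry this, the load-based reproduction boils
down to something like the following sketch (run from the QEMU build
directory; the test path and environment variables come from the recipe
quoted above, and the extra QTEST_QEMU_STORAGE_DAEMON_BINARY /
G_TEST_DBUS_DAEMON / MALLOC_PERTURB_ settings are omitted for brevity --
add them back if qos-test complains; any CPU-heavy job works as the
background load):

  # Generate heavy background load, e.g. by rebuilding QEMU
  # (a kernel build works just as well):
  make -j$(nproc) &

  # Loop the aarch64 scmi read-guest-mem test until it fails:
  for ((x=0;x<1000;x++)); do
    QTEST_QEMU_BINARY=./qemu-system-aarch64 \
    QTEST_QEMU_IMG=./qemu-img \
    tests/qtest/qos-test -p \
      /aarch64/virt/generic-pcihost/pci-bus-generic/pci-bus/vhost-user-scmi-pci/vhost-user-scmi/vhost-user-scmi-tests/scmi/read-guest-mem/memfile \
      || break
  done

  # Stop the background build once the test has broken:
  kill %1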