On Fri, Sep 13, 2024 at 12:17:40PM -0300, Fabiano Rosas wrote: > Fabiano Rosas <faro...@suse.de> writes: > > > Peter Xu <pet...@redhat.com> writes: > > > >> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote: > >>> Fabiano Rosas <faro...@suse.de> writes: > >>> > >>> > Peter Xu <pet...@redhat.com> writes: > >>> > > >>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote: > >>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote: > >>> >>> > I don't think we're discussing total CI time at this point, so the > >>> >>> > math > >>> >>> > doesn't really add up. We're not looking into making the CI finish > >>> >>> > faster. We're looking into making migration-test finish faster. That > >>> >>> > would reduce timeouts in CI, speed-up make check and reduce the > >>> >>> > chance > >>> >>> > of random race conditions* affecting other people/staging runs. > >>> >>> > >>> >>> Right. The reason migration-test appears on my radar is because > >>> >>> it is very frequently the thing that shows up as "this sometimes > >>> >>> just fails or just times out and if you hit retry it goes away > >>> >>> again". That might not be migration-test's fault specifically, > >>> >>> because those retries tend to be certain CI configs (s390, > >>> >>> the i686-tci one), and I have some theories about what might be > >>> >>> causing it (e.g. build system runs 4 migration-tests in parallel, > >>> >>> which means 8 QEMU processes which is too many for the number > >>> >>> of host CPUs). But right now I look at CI job failures and my reaction > >>> >>> is "oh, it's the migration-test failing yet again" :-( > >>> >>> > >>> >>> For some examples from this week: > >>> >>> > >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144 > >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 <--------[1] > >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 <--------[2] > >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155 > >>> >> > >>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run in > >>> >> parallel. It indeed sounds like no good way to finally solve.. I don't > >>> >> also see how speeding up / reducing tests in migration test would > >>> >> help, as > >>> >> that's (from some degree..) is the same as tuning the timeout value > >>> >> bigger. > >>> >> When the tests are less it'll fit into 480s window, but maybe it's too > >>> >> quick now we wonder whether we should shrink it to e.g. 90s, but then > >>> >> it > >>> >> can timeout again when on a busy host with less capability of > >>> >> concurrency. > >>> >> > >>> >> But indeed there're two ERRORs ([1,2] above).. I collected some more > >>> >> info > >>> >> here before the log expires: > >>> >> > >>> >> =================================8<================================ > >>> >> > >>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host > >>> >> > >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373 > >>> >> > >>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test > >>> >> ERROR 144.32s killed by signal 6 SIGABRT > >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon > >>> >>>>> > >>> >>>>> G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh > >>> >>>>> > >>> >>>>> PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 > >>> >>>>> QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 > >>> >>>>> QTEST_QEMU_BINARY=./qemu-system-i386 > >>> >>>>> /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test > >>> >>>>> --tap -k > >>> >> ――――――――――――――――――――――――――――――――――――― ✀ > >>> >> ――――――――――――――――――――――――――――――――――――― > >>> >> stderr: > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from > >>> >> signal 11 (Segmentation fault) (core dumped) > >>> >> (test program exited with status code -6) > >>> >> TAP parsing error: Too few tests run (expected 53, got 39) > >>> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > >>> >> > >>> >> # Start of plain tests > >>> >> # Running /i386/migration/multifd/tcp/plain/cancel > >>> >> # Using machine type: pc-i440fx-9.2 > >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest > >>> >> unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev > >>> >> socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon > >>> >> chardev=char0,mode=control -display none -audio none -accel kvm -accel > >>> >> tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M > >>> >> -serial file:/tmp/migration-test-4112T2/src_serial -drive > >>> >> if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw > >>> >> -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel > >>> >> qtest > >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest > >>> >> unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev > >>> >> socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon > >>> >> chardev=char0,mode=control -display none -audio none -accel kvm -accel > >>> >> tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M > >>> >> -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer > >>> >> -drive > >>> >> if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw > >>> >> -device ide-hd,drive=d0,secs=1,cyls=1,heads=1 2>/dev/null -accel > >>> >> qtest > >>> >> ----------------------------------- stderr > >>> >> ----------------------------------- > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from > >>> >> signal 11 (Segmentation fault) (core dumped) > >>> >> > >>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host > >>> >> > >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152 > >>> >> > >>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test > >>> >> ERROR 381.00s killed by signal 6 SIGABRT > >>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 > >>> >>>>> QTEST_QEMU_IMG=./qemu-img > >>> >>>>> G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh > >>> >>>>> QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 > >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon > >>> >>>>> /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k > >>> >> ――――――――――――――――――――――――――――――――――――― ✀ > >>> >> ――――――――――――――――――――――――――――――――――――― > >>> >> stderr: > >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection > >>> >> was non-properly terminated. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from > >>> >> signal 11 (Segmentation fault) (core dumped) > >>> >> (test program exited with status code -6) > >>> >> TAP parsing error: Too few tests run (expected 61, got 47) > >>> >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > >>> >> > >>> >> # Start of plain tests > >>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel > >>> >> # Using machine type: pseries-9.2 > >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest > >>> >> unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev > >>> >> socket,path=/tmp/qtest-40766.qmp,id=char0 -mon > >>> >> chardev=char0,mode=control -display none -audio none -accel kvm -accel > >>> >> tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M > >>> >> -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults > >>> >> -machine > >>> >> cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, > >>> >> -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest > >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest > >>> >> unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev > >>> >> socket,path=/tmp/qtest-40766.qmp,id=char0 -mon > >>> >> chardev=char0,mode=control -display none -audio none -accel kvm -accel > >>> >> tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M > >>> >> -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer > >>> >> -nodefaults -machine > >>> >> cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, > >>> >> -bios /tmp/migration-test-H0Z1T2/bootsect 2>/dev/null -accel qtest > >>> >> ----------------------------------- stderr > >>> >> ----------------------------------- > >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection > >>> >> was non-properly terminated. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> warning: fd: migration to a file is deprecated. Use file: instead. > >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from > >>> >> signal 11 (Segmentation fault) (core dumped) > >>> >> > >>> >> (test program exited with status code -6) > >>> >> =================================8<================================ > >>> >> > >>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on > >>> >> different host / arch being tested. What is more weird is the two > >>> >> failures > >>> >> are different, the 2nd failure throw out a TLS error even though the > >>> >> test > >>> >> doesn't yet have tls involved. > >>> > > >>> > I think that's just a parallel test being cancelled prematurely, either > >>> > due to the crash or due to the timeout. > >>> > > >>> >> > >>> >> Fabiano, is this the issue you're looking at? > >>> > > >>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop > >>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It > >>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and > >>> > aarch64. > >>> > > >>> > >>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve > >>> zero page causing multiple page faults"), the multifd code started using > >>> the rb->receivedmap bitmap, which belongs to the ram code and is > >>> initialized and *freed* from the ram SaveVMHandlers. > >>> > >>> process_incoming_migration_co() ... > >>> qemu_loadvm_state() multifd_nocomp_recv() > >>> qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset() > >>> rb->receivedmap = NULL set_bit_atomic(..., > >>> rb->receivedmap) > >>> ... > >>> migration_incoming_state_destroy() > >>> multifd_recv_cleanup() > >>> multifd_recv_terminate_threads(NULL) > >>> > >>> Multifd threads are live until migration_incoming_state_destroy(), which > >>> is called some time later. > >> > >> Thanks for the debugging. Hmm I would expect loadvm should wait until all > >> ram is received somehow.. > > > > Looks like a similar issue as when we didn't have the multifd_send sync > > working correctly and ram code would run and do cleanup. > > Btw, this is hard to debug, but I bet what's happening is that the > ram_load code itself is exiting due to qemufile error. So there wouldn't > be a way to make it wait for multifd.
One more thing is I remember one of the error is not the crash but some TLS disconnection error. I wonder which one you can reproduce and why that TLS code can got kicked off in the multifd cancel test. Perhaps the memory was simply corrupted around, so bitmap ops can write to some other memory? -- Peter Xu