Peter Xu <pet...@redhat.com> writes:

> On Fri, Sep 13, 2024 at 12:17:40PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas <faro...@suse.de> writes:
>> 
>> > Peter Xu <pet...@redhat.com> writes:
>> >
>> >> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
>> >>> Fabiano Rosas <faro...@suse.de> writes:
>> >>> 
>> >>> > Peter Xu <pet...@redhat.com> writes:
>> >>> >
>> >>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
>> >>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <faro...@suse.de> wrote:
>> >>> >>> > I don't think we're discussing total CI time at this point, so
>> >>> >>> > the math doesn't really add up. We're not looking into making the
>> >>> >>> > CI finish faster. We're looking into making migration-test finish
>> >>> >>> > faster. That would reduce timeouts in CI, speed up make check and
>> >>> >>> > reduce the chance of random race conditions* affecting other
>> >>> >>> > people/staging runs.
>> >>> >>> 
>> >>> >>> Right. The reason migration-test appears on my radar is because
>> >>> >>> it is very frequently the thing that shows up as "this sometimes
>> >>> >>> just fails or just times out and if you hit retry it goes away
>> >>> >>> again". That might not be migration-test's fault specifically,
>> >>> >>> because those retries tend to be certain CI configs (s390,
>> >>> >>> the i686-tci one), and I have some theories about what might be
>> >>> >>> causing it (e.g. build system runs 4 migration-tests in parallel,
>> >>> >>> which means 8 QEMU processes which is too many for the number
>> >>> >>> of host CPUs). But right now I look at CI job failures and my 
>> >>> >>> reaction
>> >>> >>> is "oh, it's the migration-test failing yet again" :-(
>> >>> >>> 
>> >>> >>> For some examples from this week:
>> >>> >>> 
>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
>> >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>> >>> >>
>> >>> >> Ah right, the TIMEOUT is unfortunate, especially if tests can be run
>> >>> >> in parallel.  There indeed seems to be no good way to solve it once
>> >>> >> and for all.  I also don't see how speeding up / reducing the tests
>> >>> >> in migration-test would help, as that's (to some degree..) the same
>> >>> >> as tuning the timeout value bigger: with fewer tests the run fits
>> >>> >> into the 480s window, but then it may be so quick that we wonder
>> >>> >> whether we should shrink the timeout to e.g. 90s, and then it can
>> >>> >> time out again on a busy host with less concurrency capacity.
>> >>> >>
>> >>> >> But indeed there're two ERRORs ([1,2] above)..  I collected some more 
>> >>> >> info
>> >>> >> here before the log expires:
>> >>> >>
>> >>> >> =================================8<================================
>> >>> >>
>> >>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
>> >>> >>
>> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>> >>> >>
>> >>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test             
>> >>> >>             ERROR          144.32s   killed by signal 6 SIGABRT
>> >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon
>> >>> >>>>>  
>> >>> >>>>> G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
>> >>> >>>>>  
>> >>> >>>>> PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3
>> >>> >>>>>  QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 
>> >>> >>>>> QTEST_QEMU_BINARY=./qemu-system-i386 
>> >>> >>>>> /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test
>> >>> >>>>>  --tap -k
>> >>> >> ――――――――――――――――――――――――――――――――――――― ✀  
>> >>> >> ―――――――――――――――――――――――――――――――――――――
>> >>> >> stderr:
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from 
>> >>> >> signal 11 (Segmentation fault) (core dumped)
>> >>> >> (test program exited with status code -6)
>> >>> >> TAP parsing error: Too few tests run (expected 53, got 39)
>> >>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>> >>
>> >>> >> # Start of plain tests
>> >>> >> # Running /i386/migration/multifd/tcp/plain/cancel
>> >>> >> # Using machine type: pc-i440fx-9.2
>> >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest 
>> >>> >> unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev 
>> >>> >> socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon 
>> >>> >> chardev=char0,mode=control -display none -audio none -accel kvm 
>> >>> >> -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 
>> >>> >> 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive 
>> >>> >> if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw 
>> >>> >> -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel 
>> >>> >> qtest
>> >>> >> # starting QEMU: exec ./qemu-system-i386 -qtest 
>> >>> >> unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev 
>> >>> >> socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon 
>> >>> >> chardev=char0,mode=control -display none -audio none -accel kvm 
>> >>> >> -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 
>> >>> >> 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming 
>> >>> >> defer -drive 
>> >>> >> if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw 
>> >>> >> -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel 
>> >>> >> qtest
>> >>> >> ----------------------------------- stderr 
>> >>> >> -----------------------------------
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from 
>> >>> >> signal 11 (Segmentation fault) (core dumped)
>> >>> >>
>> >>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 
>> >>> >> host
>> >>> >>
>> >>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>> >>> >>
>> >>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test           
>> >>> >>             ERROR          381.00s   killed by signal 6 SIGABRT
>> >>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 
>> >>> >>>>> QTEST_QEMU_IMG=./qemu-img 
>> >>> >>>>> G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
>> >>> >>>>>  QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 
>> >>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon
>> >>> >>>>>  /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap 
>> >>> >>>>> -k
>> >>> >> ――――――――――――――――――――――――――――――――――――― ✀  
>> >>> >> ―――――――――――――――――――――――――――――――――――――
>> >>> >> stderr:
>> >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection 
>> >>> >> was non-properly terminated.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from 
>> >>> >> signal 11 (Segmentation fault) (core dumped)
>> >>> >> (test program exited with status code -6)
>> >>> >> TAP parsing error: Too few tests run (expected 61, got 47)
>> >>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>> >>
>> >>> >> # Start of plain tests
>> >>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
>> >>> >> # Using machine type: pseries-9.2
>> >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest 
>> >>> >> unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev 
>> >>> >> socket,path=/tmp/qtest-40766.qmp,id=char0 -mon 
>> >>> >> chardev=char0,mode=control -display none -audio none -accel kvm 
>> >>> >> -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on 
>> >>> >> -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial 
>> >>> >> -nodefaults -machine 
>> >>> >> cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, 
>> >>> >> -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> >>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest 
>> >>> >> unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev 
>> >>> >> socket,path=/tmp/qtest-40766.qmp,id=char0 -mon 
>> >>> >> chardev=char0,mode=control -display none -audio none -accel kvm 
>> >>> >> -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on 
>> >>> >> -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming 
>> >>> >> defer -nodefaults -machine 
>> >>> >> cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, 
>> >>> >> -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> >>> >> ----------------------------------- stderr 
>> >>> >> -----------------------------------
>> >>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection 
>> >>> >> was non-properly terminated.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from 
>> >>> >> signal 11 (Segmentation fault) (core dumped)
>> >>> >>
>> >>> >> (test program exited with status code -6)
>> >>> >> =================================8<================================
>> >>> >>
>> >>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing
>> >>> >> across the different hosts / arches being tested.  What is weirder
>> >>> >> is that the two failures are different: the 2nd one throws a TLS
>> >>> >> error even though the test doesn't involve TLS at that point.
>> >>> >
>> >>> > I think that's just a parallel test being cancelled prematurely, either
>> >>> > due to the crash or due to the timeout.
>> >>> >
>> >>> >>
>> >>> >> Fabiano, is this the issue you're looking at?
>> >>> >
>> >>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
>> >>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
>> >>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and
>> >>> > aarch64.
>> >>> >
>> >>> 
>> >>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
>> >>> zero page causing multiple page faults"), the multifd code started using
>> >>> the rb->receivedmap bitmap, which belongs to the ram code and is
>> >>> initialized and *freed* from the ram SaveVMHandlers.
>> >>> 
>> >>> process_incoming_migration_co()        ...
>> >>>   qemu_loadvm_state()                  multifd_nocomp_recv()
>> >>>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>> >>>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>> >>>   ...
>> >>>   migration_incoming_state_destroy()
>> >>>     multifd_recv_cleanup()
>> >>>       multifd_recv_terminate_threads(NULL)
>> >>> 
>> >>> Multifd threads are live until migration_incoming_state_destroy(), which
>> >>> is called some time later.
>> >>
>> >> Thanks for the debugging.  Hmm, I would expect loadvm to somehow wait
>> >> until all the ram has been received..
>> >
>> > Looks like a similar issue to when we didn't have the multifd_send sync
>> > working correctly and the ram code would run ahead and do cleanup.
>> 
>> Btw, this is hard to debug, but I bet what's happening is that the
>> ram_load code itself is exiting due to qemufile error. So there wouldn't
>> be a way to make it wait for multifd.
>
> One more thing: I remember one of the errors was not the crash but some TLS
> disconnection error.  I wonder which one you can reproduce, and why that TLS
> code got kicked off at all in the multifd cancel test.  Perhaps the memory
> was simply corrupted, so the bitmap ops wrote to some other memory?

I haven't seen that error. I said in another email that I think that is
due to the timeout cancelling the tests forcefully.
